DeepSeek Releases V4 with 1M-Token Context for Agent Workloads
DeepSeek has launched the V4 model series, featuring a one-million-token context window and massive cost reductions for long-running AI agent workflows.
DeepSeek has launched a preview release of its DeepSeek-V4 model series, introducing a native one-million-token context window optimized for agent workloads. The update targets the computational and memory barriers that typically cause agents to fail during long-running tasks. The models are available now on Hugging Face under an open-source license, alongside an updated API routing framework.
Model Architecture and Variants
The V4 series ships in two primary Mixture-of-Experts (MoE) variants. DeepSeek-V4-Pro operates as the flagship model with 1.6 trillion total parameters, activating 49 billion parameters per forward pass. DeepSeek-V4-Flash provides a streamlined alternative with 284 billion total parameters, activating 13 billion.
| Model | Total Parameters | Active Parameters | Context Window (tokens) |
|---|---|---|---|
| DeepSeek-V4-Pro | 1.6T | 49B | 1,000,000 |
| DeepSeek-V4-Flash | 284B | 13B | 1,000,000 |
If you deploy models across multi-agent systems, understanding the architectural changes helps explain the performance gains. The models use Hybrid Attention, an interleaved mechanism that combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA): CSA groups tokens and selects the top-k most relevant groups, while HCA applies high-ratio dense compression to manage long-term dependencies. During training, DeepSeek also introduced Manifold-Constrained Hyper-Connections (mHC) to prevent gradient explosion at trillion-parameter scale, paired with a specialized Muon Optimizer.
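To make the CSA idea concrete, here is a minimal sketch of block-wise top-k key selection: keys are grouped into blocks, each block is compressed to a single representative, and every query attends only to the tokens in its top-k highest-scoring blocks. This is an illustrative toy, not DeepSeek's actual implementation; the function name, mean-pooling compression, and parameters are assumptions.

```python
import numpy as np

def block_sparse_attention_mask(q, k, block_size=64, top_k=4):
    """Toy sketch of block-wise top-k selection in the spirit of CSA.

    Keys are mean-pooled into per-block representatives; each query keeps
    only its top_k highest-scoring blocks. (Hypothetical helper, not
    DeepSeek's actual mechanism.)
    """
    n, d = k.shape
    n_blocks = n // block_size
    # Compress each block of keys into one representative vector.
    block_keys = k[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    # Score queries against block representatives and keep the top-k blocks.
    scores = q @ block_keys.T                        # (n_queries, n_blocks)
    top_blocks = np.argsort(-scores, axis=1)[:, :top_k]
    # Expand the block selection into a per-token attention mask.
    mask = np.zeros((q.shape[0], n), dtype=bool)
    for i, blocks in enumerate(top_blocks):
        for b in blocks:
            mask[i, b * block_size : (b + 1) * block_size] = True
    return mask

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 128))
k = rng.standard_normal((1024, 128))
mask = block_sparse_attention_mask(q, k)
print(mask.sum(axis=1))  # each query attends to top_k * block_size = 256 of 1024 keys
```

The payoff is that attention cost scales with the number of selected blocks rather than the full sequence length, which is what makes a 1M-token window tractable.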
These changes yield significant efficiency gains over DeepSeek-V3.2. At a 1M-token context, V4-Pro cuts single-token inference FLOPs to 27% of V3.2's and KV cache memory to 10%. V4-Flash requires only 10% of the FLOPs and 7% of the KV cache of the previous generation.
Benchmarks and Reasoning Capabilities
DeepSeek-V4-Pro achieves 90.2% on the Apex Shortlist benchmark and holds a Codeforces Rating of 3206. On world knowledge evaluations, the model trails Google’s Gemini-3.1-Pro and Anthropic’s Claude Opus 4.6.
The models expose three distinct reasoning modes at inference time: Non-think, Think High, and Think Max. Developers can toggle between these modes to trade inference speed for reasoning depth, depending on task complexity.
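A request might select a mode per task as sketched below. DeepSeek has not published the exact parameter name for this, so the `reasoning_mode` field here is an assumption for illustration; check the official API reference before relying on it.

```python
import json

def build_request(prompt: str, mode: str = "non-think") -> dict:
    """Build a chat payload with a reasoning mode.

    "reasoning_mode" is a hypothetical field name -- the article only names
    the three modes, not the API parameter.
    """
    assert mode in {"non-think", "think-high", "think-max"}
    return {
        "model": "deepseek-v4-pro",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_mode": mode,  # assumed field; verify against the API docs
    }

# Cheap, fast mode for routine extraction; max depth for hard reasoning.
fast = build_request("Summarize this log file.", mode="non-think")
deep = build_request("Prove this invariant holds.", mode="think-max")
print(json.dumps(deep, indent=2))
```

The practical pattern is to default agents to Non-think and escalate to a Think mode only when a step fails or is flagged as complex, since deeper reasoning costs latency and tokens.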
Hardware Optimization and API Pricing
DeepSeek optimized V4 for Huawei Ascend 950 AI chips and the Ascend supernode architecture. The API supports direct calls to deepseek-v4-pro and deepseek-v4-flash. Legacy endpoints like deepseek-chat and deepseek-reasoner currently route to V4-Flash and will be retired on July 24, 2026.
The pricing structure aggressively targets LLM API costs for high-volume processing. The V4-Flash API costs 1 yuan per million input tokens (cache miss) and 2 yuan per million output tokens. Processing a one-million-token document costs roughly $0.14 to $0.28 USD.
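The arithmetic behind that range is simple to reproduce. The sketch below assumes an exchange rate of roughly 0.14 USD per yuan, which is what the article's dollar figures imply:

```python
YUAN_PER_M_INPUT = 1.0   # V4-Flash input, cache miss
YUAN_PER_M_OUTPUT = 2.0  # V4-Flash output
USD_PER_YUAN = 0.14      # approximate rate implied by the article

def flash_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate a V4-Flash call's cost in USD from token counts."""
    yuan = (input_tokens / 1e6) * YUAN_PER_M_INPUT \
         + (output_tokens / 1e6) * YUAN_PER_M_OUTPUT
    return yuan * USD_PER_YUAN

# A 1M-token document with a short summary lands near the article's $0.14 floor.
print(round(flash_cost_usd(1_000_000, 10_000), 3))  # 0.143
```

The $0.28 ceiling corresponds to the degenerate case of a million output tokens; realistic summarization workloads sit near the input-dominated floor.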
If your production systems rely on the legacy DeepSeek models, update your API routing logic before the July 2026 retirement date. For teams building autonomous workflows, evaluate your agents against V4-Flash now to determine whether the 1M context window and lower pricing let you replace complex retrieval architectures with raw context ingestion.
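One low-risk migration pattern is to resolve legacy aliases to explicit V4 model names at a single choke point in your routing code, so the retirement date cannot silently break requests. A minimal sketch, with the mapping taken from the article:

```python
import warnings

# Legacy endpoints currently route to V4-Flash and retire on July 24, 2026.
LEGACY_TO_V4 = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-flash",
}

def resolve_model(name: str) -> str:
    """Return an explicit V4 model name, warning when a retired alias is used."""
    if name in LEGACY_TO_V4:
        warnings.warn(
            f"'{name}' is a legacy alias retiring 2026-07-24; "
            f"pin '{LEGACY_TO_V4[name]}' explicitly."
        )
        return LEGACY_TO_V4[name]
    return name

print(resolve_model("deepseek-chat"))    # deepseek-v4-flash
print(resolve_model("deepseek-v4-pro"))  # deepseek-v4-pro
```

Routing every request through a resolver like this lets you log remaining legacy usage in production and delete the mapping once the warnings stop firing.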