DeepSeek Releases V4 with 1M-Token Context for Agent Workloads
DeepSeek has launched the V4 model series, featuring a one-million-token context window and massive cost reductions for long-running AI agent workflows.
DeepSeek has launched a preview release of its DeepSeek-V4 model series, introducing a native one-million-token context window optimized for agent workloads. The update targets the computational and memory barriers that typically cause agents to fail during long-running tasks. The models are available now on Hugging Face under an open-source license, alongside an updated API routing framework.
Model Architecture and Variants
The V4 series ships in two primary Mixture-of-Experts (MoE) variants. DeepSeek-V4-Pro operates as the flagship model with 1.6 trillion total parameters, activating 49 billion parameters per forward pass. DeepSeek-V4-Flash provides a streamlined alternative with 284 billion total parameters, activating 13 billion.
| Model | Total Parameters | Active Parameters | Context Window (tokens) |
|---|---|---|---|
| DeepSeek-V4-Pro | 1.6T | 49B | 1,000,000 |
| DeepSeek-V4-Flash | 284B | 13B | 1,000,000 |
If you deploy models across multi-agent systems, understanding the architectural changes helps explain the performance gains. The models use Hybrid Attention, an interleaved mechanism that combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA): CSA groups tokens and selects the top-k most relevant groups, while HCA applies high-ratio dense compression to manage long-term dependencies. During training, DeepSeek also introduced Manifold-Constrained Hyper-Connections (mHC) to prevent gradient explosion at trillion-parameter scale, paired with a specialized Muon Optimizer.
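To make the CSA idea concrete, here is a minimal sketch of block-wise top-k key selection: keys are grouped into blocks, each block is compressed to a single representative, and every query attends only to the tokens in its top-k highest-scoring blocks. This is an illustrative toy, not DeepSeek's actual implementation; the function name, mean-pooling compression, and parameters are assumptions.

```python
import numpy as np

def block_sparse_attention_mask(q, k, block_size=64, top_k=4):
    """Toy sketch of block-wise top-k selection in the spirit of CSA.

    Keys are mean-pooled into per-block representatives; each query keeps
    only its top_k highest-scoring blocks. (Hypothetical helper, not
    DeepSeek's actual mechanism.)
    """
    n, d = k.shape
    n_blocks = n // block_size
    # Compress each block of keys into one representative vector.
    block_keys = k[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    # Score queries against block representatives and keep the top-k blocks.
    scores = q @ block_keys.T                        # (n_queries, n_blocks)
    top_blocks = np.argsort(-scores, axis=1)[:, :top_k]
    # Expand the block selection into a per-token attention mask.
    mask = np.zeros((q.shape[0], n), dtype=bool)
    for i, blocks in enumerate(top_blocks):
        for b in blocks:
            mask[i, b * block_size : (b + 1) * block_size] = True
    return mask

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 128))
k = rng.standard_normal((1024, 128))
mask = block_sparse_attention_mask(q, k)
print(mask.sum(axis=1))  # each query attends to top_k * block_size = 256 of 1024 keys
```

The payoff is that attention cost scales with the number of selected blocks rather than the full sequence length, which is what makes a 1M-token window tractable.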
These changes yield significant efficiency gains over DeepSeek-V3.2. At a 1M-token context, V4-Pro cuts single-token inference FLOPs to 27% of V3.2's and KV cache memory to 10%. V4-Flash requires only 10% of the FLOPs and 7% of the KV cache of the previous generation.
Benchmarks and Reasoning Capabilities
DeepSeek-V4-Pro achieves 90.2% on the Apex Shortlist benchmark and holds a Codeforces Rating of 3206. On world knowledge evaluations, the model trails Google’s Gemini-3.1-Pro and Anthropic’s Claude Opus 4.6.
The models expose three distinct reasoning modes at inference time: Non-think, Think High, and Think Max. Developers can toggle between these modes to trade inference speed for reasoning depth, depending on task complexity.
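A request might select a mode per task as sketched below. DeepSeek has not published the exact parameter name for this, so the `reasoning_mode` field here is an assumption for illustration; check the official API reference before relying on it.

```python
import json

def build_request(prompt: str, mode: str = "non-think") -> dict:
    """Build a chat payload with a reasoning mode.

    "reasoning_mode" is a hypothetical field name -- the article only names
    the three modes, not the API parameter.
    """
    assert mode in {"non-think", "think-high", "think-max"}
    return {
        "model": "deepseek-v4-pro",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_mode": mode,  # assumed field; verify against the API docs
    }

# Cheap, fast mode for routine extraction; max depth for hard reasoning.
fast = build_request("Summarize this log file.", mode="non-think")
deep = build_request("Prove this invariant holds.", mode="think-max")
print(json.dumps(deep, indent=2))
```

The practical pattern is to default agents to Non-think and escalate to a Think mode only when a step fails or is flagged as complex, since deeper reasoning costs latency and tokens.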
Hardware Optimization and API Pricing
DeepSeek optimized V4 for Huawei Ascend 950 AI chips and the Ascend supernode architecture. The API supports direct calls to deepseek-v4-pro and deepseek-v4-flash. Legacy endpoints like deepseek-chat and deepseek-reasoner currently route to V4-Flash and will be retired on July 24, 2026.
The pricing structure aggressively targets LLM API costs for high-volume processing. The V4-Flash API costs 1 yuan per million input tokens (cache miss) and 2 yuan per million output tokens. Processing a one-million-token document costs roughly $0.14 to $0.28 USD.
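The arithmetic behind that range is simple to reproduce. The sketch below assumes an exchange rate of roughly 0.14 USD per yuan, which is what the article's dollar figures imply:

```python
YUAN_PER_M_INPUT = 1.0   # V4-Flash input, cache miss
YUAN_PER_M_OUTPUT = 2.0  # V4-Flash output
USD_PER_YUAN = 0.14      # approximate rate implied by the article

def flash_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate a V4-Flash call's cost in USD from token counts."""
    yuan = (input_tokens / 1e6) * YUAN_PER_M_INPUT \
         + (output_tokens / 1e6) * YUAN_PER_M_OUTPUT
    return yuan * USD_PER_YUAN

# A 1M-token document with a short summary lands near the article's $0.14 floor.
print(round(flash_cost_usd(1_000_000, 10_000), 3))  # 0.143
```

The $0.28 ceiling corresponds to the degenerate case of a million output tokens; realistic summarization workloads sit near the input-dominated floor.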
If your production systems rely on the legacy DeepSeek models, update your API routing logic before the July 2026 retirement date. For teams building autonomous workflows, evaluate your agents against V4-Flash now to determine whether the 1M context window and lower pricing let you replace complex retrieval architectures with raw context ingestion.
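One low-risk migration pattern is to resolve legacy aliases to explicit V4 model names at a single choke point in your routing code, so the retirement date cannot silently break requests. A minimal sketch, with the mapping taken from the article:

```python
import warnings

# Legacy endpoints currently route to V4-Flash and retire on July 24, 2026.
LEGACY_TO_V4 = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-flash",
}

def resolve_model(name: str) -> str:
    """Return an explicit V4 model name, warning when a retired alias is used."""
    if name in LEGACY_TO_V4:
        warnings.warn(
            f"'{name}' is a legacy alias retiring 2026-07-24; "
            f"pin '{LEGACY_TO_V4[name]}' explicitly."
        )
        return LEGACY_TO_V4[name]
    return name

print(resolve_model("deepseek-chat"))    # deepseek-v4-flash
print(resolve_model("deepseek-v4-pro"))  # deepseek-v4-pro
```

Routing every request through a resolver like this lets you log remaining legacy usage in production and delete the mapping once the warnings stop firing.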