DeepSeek V4 Pro Trails GPT-5.5 by 8 Months in NIST Benchmarks
The U.S. Center for AI Standards and Innovation evaluated DeepSeek-V4-Pro, placing the model roughly eight months behind U.S. frontier systems and on par with the original GPT-5.
The U.S. National Institute of Standards and Technology’s Center for AI Standards and Innovation (CAISI) evaluated DeepSeek-V4-Pro, estimating an eight-month capability gap between the Chinese model and leading Western alternatives. The model matched the performance of the original GPT-5, demonstrating strong competence in code generation and mathematics while trailing Claude Opus 4.6 and GPT-5.5 in aggregate scoring.
Benchmark Results and Elo Scores
CAISI utilized a suite of nine benchmarks across five domains, mixing public datasets with private, held-out benchmarks to prevent test data contamination. The center then fit an Item Response Theory model to estimate Elo scores, which placed DeepSeek-V4-Pro firmly in the previous generation’s frontier tier (a rough sketch of this scoring approach follows the table below).
| Model | Estimated Elo Score |
|---|---|
| GPT-5.5 | 1260 ± 28 |
| Claude Opus 4.6 | 999 ± 27 |
| DeepSeek-V4-Pro | 800 ± 28 |
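CAISI has not published its exact IRT specification, so the following is only an illustration of the general approach: fit a one-parameter logistic (Rasch) model on the Elo scale to per-item pass/fail outcomes and read off the ability estimate. All item difficulties and outcomes below are placeholder values, not CAISI data.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical per-item outcomes: 1 = model solved the item, 0 = failed.
# Item difficulties would normally be fit jointly across many models;
# here they are assumed known, purely for illustration.
outcomes = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
difficulties = np.array([900, 950, 1300, 800, 1250, 1000, 1100, 1350, 700, 1050])

SCALE = 400 / np.log(10)  # Elo logistic scale: 400 points ~ 10:1 odds

def neg_log_likelihood(theta):
    """Negative log-likelihood of a 1PL (Rasch) model on the Elo scale."""
    p = 1.0 / (1.0 + np.exp(-(theta - difficulties) / SCALE))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=np.array([1000.0]), method="Nelder-Mead")
print(f"Estimated ability (Elo-scaled): {result.x[0]:.0f}")
```

In a real evaluation, the reported ± values would come from the curvature of this likelihood (or a bootstrap), which is why each Elo estimate in the table carries an uncertainty band.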
In mathematics, the model scored 97% on OTIS-AIME-2025 and 96% on PUMaC 2024. In software engineering, it scored 74% on SWE-Bench Verified but only 44% on CAISI’s private PortBench evaluation. Independent evaluation also revealed weaker performance on semi-private benchmarks such as ARC-AGI-2 and the cyber-focused CTF-Archive-Diamond. The gap between public and private test sets highlights how difficult it is to prevent overfitting when evaluating AI systems against standard industry benchmarks.
Architecture and Context Capacity
DeepSeek-V4-Pro uses a Mixture-of-Experts architecture with 1.6 trillion total parameters, of which 49 billion are active per token during inference. A smaller variant, DeepSeek-V4-Flash, has 284 billion total parameters and 13 billion active. Both models use DeepSeek Sparse Attention to support a 1,000,000-token context window by default for all users.
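Sparse activation is what makes those totals affordable: per-token forward-pass compute scales with active parameters, not total parameters. The sketch below applies the common back-of-the-envelope estimate of roughly 2 FLOPs per active parameter per token; that approximation (which ignores attention and routing overhead) is our assumption, not a DeepSeek figure.

```python
# Rough per-token compute comparison using the published parameter counts.
# Assumption: forward-pass FLOPs ≈ 2 * active_parameters per token.

MODELS = {
    "DeepSeek-V4-Pro":   {"total": 1.6e12, "active": 49e9},
    "DeepSeek-V4-Flash": {"total": 284e9,  "active": 13e9},
}

for name, p in MODELS.items():
    flops_per_token = 2 * p["active"]
    sparsity = p["active"] / p["total"]
    print(f"{name}: ~{flops_per_token / 1e9:.0f} GFLOPs/token, "
          f"{sparsity:.1%} of parameters active per token")
```

By this estimate, V4-Pro activates about 3% of its weights per token, which is the main lever behind the pricing discussed next.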
Pricing and Task Efficiency
DeepSeek prices V4-Pro aggressively at $0.30 to $0.50 per million input tokens, significantly undercutting the $5.00 baseline of Western frontier models like GPT-5.5 and Claude Opus 4.6. CAISI compared the operational efficiency of V4-Pro directly against GPT-5.4 mini: DeepSeek’s model proved more cost-efficient on five of seven tested benchmarks, though variance was high. Depending on the specific workload, V4-Pro ranged from 53% less expensive to 41% more expensive than GPT-5.4 mini. Teams operating high-volume pipelines will need to profile their own workloads carefully before counting on reduced LLM API costs.
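The practical metric is cost per completed task, not list price per token, since a cheaper model that consumes more tokens per task can end up more expensive. The sketch below makes that comparison concrete; the GPT-5.4 mini prices, the V4-Pro output price, and all per-task token counts are illustrative assumptions, not published figures.

```python
# Per-task cost comparison under assumed prices and token consumption.
# All numbers below are placeholders for profiling your own workloads.

def task_cost(input_toks, output_toks, in_price, out_price):
    """Dollar cost of one task; prices are per 1M tokens."""
    return (input_toks * in_price + output_toks * out_price) / 1e6

# Scenario: V4-Pro is cheaper per token but (hypothetically) emits more
# output tokens on a hard task than GPT-5.4 mini does.
v4_cost   = task_cost(input_toks=40_000, output_toks=12_000, in_price=0.50, out_price=1.50)
mini_cost = task_cost(input_toks=40_000, output_toks=6_000,  in_price=1.00, out_price=4.00)

print(f"V4-Pro:       ${v4_cost:.4f}/task")
print(f"GPT-5.4 mini: ${mini_cost:.4f}/task")
print(f"Relative: V4-Pro is {v4_cost / mini_cost - 1:+.0%} vs. mini on this workload")
```

Running this kind of calculation per workload is what explains how the same model can land anywhere from 53% cheaper to 41% more expensive.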
Organizations building automated workflows must weigh DeepSeek-V4-Pro’s mathematical competence and low base pricing against its weaker performance on novel, unobserved problems. Route routine coding and math generation tasks to V4-Pro to capture cost savings, while reserving frontier U.S. models for complex cyber and reasoning workloads.
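One way to act on that recommendation is a simple category-based router in the orchestration layer. The model identifiers, category taxonomy, and fallback choice below are illustrative assumptions; a production router would key off task metadata or a lightweight classifier.

```python
# Minimal sketch of category-based model routing. Model names and the
# category labels are assumptions for illustration, not a vendor API.

ROUTES = {
    "code_generation": "deepseek-v4-pro",   # cheap, strong on SWE-Bench-style tasks
    "math":            "deepseek-v4-pro",   # near-saturated competition math scores
    "cyber":           "gpt-5.5",           # weaker V4-Pro showing on CTF-style tasks
    "novel_reasoning": "claude-opus-4.6",   # frontier models for unobserved problems
}

def route(task_category: str) -> str:
    """Return the model for a task, defaulting to a frontier model when unsure."""
    return ROUTES.get(task_category, "gpt-5.5")

if __name__ == "__main__":
    for category in ("math", "cyber", "unknown"):
        print(f"{category} -> {route(category)}")
```

Defaulting unknown categories to a frontier model trades cost for safety, which matches the evaluation's finding that V4-Pro underperforms specifically on novel problems.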