DeepSeek V4 Pro Trails GPT-5.5 by 8 Months in NIST Benchmarks
The Center for AI Standards and Innovation evaluated DeepSeek-V4-Pro, placing its capabilities eight months behind U.S. frontier models while matching GPT-5.
The U.S. National Institute of Standards and Technology’s Center for AI Standards and Innovation (CAISI) evaluated DeepSeek-V4-Pro, calculating an eight-month capability gap between the Chinese model and leading Western alternatives. The model matched the performance of the original GPT-5, demonstrating high competence in code generation and mathematics while trailing Claude Opus 4.6 and GPT-5.5 in aggregate scoring.
Benchmark Results and Elo Scores
CAISI used a suite of nine benchmarks across five domains, mixing public datasets with private, held-out benchmarks to limit test data contamination. On Elo scores estimated with an Item Response Theory model, DeepSeek-V4-Pro placed firmly in the previous generation’s frontier tier.
| Model | Estimated Elo Score |
|---|---|
| GPT-5.5 | 1260 ± 28 |
| Claude Opus 4.6 | 999 ± 27 |
| DeepSeek-V4-Pro | 800 ± 28 |
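One way to read a gap in Elo points is through the conventional Elo expected-score curve. The sketch below assumes the standard 400-point logistic scale used in chess ratings; the evaluation summary does not state whether CAISI's IRT-derived scores follow that scale, so the function name, the `scale` parameter, and the resulting numbers are illustrative assumptions rather than CAISI's method.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo logistic curve.

    Assumption: a 400-point scale, as in chess. CAISI's scale may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Point estimates from the table above; the reported ± intervals are ignored here.
ratings = {"GPT-5.5": 1260, "Claude Opus 4.6": 999, "DeepSeek-V4-Pro": 800}

if __name__ == "__main__":
    target = "DeepSeek-V4-Pro"
    for name, rating in ratings.items():
        if name != target:
            p = elo_expected_score(ratings[target], rating)
            print(f"{target} vs {name}: expected score {p:.2f}")
```

Under this assumption, an expected score near 0.5 would indicate parity, and values well below 0.5 indicate that the higher-rated model is expected to do better on a typical item.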
The model achieved 97% on OTIS-AIME-2025 and 96% on PUMaC 2024 for mathematics. In software engineering tasks, it scored 74% on SWE-Bench Verified and 44% on CAISI’s private PortBench evaluation. Independent evaluation found weaker results on semi-private benchmarks such as ARC-AGI-2 and the cyber-focused CTF-Archive-Diamond. This gap between public and private test sets highlights how hard it is to rule out overfitting when AI agents are evaluated on standard industry benchmarks.
Architecture and Context Capacity
DeepSeek-V4-Pro uses a Mixture-of-Experts architecture with 1.6 trillion total parameters, of which 49 billion are active during inference. A smaller variant, DeepSeek-V4-Flash, has 284 billion total and 13 billion active parameters. Both models use DeepSeek Sparse Attention to support a default 1,000,000-token context window for all users.
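The parameter counts above imply that only a small share of each model's weights is used per token. A minimal sketch of that arithmetic follows; the function and dictionary names are illustrative, and the figures are the ones quoted in this section.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of parameters active per token in a Mixture-of-Experts model."""
    return active_params_b / total_params_b


# (total, active) parameter counts in billions, as quoted in the article.
models = {
    "DeepSeek-V4-Pro": (1600.0, 49.0),
    "DeepSeek-V4-Flash": (284.0, 13.0),
}

if __name__ == "__main__":
    for name, (total, active) in models.items():
        share = active_fraction(total, active)
        print(f"{name}: {share:.1%} of parameters active per token")
```

By these figures, V4-Pro activates roughly 3% of its parameters per token and V4-Flash roughly 4.6%.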
Pricing and Task Efficiency
DeepSeek prices the V4-Pro model aggressively at $0.30 to $0.50 per one million input tokens, significantly undercutting the $5.00 baseline of Western frontier models like GPT-5.5 and Claude Opus 4.7. CAISI compared the operational efficiency of V4-Pro directly against GPT-5.4 mini. DeepSeek’s model proved more cost-efficient on five out of seven tested benchmarks, though variance was high. Depending on the specific workload, V4-Pro ranged from 53% less expensive to 41% more expensive than GPT-5.4 mini. If you operate high-volume pipelines, profile your own workloads before assuming the lower list price will reduce your LLM API costs.
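To make the list-price gap concrete, the sketch below computes input-token cost for a hypothetical monthly volume. The workload size and all names are assumptions for illustration; the calculation covers input tokens only, since output pricing is not quoted here, and it does not capture the per-task variance CAISI measured.

```python
def input_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a number of input tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical workload: 2 billion input tokens per month (an assumption).
MONTHLY_INPUT_TOKENS = 2_000_000_000

# Input prices per million tokens, as quoted in the article.
V4_PRO_LOW, V4_PRO_HIGH = 0.30, 0.50
FRONTIER_BASELINE = 5.00

if __name__ == "__main__":
    low = input_cost_usd(MONTHLY_INPUT_TOKENS, V4_PRO_LOW)
    high = input_cost_usd(MONTHLY_INPUT_TOKENS, V4_PRO_HIGH)
    frontier = input_cost_usd(MONTHLY_INPUT_TOKENS, FRONTIER_BASELINE)
    print(f"V4-Pro input cost:   ${low:,.0f} to ${high:,.0f}")
    print(f"Frontier input cost: ${frontier:,.0f}")
```

List price is only one factor: a model that needs more tokens or more retries to finish a task can cost more per completed task even at a lower per-token rate, which is consistent with the spread reported against GPT-5.4 mini.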
Organizations building automated workflows must weigh DeepSeek-V4-Pro’s mathematical competence and low base pricing against its weaker performance on novel, unobserved problems. Route routine coding and math generation tasks to V4-Pro to capture cost savings, while reserving frontier U.S. models for complex cyber and reasoning workloads.
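The routing recommendation can be sketched as a simple lookup by task type. Everything named below is hypothetical: the task categories, the model identifier strings, and the `choose_model` function are placeholders, not real API model names or a vendor SDK.

```python
from enum import Enum


class TaskKind(Enum):
    CODING = "coding"
    MATH = "math"
    CYBER = "cyber"
    NOVEL_REASONING = "novel_reasoning"


# Placeholder identifiers (assumptions); substitute your provider's model names.
COST_OPTIMIZED_MODEL = "deepseek-v4-pro"
FRONTIER_MODEL = "frontier-us-model"

# Task types the article suggests sending to the cheaper model.
ROUTINE_KINDS = frozenset({TaskKind.CODING, TaskKind.MATH})


def choose_model(kind: TaskKind) -> str:
    """Route routine coding and math to the cheaper model, everything else to frontier."""
    return COST_OPTIMIZED_MODEL if kind in ROUTINE_KINDS else FRONTIER_MODEL


if __name__ == "__main__":
    for kind in TaskKind:
        print(f"{kind.value}: {choose_model(kind)}")
```

A static table like this is only a starting point; in practice the split would be tuned against measured cost and accuracy on your own tasks.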