DeepSeek V4 Pro Trails GPT-5.5 by 8 Months in NIST Benchmarks

The U.S. National Institute of Standards and Technology’s Center for AI Standards and Innovation (CAISI) evaluated DeepSeek-V4-Pro and estimated an eight-month capability gap between the Chinese model and leading Western alternatives. The model matched the performance of the original GPT-5 and showed strong results in code generation and mathematics, but it trailed Claude Opus 4.6 and GPT-5.5 in aggregate scoring.

Benchmark Results and Elo Scores

CAISI used a suite of nine benchmarks across five domains, mixing public datasets with private, held-out benchmarks to guard against test data contamination. The agency estimated Elo scores with an Item Response Theory model, and the results placed DeepSeek-V4-Pro firmly in the previous generation’s frontier tier.

Model	Estimated Elo Score
GPT-5.5	1260 ± 28
Claude Opus 4.6	999 ± 27
DeepSeek-V4-Pro	800 ± 28
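
To give a rough sense of what these rating gaps would mean, the sketch below applies the classic Elo expected-score formula to the point estimates in the table. This is an assumption for illustration only: the report derives its scores from an Item Response Theory model, and the article does not say that they sit on the conventional 400-point Elo scale, so the outputs should be read as indicative rather than as CAISI's own figures. The function name `expected_score` and the `scale` parameter are illustrative.

```python
# Classic Elo expected-score formula applied to the table's point estimates.
# Assumption: a conventional 400-point scale. The source does not confirm
# which scale CAISI's IRT-derived Elo estimates use.

def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the classic Elo model (0.0 to 1.0)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# Point estimates from the table above (uncertainty intervals omitted).
ELO = {
    "GPT-5.5": 1260,
    "Claude Opus 4.6": 999,
    "DeepSeek-V4-Pro": 800,
}

if __name__ == "__main__":
    baseline = "DeepSeek-V4-Pro"
    for name, rating in ELO.items():
        if name == baseline:
            continue
        gap = rating - ELO[baseline]
        share = expected_score(rating, ELO[baseline])
        print(f"{name} vs {baseline}: gap {gap} points, expected score {share:.2f}")
```

Because the published intervals are roughly ± 28 points, any figure derived this way carries the same uncertainty.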

In mathematics, the model achieved 97% on OTIS-AIME-2025 and 96% on PUMaC 2024. In software engineering tasks, it scored 74% on SWE-Bench Verified and 44% on CAISI’s private PortBench evaluation. Independent evaluation showed lower performance on semi-private benchmarks such as ARC-AGI-2 and the cyber-focused CTF-Archive-Diamond. The gap between public and held-out test sets illustrates how hard it is to rule out overfitting when AI agents are evaluated on standard industry benchmarks.

Architecture and Context Capacity

DeepSeek-V4-Pro uses a Mixture-of-Experts architecture with 1.6 trillion total parameters, of which 49 billion are active during inference. A smaller variant, DeepSeek-V4-Flash, has 284 billion total parameters and 13 billion active parameters. Both models use DeepSeek Sparse Attention to support a default 1,000,000-token context window for all users.
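
The parameter counts above imply that only a small share of each model's weights is used for any given token. A minimal sketch of that arithmetic follows; the `MoEModel` class and its field names are illustrative and not part of any DeepSeek tooling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoEModel:
    """Parameter counts in billions, as quoted in the article."""
    name: str
    total_params_b: float
    active_params_b: float

    @property
    def active_fraction(self) -> float:
        # Share of total parameters active during inference.
        return self.active_params_b / self.total_params_b

MODELS = [
    MoEModel("DeepSeek-V4-Pro", total_params_b=1600.0, active_params_b=49.0),
    MoEModel("DeepSeek-V4-Flash", total_params_b=284.0, active_params_b=13.0),
]

if __name__ == "__main__":
    for model in MODELS:
        print(f"{model.name}: {model.active_fraction:.1%} of parameters active")
```

On these figures, V4-Pro activates roughly 3% of its parameters per token and V4-Flash roughly 4.6%.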

Pricing and Task Efficiency

DeepSeek prices the V4-Pro model aggressively at $0.30 to $0.50 per one million input tokens, significantly undercutting the $5.00 baseline of Western frontier models like GPT-5.5 and Claude Opus 4.7. CAISI compared the operational efficiency of V4-Pro directly against GPT-5.4 mini. DeepSeek’s model proved more cost-efficient on five out of seven tested benchmarks, though variance was high: depending on the workload, V4-Pro ranged from 53% less expensive to 41% more expensive than GPT-5.4 mini. If you operate high-volume pipelines, profile costs on your own workloads before assuming that the lower list price will reduce your LLM API spend.
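
A lower price per token does not guarantee a lower cost per task, because models can consume different numbers of tokens on the same job. The sketch below shows one way to profile this under stated assumptions: the helper names `task_cost` and `relative_cost` are hypothetical, and every token count and price in the example is a placeholder to be replaced with measurements from your own workload, not a figure from the CAISI report.

```python
def task_cost(input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one task, given token usage and prices per million tokens."""
    return ((input_tokens / 1_000_000) * price_in_per_m
            + (output_tokens / 1_000_000) * price_out_per_m)

def relative_cost(candidate_cost: float, baseline_cost: float) -> float:
    """Signed difference versus the baseline: -0.53 means 53% less expensive."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (candidate_cost - baseline_cost) / baseline_cost

if __name__ == "__main__":
    # Placeholder measurements: (input_tokens, output_tokens) per task.
    candidate_usage = (120_000, 30_000)
    baseline_usage = (80_000, 10_000)
    # Placeholder prices in dollars per million tokens (input, output).
    candidate_prices = (0.40, 1.00)
    baseline_prices = (0.50, 2.00)

    candidate = task_cost(*candidate_usage, *candidate_prices)
    baseline = task_cost(*baseline_usage, *baseline_prices)
    print(f"candidate ${candidate:.4f}, baseline ${baseline:.4f}, "
          f"relative {relative_cost(candidate, baseline):+.0%}")
```

Running the same comparison per workload category is what exposes the kind of spread the report describes, where the cheaper model per token is sometimes the more expensive model per task.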

Organizations building automated workflows must weigh DeepSeek-V4-Pro’s mathematical competence and low base pricing against its weaker performance on novel, unobserved problems. Route routine coding and math generation tasks to V4-Pro to capture cost savings, while reserving frontier U.S. models for complex cyber and reasoning workloads.
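
The routing advice above can be sketched as a simple category-based dispatcher. Everything here is illustrative: the category labels are assumptions about how a pipeline might tag its tasks, and the model strings are placeholders rather than real API identifiers.

```python
# Hypothetical placeholders, not real API model identifiers.
CHEAP_MODEL = "deepseek-v4-pro"
FRONTIER_MODEL = "frontier-model"

# Task categories the article suggests sending to the cheaper model.
CHEAP_CATEGORIES = frozenset({"coding", "math"})

def route(task_category: str) -> str:
    """Return the model tier for a task category; unknown categories go to the frontier tier."""
    normalized = task_category.strip().lower()
    return CHEAP_MODEL if normalized in CHEAP_CATEGORIES else FRONTIER_MODEL

if __name__ == "__main__":
    for category in ("coding", "math", "cyber", "reasoning"):
        print(f"{category} -> {route(category)}")
```

Defaulting unknown categories to the frontier tier is a conservative design choice: it trades some cost for a lower risk of sending novel problems to the weaker model.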
