FrontierMath Tier 4 Record Falls to DeepMind Co-Mathematician
Google DeepMind's AI Co-Mathematician agent workbench more than doubled the baseline Gemini 3.1 Pro score, reaching 48% on the FrontierMath Tier 4 benchmark.
Google DeepMind has introduced a collaborative research system, built on the Gemini 3.1 model family, that more than doubles baseline performance on advanced mathematics. The AI Co-Mathematician research describes an agentic workbench designed for research-level problem solving. By preserving iterative reasoning and deploying specialized sub-agents, the system sets a new state of the art on Epoch AI's mathematical benchmarks.
Benchmark Results
The AI Co-Mathematician was evaluated on the FrontierMath Tier 4 benchmark, a suite of 50 original, non-public research problems. The system solved 23 of the benchmark's 48 valid problems, a 48% success rate. This result demonstrates the impact of multi-agent coordination on hard reasoning tasks: the underlying base model alone scores 19%.
| System | FrontierMath Tier 4 Score |
|---|---|
| AI Co-Mathematician (Gemini 3.1) | 48.0% |
| Claude Opus 4.7 (Adaptive Mode) | 43.8% |
| GPT-5.5 Pro | 39.6% |
| Claude Opus 4.7 (Standard) | 22.9% |
| Gemini 3.1 Pro (Base) | 19.0% |
On a separate internal evaluation of 100 research problems with code-checkable answers, the agentic system scored 87%, compared to 70% for Gemini 3.1 Deep Think and 57% for the baseline Gemini 3.1 Pro.
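Code-checkable here means a grader can recompute each problem's final answer programmatically and compare it exactly, with no human judging involved. Below is a minimal sketch of that kind of check on a toy problem; the function names and grading flow are assumptions for illustration, not DeepMind's actual harness:

```python
from fractions import Fraction

# Toy example of a "code-checkable" problem: the grader recomputes the
# ground-truth value and compares it to the agent's final answer exactly
# (rational arithmetic, no fuzzy string matching or float tolerance).

def ground_truth() -> Fraction:
    # e.g. sum of 1/k^2 for k = 1..10, recomputable by the grader
    return sum(Fraction(1, k * k) for k in range(1, 11))

def check(candidate: Fraction) -> bool:
    # Exact equality: symbolic/rational arithmetic avoids float noise.
    return candidate == ground_truth()

if __name__ == "__main__":
    agent_answer = Fraction(1968329, 1270080)  # value reported by the agent
    print("PASS" if check(agent_answer) else "FAIL")
```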
Technical Architecture
The workbench replaces the traditional chatbot interface with a hierarchical agent structure designed for long-running exploration. A Project Coordinator breaks high-level goals into parallel workstreams. Dedicated Workstream Agents then execute specific tasks, including conducting literature reviews, developing code libraries, and searching for counterexamples.
Reviewer Agents provide automated peer review throughout the process: they check outputs for logical consistency, verify that invoked theorems' assumptions are satisfied, and validate citations. The architecture also relies on a stateful workspace that preserves failed attempts and messy iterations. Instead of discarding incorrect paths, the system lets human researchers inspect why specific mathematical approaches failed and redirect the agents accordingly.
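The coordinator-workstream-reviewer split is a hierarchical multi-agent pattern that can be sketched in a few dozen lines. The following skeleton assumes a generic `llm` callable; every class and function name is hypothetical, since DeepMind has not published the workbench's code:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the coordinator -> workstream -> reviewer loop.
# All names are hypothetical; this is not DeepMind's implementation.

@dataclass
class Attempt:
    task: str
    output: str
    accepted: bool
    reviewer_notes: str

@dataclass
class Workspace:
    # Stateful workspace: failed attempts are preserved, not discarded,
    # so a human can later inspect why an approach was rejected.
    attempts: list[Attempt] = field(default_factory=list)

def run_workstream(task: str, workspace: Workspace, llm) -> Attempt | None:
    MAX_ROUNDS = 5  # guard against non-terminating review loops
    feedback = ""
    for _ in range(MAX_ROUNDS):
        output = llm(f"Task: {task}\nReviewer feedback so far: {feedback}")
        notes = llm("Reply ACCEPT or REJECT with reasons. Review this for "
                    f"logical gaps and unmet theorem assumptions:\n{output}")
        accepted = notes.strip().startswith("ACCEPT")
        attempt = Attempt(task, output, accepted, notes)
        workspace.attempts.append(attempt)  # keep even rejected attempts
        if accepted:
            return attempt
        feedback = notes
    return None  # give up and escalate to the human mathematician

def coordinate(goal: str, llm) -> Workspace:
    workspace = Workspace()
    # Coordinator splits the goal into workstreams (literature review,
    # code libraries, counterexample search, ...).
    tasks = llm(f"Split into subtasks, one per line: {goal}").splitlines()
    for task in tasks:
        run_workstream(task, workspace, llm)
    return workspace
```

The design choice mirrored here is that `Workspace.attempts` retains rejected outputs, which is exactly what lets a human salvage a "flawed but strategically brilliant" dead end like the one described in the next section.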
Solving Open Problems
DeepMind designed the system around a mathematician-in-the-loop paradigm. The research details several instances where professional mathematicians used the workbench to resolve open problems.
University of Oxford Professor Marc Lackenby used the system to resolve Problem 21.10 in the Kourovka Notebook, a longstanding open problem in group theory concerning finite presentations. Lackenby spotted a flawed but strategically brilliant proof in a rejected AI output, which he then corrected by hand and validated with the system. Mathematician Gergely Bérczi used the workbench to prove a conjecture about the Stirling coefficients of symmetric power representations, while Semon Rezchikov used it to identify a key lemma in Hamiltonian systems, steering him away from a dead-end approach.
Production Implications
The system operates without a hard token limit, consuming a compute envelope comparable to a long AI-assisted software engineering session. This continuous-compute model shifts the focus from single-shot inference to sustained agentic reasoning.
If you build complex agent workflows, the DeepMind research highlights specific failure modes to monitor when evaluating and testing AI agents. The researchers noted a persistent reviewer-pleasing bias, where generation agents converged on flawed but plausible-looking arguments to satisfy the automated reviewers. The system also occasionally fell into non-terminating review loops. When deploying agents for high-stakes reasoning, structure your evaluation metrics to detect plausible but mathematically incorrect logic.
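Both failure modes lend themselves to cheap structural guards. Here is a hedged sketch of two such checks; the names and thresholds are illustrative assumptions, not values from the paper:

```python
import hashlib

# Two illustrative guards for the failure modes above.

def detect_review_loop(review_history: list[str], window: int = 3) -> bool:
    """Flag a non-terminating review loop: identical reviewer feedback
    (by content hash) recurring within the most recent window."""
    digests = [hashlib.sha256(r.encode()).hexdigest() for r in review_history]
    recent = digests[-window:]
    return len(recent) != len(set(recent))

def independent_check(argument: str, verifiers: list) -> bool:
    """Counter reviewer-pleasing bias: require a majority vote from
    verifiers that never saw the generator-reviewer dialogue."""
    votes = [verifier(argument) for verifier in verifiers]
    return sum(votes) > len(votes) // 2
```

The intuition behind the quorum check: verifiers that never saw the generator-reviewer exchange, ideally drawn from a different model family, are less likely to share the reviewer-pleasing drift the researchers observed.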