FrontierMath Tier 4 Record Falls to DeepMind Co-Mathematician
Google DeepMind's AI Co-Mathematician agent workbench more than doubled the baseline Gemini 3.1 Pro score, reaching 48% on the FrontierMath Tier 4 benchmark.
Google DeepMind has introduced a collaborative research system, built on the Gemini 3.1 model family, that more than doubles baseline performance on advanced mathematics. The AI Co-Mathematician research describes an agentic workbench designed for research-level problem solving. By preserving iterative reasoning and deploying specialized sub-agents, the system sets a new state of the art on Epoch AI's mathematical benchmarks.
Benchmark Results
The AI Co-Mathematician was evaluated on the FrontierMath Tier 4 benchmark, a suite of 50 original, non-public research problems. The system solved 23 of the benchmark's 48 valid problems, a 48% success rate. This result demonstrates the impact of multi-agent coordination on hard reasoning tasks: the underlying base model alone scores 19%.
| System | FrontierMath Tier 4 Score |
|---|---|
| AI Co-Mathematician (Gemini 3.1) | 48.0% |
| Claude Opus 4.7 (Adaptive Mode) | 43.8% |
| GPT-5.5 Pro | 39.6% |
| Claude Opus 4.7 (Standard) | 22.9% |
| Gemini 3.1 Pro (Base) | 19.0% |
On a separate internal evaluation of 100 research problems with code-checkable answers, the agentic system scored 87%, compared to 70% for Gemini 3.1 Deep Think and 57% for the baseline Gemini 3.1 Pro.
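Code-checkable here means a grader can recompute each problem's final answer programmatically and compare it exactly, with no human judging involved. Below is a minimal sketch of that kind of check on a toy problem; the function names and grading flow are assumptions for illustration, not DeepMind's actual harness:

```python
from fractions import Fraction

# Toy example of a "code-checkable" problem: the grader recomputes the
# ground-truth value and compares it to the agent's final answer exactly
# (rational arithmetic, no fuzzy string matching or float tolerance).

def ground_truth() -> Fraction:
    # e.g. sum of 1/k^2 for k = 1..10, recomputable by the grader
    return sum(Fraction(1, k * k) for k in range(1, 11))

def check(candidate: Fraction) -> bool:
    # Exact equality: symbolic/rational arithmetic avoids float noise.
    return candidate == ground_truth()

if __name__ == "__main__":
    agent_answer = Fraction(1968329, 1270080)  # value reported by the agent
    print("PASS" if check(agent_answer) else "FAIL")
```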
Technical Architecture
The workbench replaces the traditional chatbot interface with a hierarchical agent structure designed for long-running exploration. A Project Coordinator breaks high-level goals into parallel workstreams. Dedicated Workstream Agents then execute specific tasks, including conducting literature reviews, developing code libraries, and searching for counterexamples.
Reviewer Agents provide automated peer review throughout the process: they check outputs for logical consistency, verify that invoked theorems' assumptions are satisfied, and validate citations. The architecture also relies on a stateful workspace that preserves failed attempts and messy iterations. Instead of discarding incorrect paths, the system lets human researchers inspect why specific mathematical approaches failed and redirect the agents accordingly.
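The coordinator-workstream-reviewer split is a hierarchical multi-agent pattern that can be sketched in a few dozen lines. The following skeleton assumes a generic `llm` callable; every class and function name is hypothetical, since DeepMind has not published the workbench's code:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the coordinator -> workstream -> reviewer loop.
# All names are hypothetical; this is not DeepMind's implementation.

@dataclass
class Attempt:
    task: str
    output: str
    accepted: bool
    reviewer_notes: str

@dataclass
class Workspace:
    # Stateful workspace: failed attempts are preserved, not discarded,
    # so a human can later inspect why an approach was rejected.
    attempts: list[Attempt] = field(default_factory=list)

def run_workstream(task: str, workspace: Workspace, llm) -> Attempt | None:
    MAX_ROUNDS = 5  # guard against non-terminating review loops
    feedback = ""
    for _ in range(MAX_ROUNDS):
        output = llm(f"Task: {task}\nReviewer feedback so far: {feedback}")
        notes = llm("Reply ACCEPT or REJECT with reasons. Review this for "
                    f"logical gaps and unmet theorem assumptions:\n{output}")
        accepted = notes.strip().startswith("ACCEPT")
        attempt = Attempt(task, output, accepted, notes)
        workspace.attempts.append(attempt)  # keep even rejected attempts
        if accepted:
            return attempt
        feedback = notes
    return None  # give up and escalate to the human mathematician

def coordinate(goal: str, llm) -> Workspace:
    workspace = Workspace()
    # Coordinator splits the goal into workstreams (literature review,
    # code libraries, counterexample search, ...).
    tasks = llm(f"Split into subtasks, one per line: {goal}").splitlines()
    for task in tasks:
        run_workstream(task, workspace, llm)
    return workspace
```

The design choice mirrored here is that `Workspace.attempts` retains rejected outputs, which is exactly what lets a human salvage a "flawed but strategically brilliant" dead end like the one described in the next section.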
Solving Open Problems
DeepMind designed the system around a mathematician-in-the-loop paradigm. The research details several instances where professional mathematicians used the workbench to resolve open problems.
University of Oxford Professor Marc Lackenby used the system to resolve Problem 21.10 in the Kourovka Notebook, a longstanding open problem in group theory concerning finite presentations. Lackenby spotted a flawed but strategically brilliant proof in a rejected AI output, which he then corrected by hand and validated with the system. Mathematician Gergely Bérczi used the workbench to prove a conjecture about the Stirling coefficients of symmetric power representations, while Semon Rezchikov used it to identify a key lemma in Hamiltonian systems, steering him away from a dead-end approach.
Production Implications
The system operates without a hard token limit, consuming a compute envelope comparable to a long AI-assisted software engineering session. This continuous-compute model shifts the focus from single-shot inference to sustained agentic reasoning.
If you build complex agent workflows, the DeepMind research highlights specific failure modes to monitor when evaluating and testing AI agents. The researchers noted a persistent reviewer-pleasing bias, where generation agents converged on flawed but plausible-looking arguments to satisfy the automated reviewers. The system also occasionally fell into non-terminating review loops. When deploying agents for high-stakes reasoning, structure your evaluation metrics to detect plausible but mathematically incorrect logic.
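Both failure modes lend themselves to cheap structural guards. Here is a hedged sketch of two such checks; the names and thresholds are illustrative assumptions, not values from the paper:

```python
import hashlib

# Two illustrative guards for the failure modes above.

def detect_review_loop(review_history: list[str], window: int = 3) -> bool:
    """Flag a non-terminating review loop: identical reviewer feedback
    (by content hash) recurring within the most recent window."""
    digests = [hashlib.sha256(r.encode()).hexdigest() for r in review_history]
    recent = digests[-window:]
    return len(recent) != len(set(recent))

def independent_check(argument: str, verifiers: list) -> bool:
    """Counter reviewer-pleasing bias: require a majority vote from
    verifiers that never saw the generator-reviewer dialogue."""
    votes = [verifier(argument) for verifier in verifiers]
    return sum(votes) > len(votes) // 2
```

The intuition behind the quorum check: verifiers that never saw the generator-reviewer exchange, ideally drawn from a different model family, are less likely to share the reviewer-pleasing drift the researchers observed.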