
FrontierMath Tier 4 Record Falls to DeepMind Co-Mathematician

Google DeepMind's AI Co-Mathematician agent workbench more than doubled the baseline Gemini 3.1 Pro score, reaching 48% on the FrontierMath Tier 4 benchmark.

Google DeepMind has introduced a collaborative research system based on the Gemini 3.1 model family that more than doubles baseline performance on advanced mathematics. The AI Co-Mathematician research details an agentic workbench designed for research-level problem solving. By preserving iterative reasoning and deploying specialized sub-agents, the system sets a new state-of-the-art record on Epoch AI's FrontierMath benchmark.

Benchmark Results

The AI Co-Mathematician was evaluated on the FrontierMath Tier 4 benchmark, a suite of 50 original, non-public research problems. The system solved 23 of the 48 valid problems, a 48% success rate. This result demonstrates the impact of multi-agent coordination patterns on reasoning tasks, as the underlying base model alone scores significantly lower.

System                              FrontierMath Tier 4 Score
AI Co-Mathematician (Gemini 3.1)    48.0%
Claude Opus 4.7 (Adaptive Mode)     43.8%
GPT-5.5 Pro                         39.6%
Claude Opus 4.7 (Standard)          22.9%
Gemini 3.1 Pro (Base)               19.0%

On a separate internal evaluation of 100 research problems with code-checkable answers, the agentic system scored 87%, compared to 70% for Gemini 3.1 Deep Think and 57% for the baseline Gemini 3.1 Pro.
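The phrase "code-checkable answers" means each problem's final answer can be verified deterministically by a program. The details of DeepMind's grading harness are not public; the sketch below is a minimal illustration of the idea, with all names and example problems invented for this purpose:

```python
# Illustrative sketch of grading "code-checkable" answers.
# CheckableProblem, grade, and the sample problems are hypothetical,
# not DeepMind's actual evaluation harness.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class CheckableProblem:
    statement: str
    checker: Callable[[Any], bool]  # returns True iff the answer is correct

def grade(problems: list[CheckableProblem], answers: dict[str, Any]) -> float:
    """Fraction of problems whose submitted answer passes its checker."""
    solved = sum(
        1 for p in problems
        if p.statement in answers and p.checker(answers[p.statement])
    )
    return solved / len(problems)

problems = [
    CheckableProblem("smallest prime > 100", lambda a: a == 101),
    CheckableProblem("sum of divisors of 28", lambda a: a == 56),
]
# One correct answer, one wrong answer -> 50% on this toy set.
score = grade(problems, {"smallest prime > 100": 101,
                         "sum of divisors of 28": 55})
```

Deterministic checkers like these let an evaluation scale without human graders, which is why this internal suite could cover 100 research problems.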

Technical Architecture

The workbench replaces the traditional chatbot interface with a hierarchical agent structure designed for long-running exploration. A Project Coordinator breaks high-level goals into parallel workstreams. Dedicated Workstream Agents then execute specific tasks, including conducting literature reviews, developing code libraries, and searching for counterexamples.
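The coordinator-to-workstream split described above is a standard hierarchical agent pattern. The following sketch shows its shape under stated assumptions; the class names, task types, and the thread-based dispatch are illustrative, since DeepMind has not published the workbench's internals:

```python
# Minimal sketch of a Project Coordinator fanning a goal out to
# parallel Workstream Agents. All names here are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Workstream:
    name: str   # e.g. "literature_review", "counterexample_search"
    task: str

def run_workstream(ws: Workstream) -> str:
    # Placeholder for a Workstream Agent: in a real system this would
    # call a model with tools (search, code execution, proof checking).
    return f"{ws.name}: completed '{ws.task}'"

class ProjectCoordinator:
    """Breaks a high-level goal into parallel workstreams, gathers results."""

    def decompose(self, goal: str) -> list[Workstream]:
        return [
            Workstream("literature_review", f"survey prior work on {goal}"),
            Workstream("code_library", f"build computational tools for {goal}"),
            Workstream("counterexample_search", f"probe {goal} for counterexamples"),
        ]

    def run(self, goal: str) -> list[str]:
        streams = self.decompose(goal)
        with ThreadPoolExecutor() as pool:
            return list(pool.map(run_workstream, streams))

reports = ProjectCoordinator().run("the conjecture")
```

The key design point is that workstreams are independent, so a slow literature review does not block a counterexample search running alongside it.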

Reviewer Agents provide automated peer review throughout the process. They check outputs for logical consistency, theorem assumptions, and citation validity. The architecture also relies on a stateful workspace that preserves failed attempts and messy iterations. Instead of discarding incorrect paths, the system allows human researchers to inspect why specific mathematical approaches failed and redirect the agents accordingly.
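A workspace that retains rejected work can be sketched as a simple append-only log of attempts. The data model below is an assumption for illustration, not DeepMind's schema; its point is that failures stay queryable for the human in the loop:

```python
# Sketch of a stateful workspace that preserves failed attempts instead
# of discarding them. Attempt/Workspace and their fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Attempt:
    approach: str
    outcome: str          # e.g. "verified" or "rejected"
    reviewer_note: str = ""

@dataclass
class Workspace:
    attempts: list[Attempt] = field(default_factory=list)

    def record(self, approach: str, outcome: str, note: str = "") -> None:
        self.attempts.append(Attempt(approach, outcome, note))

    def failures(self) -> list[Attempt]:
        """Let a human inspect why specific approaches were rejected."""
        return [a for a in self.attempts if a.outcome == "rejected"]

ws = Workspace()
ws.record("induction on rank", "rejected", "base case assumption unjustified")
ws.record("spectral argument", "verified")
failed = ws.failures()
```

Keeping the rejection notes attached to each dead end is what lets a researcher redirect the agents rather than restart from scratch.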

Solving Open Problems

DeepMind designed the system around a mathematician-in-the-loop paradigm. The research details several instances where professional mathematicians used the workbench to resolve previously unsolved open problems.

University of Oxford Professor Marc Lackenby used the system to resolve Problem 21.10 in the Kourovka Notebook, a longstanding open problem in group theory concerning finite presentations. Lackenby identified a flawed but strategically brilliant proof in a rejected output from the AI, which he then manually corrected and validated using the system. Mathematician Gergely Bérczi also used the workbench to prove a conjecture regarding the Stirling coefficients of symmetric power representations, while Semon Rezchikov used it to identify a key lemma in Hamiltonian systems, steering him away from a dead-end approach.

Production Implications

The system operates without a hard token limit, consuming a compute envelope comparable to a long AI-assisted software engineering session. This continuous compute model shifts the focus from single-shot inference to sustained agentic reasoning.

If you build complex agent workflows, the DeepMind research highlights specific failure modes to monitor when you evaluate and test AI agents. The researchers noted a persistent reviewer-pleasing bias where generation agents would converge on flawed but plausible-looking arguments to satisfy the automated reviewers. The system also occasionally fell into non-terminating review loops. Structuring your evaluation metrics to detect plausible but mathematically incorrect logic is necessary when deploying agents for high-stakes reasoning.
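Both failure modes above admit cheap structural guards: a hard cap on review rounds, and a check that the generator is actually changing its argument rather than cycling. The sketch below is a generic pattern, not DeepMind's implementation; the toy generator and reviewer are invented for the example:

```python
# Illustrative guards against the two failure modes described above:
# a cap on review rounds, and detection of a generator that cycles
# instead of improving. review_loop is a hypothetical helper.
def review_loop(generate, review, max_rounds: int = 5):
    """Alternate generation and review; stop on approval, repetition, or cap."""
    seen: set[str] = set()
    feedback = None
    for _ in range(max_rounds):
        argument = generate(feedback)
        if argument in seen:      # generator is cycling, not improving
            return None
        seen.add(argument)
        ok, feedback = review(argument)
        if ok:
            return argument
    return None                   # non-terminating loop cut off

# Toy generator that needs two rounds to satisfy a toy reviewer.
drafts = iter(["handwavy step 3", "handwavy step 3 with lemma A"])
result = review_loop(
    generate=lambda fb: next(drafts),
    review=lambda arg: ("lemma A" in arg, "justify step 3"),
)
```

Note that neither guard catches reviewer-pleasing bias itself: an argument can be novel each round and still be plausible-but-wrong, which is why an independent verification metric remains essential.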
