Google DeepMind Unveils AGI Cognitive Evaluation Framework and Launches $200,000 Kaggle Hackathon
Google DeepMind introduced a 10-faculty framework for measuring AGI progress and opened a $200,000 Kaggle evaluation hackathon.
Google DeepMind published a new AGI evaluation framework on March 17, 2026, and paired it with a $200,000 Kaggle competition to build missing cognitive benchmarks. The framework measures capabilities rather than mechanisms, which matters if you evaluate frontier models, build agents, or rely on benchmark scores to make product decisions.
Cognitive taxonomy
DeepMind organizes AGI-related measurement into 10 cognitive faculties: Perception, Generation, Attention, Learning, Memory, Reasoning, Metacognition, Executive functions, Problem solving, and Social cognition.
The important design choice is scope. This framework measures what a system can do, not the mechanism it uses to do it. For developers, that aligns more closely with deployment reality. If your application depends on planning, memory retention, or social reasoning, the operational question is capability under test conditions, not architectural purity.
The paper goes beyond the high-level labels. Generation includes text, audio, action, and thought generation. Learning includes concept formation, associative learning, reinforcement learning, observational learning, procedural learning, and language learning. This is much closer to a cognitive profile than a leaderboard score.
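One way to use the taxonomy in an eval suite is to tag each task with the faculty it targets and then check which faculties have no coverage. The sketch below is an illustration, not DeepMind's schema: the `Faculty` enum mirrors the ten names listed above, `SUB_ABILITIES` holds only the sub-abilities this article mentions, and `group_tasks` and `uncovered_faculties` are hypothetical helpers.

```python
from enum import Enum


class Faculty(Enum):
    """The ten cognitive faculties named in the framework."""
    PERCEPTION = "Perception"
    GENERATION = "Generation"
    ATTENTION = "Attention"
    LEARNING = "Learning"
    MEMORY = "Memory"
    REASONING = "Reasoning"
    METACOGNITION = "Metacognition"
    EXECUTIVE_FUNCTIONS = "Executive functions"
    PROBLEM_SOLVING = "Problem solving"
    SOCIAL_COGNITION = "Social cognition"


# Only the sub-abilities named in this article; the paper's breakdown of the
# other faculties is not reproduced here.
SUB_ABILITIES = {
    Faculty.GENERATION: ["text", "audio", "action", "thought"],
    Faculty.LEARNING: [
        "concept formation",
        "associative learning",
        "reinforcement learning",
        "observational learning",
        "procedural learning",
        "language learning",
    ],
}


def group_tasks(tasks):
    """Group (task_name, faculty) pairs by faculty (hypothetical helper)."""
    grouped = {faculty: [] for faculty in Faculty}
    for name, faculty in tasks:
        grouped[faculty].append(name)
    return grouped


def uncovered_faculties(tasks):
    """Return the faculties that no task in the suite targets."""
    grouped = group_tasks(tasks)
    return [faculty for faculty in Faculty if not grouped[faculty]]


# Hypothetical eval suite with made-up task names.
suite = [
    ("multi-step arithmetic", Faculty.REASONING),
    ("long-document recall", Faculty.MEMORY),
]
missing = uncovered_faculties(suite)
print(len(missing), "faculties have no tasks yet")
```

Running a coverage check like this before adding more tasks makes it obvious when a suite is measuring two or three faculties and calling the result general capability.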
Evaluation protocol
DeepMind proposes a three-stage protocol:
| Stage | Purpose |
|---|---|
| Broad held-out cognitive tasks | Measure targeted abilities while reducing contamination |
| Human baselines on the same tasks | Create a direct comparison set |
| Mapping against the human distribution | Show relative strengths and weaknesses per faculty |
Two details matter technically.
First, the framework emphasizes held-out test sets and recommends independent third-party verification. If you work on evaluating AI agents, this reinforces a familiar lesson: benchmark integrity matters more as frontier models absorb public datasets and common eval patterns.
Second, DeepMind does not push a single AGI number. It proposes a cognitive profile that places a system's performance on each faculty within the distribution of human performance. That is a better fit for modern systems, which are already uneven. Models can be strong in coding or reasoning and still weak in learning over time, metacognition, or social cognition.
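The third stage of the protocol, mapping a system against the human distribution, can be sketched as a per-faculty percentile rank. This is a minimal illustration under stated assumptions: the article does not say how DeepMind computes the mapping, the mid-rank percentile used here is one common convention, and all scores and function names are made up for the example.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(human_scores, model_score):
    """Percent of human scores below the model's score, counting ties as half."""
    ordered = sorted(human_scores)
    if not ordered:
        raise ValueError("need at least one human baseline score")
    below = bisect_left(ordered, model_score)
    ties = bisect_right(ordered, model_score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


def cognitive_profile(human_baselines, model_scores):
    """Map each faculty to the model's percentile within the human sample."""
    return {
        faculty: percentile_rank(human_baselines[faculty], score)
        for faculty, score in model_scores.items()
    }


# Hypothetical data: task accuracy for five human participants per faculty.
human_baselines = {
    "Reasoning": [0.4, 0.5, 0.6, 0.7, 0.8],
    "Metacognition": [0.5, 0.6, 0.7, 0.8, 0.9],
}
model_scores = {"Reasoning": 0.75, "Metacognition": 0.55}

profile = cognitive_profile(human_baselines, model_scores)
print(profile)  # {'Reasoning': 80.0, 'Metacognition': 20.0}
```

The output is a profile rather than a score: the same hypothetical model sits at the 80th percentile on one faculty and the 20th on another, which is the unevenness a single aggregate number would hide.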
Human-relative measurement
The framework anchors comparisons to a demographically representative sample of adults with at least the equivalent of an upper secondary education.
This is a notable shift from a benchmark culture that often treats abstract task accuracy as sufficient. Human baselines introduce variance, distributional context, and a more legible way to discuss system capability. If you build products for end users rather than for benchmark leaderboards, relative-to-human performance is often the comparison your stakeholders actually care about.
It also sharpens evaluation for systems with memory and adaptation. DeepMind explicitly argues that robust AI should be able to learn and retain new knowledge and skills over time, not only during pretraining or via in-context learning. That connects directly to production work on agent memory and the broader discipline of context engineering.
Kaggle competition structure
DeepMind is using Kaggle’s Community Benchmarks product as the implementation layer for the public part of this effort. The competition opened March 17, accepts submissions through April 16, and is scheduled to announce results on June 1.
The prize structure is substantial enough to attract serious participation:
| Competition element | Details |
|---|---|
| Submission window | March 17 to April 16, 2026 |
| Results announcement | June 1, 2026 |
| Total prize pool | $200,000 |
| Track prizes | $10,000 for the top two submissions in each of five tracks |
| Grand prizes | Four awards of $25,000 |
The five target areas are Learning, Metacognition, Attention, Executive functions, and Social cognition. DeepMind identifies these as the biggest current evaluation gaps.
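The prize table above can be sanity-checked with a few lines of arithmetic. The figures come straight from the table and the track list; the variable names are only for illustration.

```python
TRACKS = [
    "Learning",
    "Metacognition",
    "Attention",
    "Executive functions",
    "Social cognition",
]
TRACK_PRIZE = 10_000       # per winning submission
WINNERS_PER_TRACK = 2      # top two submissions in each track
GRAND_PRIZE = 25_000
GRAND_PRIZE_COUNT = 4

track_total = len(TRACKS) * WINNERS_PER_TRACK * TRACK_PRIZE
grand_total = GRAND_PRIZE_COUNT * GRAND_PRIZE
total = track_total + grand_total

print(track_total, grand_total, total)  # 100000 100000 200000
```

The pool splits evenly: half goes to the ten track-level awards and half to the four grand prizes.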
This choice is revealing. Benchmarks for math, code, and static reasoning are already crowded. The missing layer is persistent learning, self-monitoring, task control, and socially situated judgment. These are exactly the capabilities that start to matter when you move from chatbot demos to long-running agents. If your team builds AI agents or compares agent frameworks, these are the areas where conventional evals often miss failures.
Position in DeepMind’s AGI work
This framework extends DeepMind’s earlier “Levels of AGI” framing with a more operational measurement layer. The practical change is that AGI discussion gets translated into test design, human comparison, and benchmark governance.
There is no model launch here, and no single score for Gemini or any other system. The release is about measurement infrastructure. That matters because evaluation standards shape which capabilities get optimized, reported, and funded.
If you run model evals, the useful move is to stop treating intelligence as one leaderboard column. Build capability profiles, include human baselines where feasible, and add tests for learning, metacognition, and executive control before those gaps become production incidents.