Google DeepMind Unveils AGI Cognitive Evaluation Framework and Launches $200,000 Kaggle Hackathon
Google DeepMind introduced a 10-faculty framework for measuring AGI progress and opened a $200,000 Kaggle evaluation hackathon.
Google DeepMind published a new AGI evaluation framework on March 17 and paired it with a $200,000 Kaggle competition to build missing cognitive benchmarks. The release centers on "Measuring progress toward AGI: A cognitive framework," a capability-focused measurement scheme that matters if you evaluate frontier models, build agents, or rely on benchmark scores to make product decisions.
Cognitive taxonomy
DeepMind organizes AGI-related measurement into 10 cognitive faculties: Perception, Generation, Attention, Learning, Memory, Reasoning, Metacognition, Executive functions, Problem solving, and Social cognition.
The important design choice is scope. This framework measures what a system can do, not the mechanism it uses to do it. For developers, that aligns more closely with deployment reality. If your application depends on planning, memory retention, or social reasoning, the operational question is capability under test conditions, not architectural purity.
The paper goes beyond the high-level labels. Generation includes text, audio, action, and thought generation. Learning includes concept formation, associative learning, reinforcement learning, observational learning, procedural learning, and language learning. This is much closer to a cognitive profile than a leaderboard score.
Evaluation protocol
DeepMind proposes a three-stage protocol:
| Stage | Purpose |
|---|---|
| Broad held-out cognitive tasks | Measure targeted abilities while reducing contamination |
| Human baselines on the same tasks | Create a direct comparison set |
| Mapping against the human distribution | Show relative strengths and weaknesses per faculty |
Two technical details stand out.
First, the framework emphasizes held-out test sets and recommends independent third-party verification. If you work on How to Evaluate and Test AI Agents, this reinforces a familiar lesson: benchmark integrity matters more as frontier models absorb public datasets and common eval patterns.
Second, DeepMind does not push a single AGI number. It proposes a cognitive profile relative to the human performance distribution. That is a better fit for modern systems, which are already uneven. Models can be strong in coding or reasoning and still weak in learning over time, metacognition, or social cognition.
Human-relative measurement
The framework anchors comparisons to a demographically representative sample of adults with at least the equivalent of an upper secondary education.
This is a notable shift from benchmark culture that often treats abstract task accuracy as sufficient. Human baselines introduce variance, distributional context, and a more legible way to discuss system capability. If you build products for end users rather than benchmark leaderboards, relative-to-human performance is often the comparison your stakeholders actually care about.
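The mechanics of a human-anchored profile can be sketched in a few lines. Everything below is an assumption for illustration: the two faculty names come from the framework, but the empirical-percentile scoring, the baseline numbers, and the helper name `percentile_vs_humans` are made up, not DeepMind's published method.

```python
# Sketch: score a model per faculty as a percentile of a human baseline
# sample. Faculty names are from DeepMind's framework; the scoring rule,
# the data, and the function name are illustrative assumptions.

def percentile_vs_humans(model_score: float, human_scores: list[float]) -> float:
    """Percentage of the human baseline sample the model meets or exceeds."""
    at_or_below = sum(1 for h in human_scores if h <= model_score)
    return 100.0 * at_or_below / len(human_scores)

# Made-up per-faculty task accuracies for a small human sample.
human_baselines = {
    "Reasoning":     [0.55, 0.60, 0.70, 0.75, 0.80, 0.85],
    "Metacognition": [0.40, 0.50, 0.55, 0.60, 0.70, 0.75],
}
model_scores = {"Reasoning": 0.82, "Metacognition": 0.45}

# The output is a profile: one human-relative number per faculty,
# which surfaces unevenness a single aggregate score would hide.
profile = {
    faculty: percentile_vs_humans(model_scores[faculty], scores)
    for faculty, scores in human_baselines.items()
}
for faculty, pct in profile.items():
    print(f"{faculty}: at or above {pct:.0f}% of the human sample")
```

With fabricated numbers like these, the profile makes the uneven shape visible: strong on Reasoning, weak on Metacognition, which is exactly the kind of gap a single leaderboard score averages away.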
It also sharpens evaluation for systems with memory and adaptation. DeepMind explicitly argues that robust AI should be able to learn and retain new knowledge and skills over time, not only during pretraining or via in-context learning. That connects directly to production work on How to Add Memory to AI Agents and the broader discipline of Context Engineering: The Most Important AI Skill in 2026.
Kaggle competition structure
DeepMind is using Kaggle’s Community Benchmarks product as the implementation layer for the public part of this effort. The competition opened March 17, accepts submissions through April 16, and lists results for June 1.
The prize structure is substantial enough to attract serious participation:
| Competition element | Details |
|---|---|
| Submission window | March 17 to April 16, 2026 |
| Results announcement | June 1, 2026 |
| Total prize pool | $200,000 |
| Track prizes | $10,000 for the top two submissions in each of five tracks |
| Grand prizes | Four awards of $25,000 |
The five target areas are Learning, Metacognition, Attention, Executive functions, and Social cognition. DeepMind identifies these as the biggest current evaluation gaps.
This choice is revealing. Benchmarks for math, code, and static reasoning are already crowded. The missing layer is persistent learning, self-monitoring, task control, and socially situated judgment: exactly the capabilities that start to matter when you move from chatbot demos to long-running agents. If your team works on What Are AI Agents and How Do They Work? or compares orchestration stacks in AI Agent Frameworks Compared: LangChain vs CrewAI vs LlamaIndex, these are the failure modes that conventional evals often miss.
Position in DeepMind’s AGI work
This framework extends DeepMind’s earlier “Levels of AGI” framing with a more operational measurement layer. The practical change is that AGI discussion gets translated into test design, human comparison, and benchmark governance.
There is no model launch here, and no single score for Gemini or any other system. The release is about measurement infrastructure. That matters because evaluation standards shape which capabilities get optimized, reported, and funded.
If you run model evals, the useful move is to stop treating intelligence as one leaderboard column. Build capability profiles, include human baselines where feasible, and add tests for learning, metacognition, and executive control before those gaps become production incidents.