Google DeepMind Unveils AGI Cognitive Evaluation Framework and Launches $200,000 Kaggle Hackathon
Google DeepMind introduced a 10-faculty framework for measuring AGI progress and opened a $200,000 Kaggle evaluation hackathon.
Google DeepMind published a new AGI evaluation framework on March 17 and paired it with a $200,000 Kaggle competition to build missing cognitive benchmarks. The release centers on "Measuring progress toward AGI: A cognitive framework," a capability-focused measurement scheme that matters if you evaluate frontier models, build agents, or rely on benchmark scores to make product decisions.
Cognitive taxonomy
DeepMind organizes AGI-related measurement into 10 cognitive faculties: Perception, Generation, Attention, Learning, Memory, Reasoning, Metacognition, Executive functions, Problem solving, and Social cognition.
The important design choice is scope. This framework measures what a system can do, not the mechanism it uses to do it. For developers, that aligns more closely with deployment reality. If your application depends on planning, memory retention, or social reasoning, the operational question is capability under test conditions, not architectural purity.
The paper goes beyond the high-level labels. Generation includes text, audio, action, and thought generation. Learning includes concept formation, associative learning, reinforcement learning, observational learning, procedural learning, and language learning. This is much closer to a cognitive profile than a leaderboard score.
Evaluation protocol
DeepMind proposes a three-stage protocol:
| Stage | Purpose |
|---|---|
| Broad held-out cognitive tasks | Measure targeted abilities while reducing contamination |
| Human baselines on the same tasks | Create a direct comparison set |
| Mapping against the human distribution | Show relative strengths and weaknesses per faculty |
Two technical details stand out.
First, the framework emphasizes held-out test sets and recommends independent third-party verification. If you work on How to Evaluate and Test AI Agents, this reinforces a familiar lesson: benchmark integrity matters more as frontier models absorb public datasets and common eval patterns.
Second, DeepMind does not push a single AGI number. It proposes a cognitive profile relative to the human performance distribution. That is a better fit for modern systems, which are already uneven. Models can be strong in coding or reasoning and still weak in learning over time, metacognition, or social cognition.
Human-relative measurement
The framework anchors comparisons to a demographically representative sample of adults with at least the equivalent of an upper secondary education.
This is a notable shift from benchmark culture that often treats abstract task accuracy as sufficient. Human baselines introduce variance, distributional context, and a more legible way to discuss system capability. If you build products for end users rather than benchmark leaderboards, relative-to-human performance is often the comparison your stakeholders actually care about.
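The mechanics of a human-anchored profile can be sketched in a few lines. Everything below is an assumption for illustration: the two faculty names come from the framework, but the empirical-percentile scoring, the baseline numbers, and the helper name `percentile_vs_humans` are made up, not DeepMind's published method.

```python
# Sketch: score a model per faculty as a percentile of a human baseline
# sample. Faculty names are from DeepMind's framework; the scoring rule,
# the data, and the function name are illustrative assumptions.

def percentile_vs_humans(model_score: float, human_scores: list[float]) -> float:
    """Percentage of the human baseline sample the model meets or exceeds."""
    at_or_below = sum(1 for h in human_scores if h <= model_score)
    return 100.0 * at_or_below / len(human_scores)

# Made-up per-faculty task accuracies for a small human sample.
human_baselines = {
    "Reasoning":     [0.55, 0.60, 0.70, 0.75, 0.80, 0.85],
    "Metacognition": [0.40, 0.50, 0.55, 0.60, 0.70, 0.75],
}
model_scores = {"Reasoning": 0.82, "Metacognition": 0.45}

# The output is a profile: one human-relative number per faculty,
# which surfaces unevenness a single aggregate score would hide.
profile = {
    faculty: percentile_vs_humans(model_scores[faculty], scores)
    for faculty, scores in human_baselines.items()
}
for faculty, pct in profile.items():
    print(f"{faculty}: at or above {pct:.0f}% of the human sample")
```

With fabricated numbers like these, the profile makes the uneven shape visible: strong on Reasoning, weak on Metacognition, which is exactly the kind of gap a single leaderboard score averages away.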
It also sharpens evaluation for systems with memory and adaptation. DeepMind explicitly argues that robust AI should be able to learn and retain new knowledge and skills over time, not only during pretraining or via in-context learning. That connects directly to production work on How to Add Memory to AI Agents and the broader discipline of Context Engineering: The Most Important AI Skill in 2026.
Kaggle competition structure
DeepMind is using Kaggle’s Community Benchmarks product as the implementation layer for the public part of this effort. The competition opened March 17, accepts submissions through April 16, and lists results for June 1.
The prize structure is substantial enough to attract serious participation:
| Competition element | Details |
|---|---|
| Submission window | March 17 to April 16, 2026 |
| Results announcement | June 1, 2026 |
| Total prize pool | $200,000 |
| Track prizes | $10,000 for the top two submissions in each of five tracks |
| Grand prizes | Four awards of $25,000 |
The five target areas are Learning, Metacognition, Attention, Executive functions, and Social cognition. DeepMind identifies these as the biggest current evaluation gaps.
This choice is revealing. Benchmarks for math, code, and static reasoning are already crowded. The missing layer is persistent learning, self-monitoring, task control, and socially situated judgment: exactly the capabilities that start to matter when you move from chatbot demos to long-running agents. If your team works on What Are AI Agents and How Do They Work? or compares orchestration stacks in AI Agent Frameworks Compared: LangChain vs CrewAI vs LlamaIndex, these are the failure modes that conventional evals often miss.
Position in DeepMind’s AGI work
This framework extends DeepMind’s earlier “Levels of AGI” framing with a more operational measurement layer. The practical change is that AGI discussion gets translated into test design, human comparison, and benchmark governance.
There is no model launch here, and no single score for Gemini or any other system. The release is about measurement infrastructure. That matters because evaluation standards shape which capabilities get optimized, reported, and funded.
If you run model evals, the useful move is to stop treating intelligence as one leaderboard column. Build capability profiles, include human baselines where feasible, and add tests for learning, metacognition, and executive control before those gaps become production incidents.