
Google DeepMind Unveils AGI Cognitive Evaluation Framework and Launches $200,000 Kaggle Hackathon

Google DeepMind introduced a 10-faculty framework for measuring AGI progress and opened a $200,000 Kaggle evaluation hackathon.

Google DeepMind published a new AGI evaluation framework on March 17 and paired it with a $200,000 Kaggle competition to build missing cognitive benchmarks. The release centers on a capability-focused measurement scheme that matters if you evaluate frontier models, build agents, or rely on benchmark scores to make product decisions.

Cognitive taxonomy

DeepMind organizes AGI-related measurement into 10 cognitive faculties: Perception, Generation, Attention, Learning, Memory, Reasoning, Metacognition, Executive functions, Problem solving, and Social cognition.

The important design choice is scope. This framework measures what a system can do, not the mechanism it uses to do it. For developers, that aligns more closely with deployment reality. If your application depends on planning, memory retention, or social reasoning, the operational question is capability under test conditions, not architectural purity.

The paper goes beyond the high-level labels. Generation includes text, audio, action, and thought generation. Learning includes concept formation, associative learning, reinforcement learning, observational learning, procedural learning, and language learning. This is much closer to a cognitive profile than a leaderboard score.
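To make that concrete, here is a minimal sketch of how a team might represent the taxonomy in its own eval tooling. The faculty and sub-capability names follow the paper's labels; the data structure itself is illustrative, not DeepMind's schema, and sub-capabilities are filled in only where the article lists them.

```python
# Illustrative only: one way to encode the 10-faculty taxonomy for eval tooling.
FACULTIES = {
    "perception": [],
    "generation": ["text", "audio", "action", "thought"],
    "attention": [],
    "learning": [
        "concept_formation", "associative", "reinforcement",
        "observational", "procedural", "language",
    ],
    "memory": [],
    "reasoning": [],
    "metacognition": [],
    "executive_functions": [],
    "problem_solving": [],
    "social_cognition": [],
}

# A cognitive profile is then a score per faculty (or per sub-capability),
# not a single aggregate number.
profile = {faculty: None for faculty in FACULTIES}
```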

Evaluation protocol

DeepMind proposes a three-stage protocol:

1. Broad held-out cognitive tasks, to measure targeted abilities while reducing contamination.
2. Human baselines on the same tasks, to create a direct comparison set.
3. Mapping against the human distribution, to show relative strengths and weaknesses per faculty.
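A minimal sketch of how that protocol could be wired up in eval code follows. The task format, scoring, and function names are assumptions for illustration, not DeepMind's implementation.

```python
def run_task(model, task):
    # Placeholder scorer: run one held-out task and return a score in [0, 1].
    return model(task)

def percentile_vs_humans(score, human_scores):
    # Stage 3 helper: where the model's score falls in the human distribution.
    return 100 * sum(h <= score for h in human_scores) / len(human_scores)

def cognitive_profile(model, heldout_tasks, human_baselines):
    """Hypothetical three-stage evaluation.

    heldout_tasks:   faculty -> list of held-out tasks            (stage 1)
    human_baselines: faculty -> human scores on the same tasks    (stage 2)
    Returns a per-faculty profile rather than one aggregate score (stage 3).
    """
    profile = {}
    for faculty, tasks in heldout_tasks.items():
        model_score = sum(run_task(model, t) for t in tasks) / len(tasks)
        profile[faculty] = {
            "score": model_score,
            "human_percentile": percentile_vs_humans(
                model_score, human_baselines[faculty]
            ),
        }
    return profile
```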

Two details matter technically.

First, the framework emphasizes held-out test sets and recommends independent third-party verification. If you work on evaluating AI agents, this reinforces a familiar lesson: benchmark integrity matters more as frontier models absorb public datasets and common eval patterns.

Second, DeepMind does not push a single AGI number. It proposes a cognitive profile relative to the human performance distribution. That is a better fit for modern systems, which are already uneven. Models can be strong in coding or reasoning and still weak in learning over time, metacognition, or social cognition.

Human-relative measurement

The framework anchors comparisons to a demographically representative sample of adults with at least the equivalent of an upper secondary education.

This is a notable shift from a benchmark culture that often treats abstract task accuracy as sufficient. Human baselines introduce variance, distributional context, and a more legible way to discuss system capability. If you build products for end users rather than benchmark leaderboards, relative-to-human performance is often the comparison your stakeholders actually care about.
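To see why distributional context matters, here is a toy example with made-up numbers: the same raw accuracy can land at very different points of the human distribution depending on the faculty.

```python
# Made-up numbers for illustration only.
human_scores = {
    "reasoning":        [0.55, 0.60, 0.62, 0.65, 0.70, 0.72, 0.75, 0.80, 0.85, 0.90],
    "social_cognition": [0.80, 0.82, 0.85, 0.86, 0.88, 0.90, 0.91, 0.93, 0.95, 0.97],
}
model_accuracy = 0.78

for faculty, humans in human_scores.items():
    percentile = 100 * sum(h <= model_accuracy for h in humans) / len(humans)
    # Same accuracy, very different human-relative standing per faculty.
    print(f"{faculty}: {model_accuracy:.2f} accuracy -> "
          f"{percentile:.0f}th percentile of the human sample")
```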

It also sharpens evaluation for systems with memory and adaptation. DeepMind explicitly argues that robust AI should be able to learn and retain new knowledge and skills over time, not only during pretraining or via in-context learning. That connects directly to production work on agent memory and the broader discipline of context engineering.
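As a rough sketch of what "learns and retains over time" could mean as a test, assume a hypothetical agent interface with persistent memory across sessions; `start_session`, `tell`, `run`, and `ask` below are placeholders, not a real API.

```python
def retention_check(agent, fact, probe_question, expected_answer, filler_tasks):
    """Hypothetical check: teach a fact in one session, run unrelated work in
    later sessions, then probe whether the fact survived outside the context window."""
    agent.start_session()
    agent.tell(fact)               # teach the new fact once
    agent.end_session()

    for task in filler_tasks:      # unrelated sessions in between
        agent.start_session()
        agent.run(task)
        agent.end_session()

    agent.start_session()          # fresh context: the fact is not re-supplied
    answer = agent.ask(probe_question)
    agent.end_session()
    return expected_answer.lower() in answer.lower()
```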

Kaggle competition structure

DeepMind is using Kaggle’s Community Benchmarks product as the implementation layer for the public part of this effort. The competition opened March 17, accepts submissions through April 16, and announces results on June 1.

The prize structure is substantial enough to attract serious participation:

Submission window: March 17 to April 16, 2026
Results announcement: June 1, 2026
Total prize pool: $200,000
Track prizes: $10,000 for the top two submissions in each of five tracks
Grand prizes: four awards of $25,000

The five target areas are Learning, Metacognition, Attention, Executive functions, and Social cognition. DeepMind identifies these as the biggest current evaluation gaps.

This choice is revealing. Benchmarks for math, code, and static reasoning are already crowded. The missing layer is persistent learning, self-monitoring, task control, and socially situated judgment, exactly the capabilities that start to matter when you move from chatbot demos to long-running agents. If your team builds AI agents or compares agent frameworks, these are the failure modes that conventional evals often miss.

Position in DeepMind’s AGI work

This framework extends DeepMind’s earlier “Levels of AGI” framing with a more operational measurement layer. The practical change is that AGI discussion gets translated into test design, human comparison, and benchmark governance.

There is no model launch here, and no single score for Gemini or any other system. The release is about measurement infrastructure. That matters because evaluation standards shape which capabilities get optimized, reported, and funded.

If you run model evals, the useful move is to stop treating intelligence as one leaderboard column. Build capability profiles, include human baselines where feasible, and add tests for learning, metacognition, and executive control before those gaps become production incidents.
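For the metacognition piece specifically, a cheap starting point is calibration: have the model report a confidence with each answer and check how well that confidence tracks correctness. The sketch below computes a simple expected calibration error; the equal-width binning scheme is a common convention, not something prescribed by the framework.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Simple ECE: bucket answers by stated confidence, then compare the
    average confidence in each bucket to the observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Example: stated confidences and whether each answer was actually correct.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [True, False, True, True]))
```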
