
GadgetBond


Google DeepMind maps a new way to score AI systems on the road to AGI

Google DeepMind is done with vague AGI talk and is rolling out a cognitive scorecard that measures what today’s models can actually do across ten human‑style abilities.

By Shubham Sawarkar, Editor-in-Chief
Mar 18, 2026, 7:43 AM EDT
Image: Google. Diagram of ten cognitive abilities arranged around the label “Cognitive Abilities”: perception, generation, attention, learning, memory, reasoning, metacognition, executive functions, problem solving, and social cognition.

Google DeepMind is trying to answer one of AI’s most uncomfortable questions: how do you actually tell if you’re getting closer to “real” general intelligence, and not just building a better autocomplete? Instead of chasing a single magic number or leaderboard, the team is now proposing a cognitive-style blueprint for measuring progress toward AGI – and they’re turning it into a public Kaggle challenge with serious prize money behind it.

At the heart of this move is a simple but quietly radical shift: treat AI systems less like black boxes that happen to get high scores on benchmarks, and more like students sitting for a very broad cognitive exam. DeepMind’s new paper, “Measuring Progress Toward AGI: A Cognitive Taxonomy,” leans heavily on decades of psychology and neuroscience research to carve human intelligence into 10 core abilities — things like perception, learning, memory, reasoning, metacognition, and social cognition. The claim is not that AGI is “10 checkboxes away,” but that if you want an honest read on how general an AI system really is, you need to see how it behaves across this whole landscape, not just in a couple of cherry‑picked tasks.
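As a rough sketch, the taxonomy can be written down as a simple lookup table. The ability names come from the paper’s figure; the one-line glosses below are paraphrased from this article’s descriptions, not DeepMind’s formal definitions:

```python
# The ten abilities in DeepMind's cognitive taxonomy, with informal
# glosses paraphrased from the article (not the paper's own wording).
COGNITIVE_ABILITIES = {
    "perception": "take in text, audio, images, or video and parse it into something usable",
    "generation": "produce coherent text, code, speech, or actions in an environment",
    "attention": "selectively focus on what matters in a task instead of noise",
    "learning": "pick up new skills or concepts from experience or instruction",
    "memory": "store and retrieve information over time, especially long-term",
    "reasoning": "draw valid conclusions via deductive, inductive, analogical, or mathematical logic",
    "metacognition": "think about one's own thinking and self-monitor",
    "executive functions": "plan, inhibit bad impulses, and switch strategies when needed",
    "problem solving": "combine perception, learning, reasoning, and planning on domain tasks",
    "social cognition": "read other agents' intentions; cooperate, negotiate, persuade",
}
```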

Some of the abilities in the framework are pretty familiar from today’s models. Perception is basically how well a system can take in information from the world – text, audio, images, video – and parse it into something usable. Generation is what LLMs are famous for: producing coherent text, code, speech, or actions in an environment. Attention, in this context, is less about transformer architectures and more about whether a system can selectively focus on what matters in a task instead of being distracted by noise.

The more interesting – and frankly more uncomfortable – parts of the taxonomy live in the higher‑order stuff. Learning is framed as the ability to pick up new skills or concepts from experience or instruction, not just regurgitate what was baked in during pretraining. Memory is about storing and retrieving information over time, especially long‑term, which recent work suggests is still a glaring weakness in many large models despite their apparent knowledge. Reasoning is explicitly defined as drawing valid conclusions via logic, including deductive, inductive, analogical, and mathematical reasoning – and the paper makes a point that pattern‑matching shortcuts don’t count.

Then there’s metacognition and executive function – essentially “thinking about thinking” and the ability to plan, inhibit bad impulses, and switch strategies when needed. DeepMind’s taxonomy treats these as separate from raw problem solving, which is described as a composite ability that pulls together perception, learning, reasoning, and planning to actually crack domain‑specific challenges. Finally, social cognition covers everything from reading other agents’ intentions to cooperation, negotiation, and even persuasion or deception – which the authors explicitly flag as double‑edged in terms of safety risk.

A framework is only as good as the tests built on top of it, so DeepMind is pairing the theory with a concrete evaluation pipeline. The protocol they describe has three stages: first, build a broad suite of cognitive tasks for each of the 10 abilities; second, collect human baselines on those tasks from a demographically representative sample of adults; and third, map each AI system’s performance to the corresponding human distribution. In other words, the goal is not “GPT-X got 82% on benchmark Y,” but “this model is roughly median‑human on attention, above average on knowledge‑heavy reasoning, and far below human on metacognition and long‑term memory.”
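The third stage, mapping a model’s score onto the human distribution, is essentially an empirical percentile lookup. A minimal sketch of the idea (function and variable names are my own, not from the paper):

```python
def human_percentile(model_score: float, human_scores: list[float]) -> float:
    """Percent of the human baseline sample scoring at or below the model."""
    at_or_below = sum(1 for s in human_scores if s <= model_score)
    return 100.0 * at_or_below / len(human_scores)

def describe(pct: float) -> str:
    """Informal verbal label for a human-relative percentile."""
    if pct < 25:
        return "far below human"
    if pct < 60:
        return "roughly median-human"
    return "above average"
```

Run per ability, this yields exactly the kind of profile the paper has in mind – e.g. “roughly median‑human on attention, above average on reasoning, far below human on metacognition” – rather than a single benchmark percentage.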

If this sounds a bit like psychometrics for machines, that’s intentional. The framework borrows from long‑standing human intelligence models, such as the Cattell-Horn-Carroll theory that splits cognition into broad factors and narrower skills. Other AGI-evaluation efforts are already heading in this direction as well; for example, recent “AGI scoring” proposals talk about decomposing general intelligence into multiple cognitive axes and then aggregating them, sometimes even assigning notional “AGI percentages” to today’s frontier systems, while emphasizing how jagged and uneven those cognitive profiles still are. Across surveys and safety reports, there is a growing consensus that classic narrow benchmarks miss huge parts of the picture and that cognitively grounded batteries are a more honest way to track where these systems are actually strong or brittle.
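Collapsing such a profile into a single notional “AGI percentage,” as some of those proposals do, can be as crude as averaging the axes – though the spread between the strongest and weakest axis is arguably the more informative number. A toy illustration with invented scores (not from any actual evaluation):

```python
def aggregate(profile: dict[str, float]) -> tuple[float, float]:
    """Mean of per-axis human-relative scores, plus best-worst spread."""
    vals = list(profile.values())
    return sum(vals) / len(vals), max(vals) - min(vals)

# A deliberately "jagged" made-up profile: strong on reasoning, weak on memory.
profile = {"reasoning": 90.0, "attention": 60.0, "metacognition": 20.0, "memory": 10.0}
overall, spread = aggregate(profile)  # overall 45.0, spread 80.0
```

The point of the second number: a system averaging 45% of human level with an 80-point spread is a very different object from one scoring a uniform 45% everywhere, which is exactly the jaggedness these surveys keep emphasizing.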

What makes DeepMind’s announcement more than just another paper is the decision to crowdsource a big chunk of the actual tests via Kaggle. The “Measuring progress toward AGI: Cognitive abilities” competition asks participants to design benchmarks that isolate five of the hardest‑to‑measure abilities: learning, metacognition, attention, executive functions and social cognition. Kaggle’s new Community Benchmarks platform will then host these evaluations, run them against a line‑up of leading models, and keep them alive as reusable public tests rather than one‑off leaderboard stunts.

There is real money on the table. DeepMind and Kaggle are putting up a $200,000 prize pool, with $10,000 for the top two submissions in each of the five tracks and four grand prizes of $25,000 for the strongest overall entries. Submissions run from March 17 through April 16, with results expected June 1, giving researchers and hackers just under a month to turn cognitive theory into concrete, model‑breaking tasks. Kaggle is positioning this as a way to move beyond “does the model remember this fact?” toward questions like “can it actually reason, adapt, and self‑monitor under pressure?”

Zooming out, this push slots into a larger trend: the slow pivot from pure benchmark‑chasing to more robust, cognitively informed evaluations of AI systems. Benchmarks like ARC-AGI, for instance, already try to probe fluid intelligence – the ability to solve completely novel puzzles from a handful of examples – and have become de facto progress meters for abstract reasoning because they resist brute‑force memorization and simple scaling tricks. The emerging picture from those efforts is that modern models can score impressively on many static tests while still faltering badly when pushed into tasks that demand on‑the‑fly rule discovery, long‑horizon planning, or human‑like common sense.

DeepMind’s cognitive framework does not magically settle the AGI definition debate, and the authors are fairly explicit that general intelligence is continuous and multidimensional rather than a single on/off threshold. But it does give labs, regulators, and the broader research community a shared vocabulary: instead of arguing in the abstract about whether “AGI is near,” they can start talking about concrete trajectories – how quickly are systems closing the gap on metacognition, how far below human they remain on social cognition, and which abilities plateau even as models scale.

For everyday users and policymakers, the implications are straightforward but important. If general‑purpose AI continues to drive real‑world decisions in science, healthcare, finance, and public services, then knowing what kind of intelligence these systems actually have – and what kinds they clearly do not – becomes a safety and governance requirement, not an academic curiosity. DeepMind’s move essentially says: if the industry is going to talk seriously about AGI, it needs equally serious, cognitively grounded ways to measure progress and expose blind spots, and it is willing to open that measurement problem up to the wider community rather than solving it behind closed doors.



Topic: Google DeepMind
Disclosure: We love the products we feature and hope you’ll love them too. If you purchase through a link on our site, we may receive compensation at no additional cost to you. Read our ethics statement. Please note that pricing and availability are subject to change.

Copyright © 2026 GadgetBond. All Rights Reserved. Use of this site constitutes acceptance of our Terms of Use and Privacy Policy | Do Not Sell/Share My Personal Information.