ARC-AGI: Benchmark for Abstraction & Reasoning in AGI
- ARC-AGI is a benchmark suite featuring few-shot grid-based tasks designed to test algorithmic abstraction, reasoning, and generalization for AGI.
- It challenges models with minimal input–output examples to infer hidden transformations across variable grid sizes and complex pattern manipulations.
- The benchmark fosters research on neurosymbolic methods, program synthesis, and hybrid approaches, with performance measured by pixel-perfect accuracy and systematic generalization.
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is a benchmark suite of few-shot grid-based transformation tasks designed to evaluate core algorithmic and fluid intelligence in artificial systems. Each ARC-AGI task presents a handful of input–output demonstrations (typically 2–5) of a latent transformation, after which the system must apply the inferred mapping to one or more novel test inputs. The benchmark is explicitly constructed to require abstraction, reasoning, and generalization from minimal examples: capabilities where human intelligence excels but current AI methods remain limited. ARC-AGI, originally introduced by François Chollet in 2019 and later renamed with the “AGI” suffix, aims to measure systematic, developer-aware generalization, making it a central target for research on artificial general intelligence.
1. Problem Definition, Formalism, and Benchmark Structure
Each ARC-AGI task consists of a set of training examples $\{(x_i, y_i)\}_{i=1}^{n}$ and a test input $x_{\text{test}}$, where each grid is a function $g : \{1,\dots,H\} \times \{1,\dots,W\} \to C$, with $C$ denoting the finite palette of integer colors. The goal is to infer a (hidden) mapping $f$ such that $f(x_i) = y_i$ for all demonstrations, and to predict $f(x_{\text{test}})$ for the held-out test grid (Min, 2023).
Key challenge dimensions include:
- Extreme sample efficiency: only 1–5 demonstrations per task—requiring meta-learning or inductive bias to abstract “rules.”
- Combinatorial richness: transformations involving geometric reasoning, color logic, connectivity, symmetry, invariance, pattern continuation, and object-centric manipulations.
- Grid variability: non-uniform grid sizes (up to 30×30), flexible color palettes, and arbitrary arrangements of objects and patterns.
Task success is measured by pixel-perfect match—any mismatch is scored as a total failure.
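To make the task structure and scoring criterion concrete, the following minimal sketch loads a task in the public ARC JSON format (`train`/`test` lists of `input`/`output` grids) and scores a candidate solver under the all-or-nothing pixel criterion; the file path and the identity “solver” are placeholders.

```python
import json
from typing import Callable

Grid = list[list[int]]  # row-major grid of integer colors (typically 0-9)

def load_task(path: str) -> dict:
    """Load one ARC task: {'train': [{'input': ..., 'output': ...}, ...], 'test': [...]}."""
    with open(path) as f:
        return json.load(f)

def exact_match(pred: Grid, target: Grid) -> bool:
    """ARC scoring is all-or-nothing: shape and every pixel must agree."""
    return pred == target

def score_solver(task: dict, solver: Callable[[Grid], Grid]) -> float:
    """Fraction of test pairs solved pixel-perfectly."""
    results = [exact_match(solver(pair["input"]), pair["output"])
               for pair in task["test"]]
    return sum(results) / len(results)

if __name__ == "__main__":
    task = load_task("data/training/0a1b2c3d.json")  # hypothetical file path
    identity = lambda grid: grid                     # placeholder "solver"
    print(score_solver(task, identity))
```

Because scoring is exact-match, partial-credit schemes play no role; ensembling and voting strategies (Section 2.2) therefore operate over whole candidate grids rather than individual pixels.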
The benchmark has undergone extension and refinement, with the original ARC-AGI-1 comprising 400 public training tasks, 400 public evaluation tasks, and additional private test sets; ARC-AGI-2 introduces a rigorously calibrated suite with first-party human baselines and increased cognitive complexity, intended to advance the field beyond brute-force or training-set exploitation (Chollet et al., 17 May 2025).
2. Architectures and Algorithmic Approaches
2.1 Symbolic and Program Synthesis Methods
A dominant line of attack frames ARC-AGI as a program synthesis problem: searching for a program $\rho$ in a human-crafted or learned DSL such that $\rho(x_i) = y_i$ for all demonstrations, and then applying $\rho$ to the held-out test inputs (Bober-Irizar et al., 5 Feb 2024, Alford et al., 2021, Xu et al., 2022, Rocha et al., 10 May 2024, Lei et al., 15 Jan 2024). Several variants exist (a minimal search-loop sketch follows this list):
- Relational DSLs and Graph Abstractions: Object-centric graph representations, where connected components, attributes, and relations are extracted, and operations are synthesized via a small program space with abstraction and constraint-driven pruning (e.g., ARGA (Xu et al., 2022)).
- Inductive Logic Programming (ILP): Direct program induction over compact, interpretable logic clauses targeting object-centric primitives (lines, rectangles, translations); achieves strong compositionality and developer-aware generalization (Rocha et al., 10 May 2024).
- Generalized Planning: Task-solving as the synthesis of planning programs in PDDL with pointer-based variable binding and domain-specific abstraction predicates, demonstrating superior performance on object-centric ARC-AGI tasks (Lei et al., 15 Jan 2024).
- Neurosymbolic “Wake-Sleep” Synthesis: DreamCoder-style alternation between program enumeration, library compression, and neural recognition-grammar learning; automatically expands primitive sets and supports bootstrapping of higher-level abstractions (Bober-Irizar et al., 5 Feb 2024, Alford et al., 2021).
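To illustrate the search framing itself (not any of the cited DSLs; the primitive set below is a toy, hypothetical one), the sketch enumerates compositions of grid operations up to a fixed depth and returns the first program consistent with every demonstration.

```python
from itertools import product
import numpy as np

# Toy DSL: each primitive maps a grid (np.ndarray) to a grid.
PRIMITIVES = {
    "identity":  lambda g: g,
    "rot90":     lambda g: np.rot90(g),
    "flip_h":    lambda g: np.fliplr(g),
    "flip_v":    lambda g: np.flipud(g),
    "transpose": lambda g: g.T,
}

def run(program, grid):
    """Apply a sequence of primitive names left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def synthesize(demos, max_depth=3):
    """Return the first program consistent with all demonstrations, else None."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(np.array_equal(run(program, x), y) for x, y in demos):
                return program
    return None

# demos: list of (input, output) arrays taken from a task's train pairs.
demos = [(np.array([[1, 0], [0, 0]]), np.array([[0, 1], [0, 0]]))]
print(synthesize(demos))  # -> ('flip_h',) for this toy demonstration
```

Real systems replace this blind enumeration with abstraction-guided pruning, constraint propagation, or a learned recognition model over a far richer primitive set.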
2.2 Deep Learning and Transformer Approaches
Recent years have seen the emergence of LLM-based and neural methods as competitive approaches:
- Transformer-based Grid Modeling: Direct end-to-end training, including pretraining on code/reasoning datasets, followed by test-time adaptation and inference via grid-to-grid sequence modeling (e.g., LongT5 architectures) (Cole et al., 17 Jun 2025).
- Test-Time Training and Ensembling: Treating both the neural network (parameters) and the optimizer as adaptive inference components, including Test-Time Fine-Tuning (TTFT, on synthetic puzzles) and ensembling via dihedral augmentation and majority voting (AIRV), yielding state-of-the-art performance under practical compute constraints (Cole et al., 17 Jun 2025); a minimal augment-and-vote sketch follows this list.
- Product of Experts (PoE) via LLMs: Combining multiple views of a single ARC puzzle via task-specific augmentations (D8 symmetries, color permutations, example orderings), with DFS-based candidate search and log-pooling for solution selection, leading to open-source systems with 71.6% two-guess accuracy on public ARC-AGI (Franzen et al., 8 May 2025).
- Program Synthesis with LLMs: Multi-agent systems where LLMs are prompted (or instructed) to synthesize code from multiple abstraction views (grid, object, pixel-centric representations), with iterative feedback and context-grounding (Tan et al., 2023, Min, 2023).
- Vision-Language Modularization: Decomposition into vision-centric rule summarization and language-based rule execution, with cross-modal self-correction and verification stages (Zhang et al., 19 Nov 2025).
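The augment-and-vote idea shared by AIRV-style ensembling and, in spirit, the PoE pipeline can be sketched as follows; `model` is a placeholder for any grid-to-grid predictor, and whole-grid majority voting stands in for the log-pooled candidate scoring used by the cited systems.

```python
from collections import Counter
import numpy as np

def dihedral_views(grid: np.ndarray):
    """The 8 symmetries of the square (D8): 4 rotations x optional mirror."""
    for k in range(4):
        rot = np.rot90(grid, k)
        yield (k, False), rot
        yield (k, True), np.fliplr(rot)

def invert_view(grid: np.ndarray, tag) -> np.ndarray:
    """Map a prediction made in an augmented frame back to the original frame."""
    k, mirrored = tag
    if mirrored:
        grid = np.fliplr(grid)
    return np.rot90(grid, -k)

def augment_and_vote(model, test_input: np.ndarray) -> np.ndarray:
    """Run the model on all 8 views, undo each view, majority-vote whole grids."""
    candidates = []
    for tag, view in dihedral_views(test_input):
        pred = model(view)                      # placeholder model call
        candidates.append(invert_view(pred, tag))
    counts = Counter(tuple(map(tuple, c)) for c in candidates)
    best, _ = counts.most_common(1)[0]
    return np.array(best)

# Usage: final_grid = augment_and_vote(my_model, np.array(test_grid))
```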
2.3 Neurosymbolic and Vector Symbolic Algebra Methods
Some approaches seek to integrate fast, heuristic “System 1” processes with slow, symbolic “System 2” reasoning by:
- Vector Symbolic Algebras (VSAs): Encoding object properties as composable high-dimensional vectors (e.g., Holographic Reduced Representations with Spatial Semantic Pointers), guiding small neural modules to perform object matching, rule induction, and parameter prediction within a compact, interpretable symbolic program space (Joffe et al., 11 Nov 2025).
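A minimal illustration of the underlying VSA machinery, using circular-convolution binding as in Holographic Reduced Representations (the roles and fillers below are hypothetical, not the cited system's vocabulary):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1024  # hypervector dimensionality

def hrr_vec():
    """Random unit-norm hypervector."""
    v = rng.normal(size=D)
    return v / np.linalg.norm(v)

def bind(a, b):
    """Circular convolution: binds a role to a filler."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(trace, a):
    """Circular correlation: approximate inverse binding for role a."""
    return np.real(np.fft.ifft(np.fft.fft(trace) * np.conj(np.fft.fft(a))))

# Hypothetical roles/fillers for one grid object: COLOR=red, SHAPE=square.
COLOR, SHAPE = hrr_vec(), hrr_vec()
red, square = hrr_vec(), hrr_vec()
obj = bind(COLOR, red) + bind(SHAPE, square)   # superpose role-filler bindings

# Query the object's color: similarity to 'red' dominates.
query = unbind(obj, COLOR)
print(np.dot(query, red), np.dot(query, square))  # high vs. near zero
```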
2.4 Neural Cellular Automata (NCA) and Developmental Systems
Grid-based self-organizing architectures train local, differentiable update rules per cell, unrolling them for a fixed number of steps per ARC input (a minimal update-rule sketch follows the list below). Variations include:
- Vanilla Neural Cellular Automata: Per-grid, per-task NCA trained from scratch, achieving nontrivial solution rates (up to 13.37% exact on feasible tasks) for shape extraction, pattern propagation, and scale-invariant pattern generation (Xu et al., 18 Jun 2025).
- Developmental/EngramNCA: Extension with explicit local memory per cell and multihead update modules (GeneCA, GenePropCA), allowing the encoding and propagation of abstract pattern primitives, with solve rates competitive with LLMs at orders of magnitude lower inference cost (Guichard et al., 13 May 2025).
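A minimal per-task NCA step, assuming a small convolutional update shared across all cells and unrolled for a fixed number of iterations; the channel counts, hyperparameters, and random placeholder grids are illustrative, not those of the cited systems.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNCA(nn.Module):
    """One shared local update rule, applied to every cell in parallel."""
    def __init__(self, channels: int = 16, hidden: int = 64):
        super().__init__()
        self.perceive = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        self.update = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, state: torch.Tensor, steps: int = 20) -> torch.Tensor:
        # state: (batch, channels, H, W); the first channels encode colors
        # one-hot, the remaining channels act as per-cell hidden memory.
        for _ in range(steps):
            state = state + self.update(torch.relu(self.perceive(state)))
        return state

# Per-task training sketch: fit the rule to the demonstration pairs only.
nca = TinyNCA()
opt = torch.optim.Adam(nca.parameters(), lr=1e-3)
x = torch.rand(1, 16, 10, 10)   # placeholder encoded input grid
y = torch.rand(1, 16, 10, 10)   # placeholder encoded target grid
for _ in range(100):
    loss = F.mse_loss(nca(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```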
3. Knowledge Augmentation and Human Priors
The difficulty of inducing abstractions from extremely limited data motivates augmenting models with explicit prior knowledge (a prompt-construction sketch follows the list below):
- Knowledge Augmentation for Abstract Reasoning (KAAR): Systematic injection of domain ontologies into LLM prompting, staged by levels of object-partitioning, geometric/topological priors, and abstract action schemas, with performance gains of up to ~5% absolute, especially in movement and composition tasks (Lei et al., 23 May 2025).
- Prompt Engineering with Human Priors: Encoding common strategies (symmetry, Gestalt, object permanence) as explicit action bullet-lists or in-context cues, with chain-of-thought to guide decomposition and error correction (Min, 2023, Tan et al., 2023).
- Visual Question Answering (VQA) Integration: Bridging raw pixel grids to textual object descriptions using specialized modules during multi-agent LLM reasoning (Min, 2023).
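A minimal prompt-construction sketch in the spirit of these approaches: grids are serialized to text and staged prior knowledge is prepended as explicit cues. The prior statements and layout are illustrative, not the prompts used by the cited systems.

```python
Grid = list[list[int]]

# Illustrative, hypothetical prior statements staged from objects to actions.
PRIORS = [
    "Objects: treat connected cells of the same color as single objects.",
    "Geometry: consider rotations, reflections, translations, and symmetry.",
    "Actions: objects may be moved, recolored, copied, or completed.",
]

def grid_to_text(grid: Grid) -> str:
    """Serialize a grid row by row as space-separated color indices."""
    return "\n".join(" ".join(str(c) for c in row) for row in grid)

def build_prompt(train_pairs: list[tuple[Grid, Grid]], test_input: Grid) -> str:
    parts = ["You are solving an ARC puzzle.", "Prior knowledge:"]
    parts += [f"- {p}" for p in PRIORS]
    for i, (x, y) in enumerate(train_pairs, 1):
        parts += [f"Example {i} input:", grid_to_text(x),
                  f"Example {i} output:", grid_to_text(y)]
    parts += ["Test input:", grid_to_text(test_input),
              "Think step by step, then give the output grid only."]
    return "\n".join(parts)

print(build_prompt([([[1, 1], [0, 0]], [[0, 0], [1, 1]])], [[2, 2], [0, 0]]))
```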
4. Empirical Results, Performance, and Limitations
Recent open-source and closed-source ARC-AGI systems report:
| Approach | ARC-AGI-1 accuracy (split noted per row) | Notes |
|---|---|---|
| OpenAI GPT o3 (closed) | 82.8% (public) | Proprietary, not reproducible |
| Product-of-Experts LLM | 71.6% (public) | DFS + 16× augmentation, NeMo-8B |
| Test-Time FT + AIRV (LongT5) | 58% (private) | State-of-the-art, no internet |
| LLM Multi-agent (CoT+VQA) | Up to 50/111 (45%) | On context-eligible tasks |
| Individual NCA / EngramNCA | 10–13% (262 tasks) | Union up to 17% (size-constant) |
| CompressARC (no pretraining) | 20% (public), 34.75% (train) | Pure MDL, 76k param, per-puzzle fit |
LLMs perform best when equipped with augmentation, test-time adaptation, and prompt engineering. Symbolic systems remain the strongest on tasks requiring precise geometric manipulation or novel compositions. NCA methods and MDL-driven architectures show surprising generalization on a subset of structured tasks at minimal computational cost (Xu et al., 18 Jun 2025, Guichard et al., 13 May 2025, Liao et al., 5 Dec 2025). No current method matches human-level sample efficiency or generality.
Empirical analyses show that LLM failures are dominated by surface-level copy and matrix-wise “union” errors, in contrast to humans' predominantly conceptual failures. Human performance on child-friendly subcorpora (KidsARC) outpaces the best LLMs by a wide margin, even for children aged 6–8 (Opiełka et al., 13 Mar 2024).
5. Human Reasoning, Datasets, and Alignment
The study of how humans solve ARC-AGI is advancing via trajectory datasets such as ARCTraj, which logs fine-grained, object-centric human actions across all 400 ARC-AGI-1 tasks (Kim et al., 14 Nov 2025). These traces provide temporally ordered Markov Decision Process (MDP) representations, enabling the application of behavior cloning, reinforcement learning (PPO), world models (DreamerV3), GFlowNets, diffusion-augmented RL, and Decision Transformers for sequence modeling.
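A minimal sketch of how such traces can be cast as MDP transitions for downstream learners such as behavior cloning; the field names and action encoding are hypothetical, not ARCTraj's actual schema.

```python
from dataclasses import dataclass

Grid = list[list[int]]

@dataclass
class Transition:
    """One logged human step, viewed as an MDP transition (s, a, s')."""
    state: Grid       # working grid before the action
    action: dict      # e.g. {"op": "paint", "cells": [(2, 3)], "color": 4}
    next_state: Grid  # working grid after the action
    done: bool        # True once the user submits an answer

def to_transitions(trajectory: list[dict]) -> list[Transition]:
    """Convert a raw logged trajectory (a list of events) into transitions."""
    out = []
    for t, event in enumerate(trajectory):
        out.append(Transition(
            state=event["grid_before"],
            action=event["action"],
            next_state=event["grid_after"],
            done=(t == len(trajectory) - 1),
        ))
    return out

# Behavior cloning then reduces to supervised learning of action | state,
# e.g. cross-entropy over a discretized action space on these transitions.
```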
Empirical insights from ARCTraj include:
- Spatial selection is highly focused and hierarchical (1×1, 2×2, 3×3) rather than global.
- Color usage reflects a reliance on test-input distribution over demonstration-only colors.
- Object-level reasoning converges on a small number of strategic “intention clusters.”
These findings position ARCTraj and similar datasets as key for training explainable, aligned models capable of mimicking or extending human-like problem decomposition.
6. Advances, Benchmark Extensions, and Future Directions
ARC-AGI-2 introduces a calibrated suite with greater compositional depth; task difficulty is tuned to human performance, with the top AI models as of 2025 scoring under 5%, and humans at ≈75% average (Chollet et al., 17 May 2025). Machine–human performance gaps remain vast, particularly in multi-step and compositional reasoning.
Methodological directions highlighted for future progress:
- Retrieval-augmented reasoning and memory utilization (for continual learning and faster adaptation) (Min, 2023, Zhang et al., 19 Nov 2025).
- Unified vision-language architectures, with strategic modality routing (e.g., Vision-Language Synergy Reasoning and Modality-Switch Self-Correction), shown to yield incremental but systematic performance improvements (Zhang et al., 19 Nov 2025).
- Learning structure via unsupervised MDL or other information-theoretic objectives rather than heavy pretraining (Liao et al., 5 Dec 2025); a generic two-part MDL objective is sketched after this list.
- Integration of fast, local System 1–style heuristics with slow, compositional System 2 program synthesis in hybrid neurosymbolic architectures (Joffe et al., 11 Nov 2025).
- Expanding DSLs and planning models to support unbounded looping, recursion, higher-order primitive induction, and context-sensitive rule extraction (Rocha et al., 10 May 2024, Lei et al., 15 Jan 2024).
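For orientation, the generic two-part MDL objective underlying such approaches (an illustrative formulation, not necessarily CompressARC's exact loss) is

$$ h^{*} = \arg\min_{h \in \mathcal{H}} \Big[ L(h) + \sum_{i=1}^{n} L\big((x_i, y_i) \mid h\big) \Big], $$

where $L(h)$ is the number of bits needed to describe the hypothesis and $L((x_i, y_i) \mid h)$ the bits needed to encode each demonstration given it; per-puzzle fitting then amounts to minimizing this total code length on the task's own demonstrations, with no external pretraining.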
Significant open challenges persist: robust abstraction discovery, representation selection, scalable test-time adaptation, efficient exploration of DSLs, and closing the systematic generalization gap to humans. ARC-AGI and its successors provide a stringent, continually relevant benchmark for evaluating progress toward genuine artificial general intelligence.