ARC-AGI: Benchmarking Abstraction in AGI
- ARC-AGI is a comprehensive benchmark suite designed to evaluate human-level abstraction and compositional reasoning through unique few-shot grid transformation tasks.
- It integrates diverse methodologies including symbolic search, neuro-symbolic synthesis, and vision-to-language paradigms to diagnose abstract, in-context rule inference.
- Empirical findings reveal a stark performance gap—humans achieve 60–75% accuracy while top AI systems score below 30%—underscoring significant challenges in AGI.
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is a canonical benchmark suite designed to probe broad, fluid intelligence in artificial systems. Originating from François Chollet's ARC corpus, ARC-AGI tasks present few-shot input–output grid transformations that demand the inference of novel, symbolic rules from 2–5 demonstrations and subsequent precise generalization to unseen test cases. Since its introduction in 2019, ARC-AGI has become the de facto standard for evaluating human-level abstraction, compositional generalization, and non-trivial reasoning in both neural and symbolic machine architectures. Despite significant advances in deep learning and LLMs, ARC-AGI remains largely unsolved—humans routinely achieve 60–75% accuracy, while even high-compute AI systems achieve under 30% on the original corpus and below 5% on the more difficult ARC-AGI-2 extension. This article surveys foundational principles, solution paradigms, key empirical findings, and major methodological innovations tied to ARC-AGI, with emphasis on recent progress and outstanding challenges.
1. Principles, Structure, and Benchmarks of ARC-AGI
ARC-AGI is a generative, developer-aware few-shot reasoning suite comprising hundreds of grid transformation tasks (ARC-AGI-1: 900; ARC-AGI-2: 240–1000). Each task is defined by

$$T = \{(x_i, y_i)\}_{i=1}^{k} \cup \{x_{\text{test}}\},$$

where $(x_i, y_i)$ are demonstration pairs (typically $k \in \{2, \dots, 5\}$) and $x_{\text{test}}$ is the test input. Grids range from $1 \times 1$ up to $30 \times 30$ cells, using a palette of ten discrete colors. Task rules span an unbounded set of transformations: morphologies (object detection, shape manipulation), coloring, spatial reasoning, compositional and contextual rule application, and in-context semantic symbol definition. ARC-AGI tasks are carefully curated so that each is unique; they resist training-set memorization and brute-force enumeration. The benchmark evaluates exact-match accuracy—a task is solved only if every cell of the generated test output matches the ground truth, with 2–10 output guesses permitted depending on protocol.
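To make the task structure and scoring rule concrete, here is a minimal Python sketch; the `ArcTask` fields and attempt budget are illustrative assumptions rather than any official harness:

```python
from dataclasses import dataclass
from typing import List, Tuple

Grid = List[List[int]]  # cell values 0-9: one of ten discrete colors

@dataclass
class ArcTask:
    # T = {(x_i, y_i)}_{i=1..k} demonstrations plus a held-out test input.
    train: List[Tuple[Grid, Grid]]
    test_input: Grid
    test_output: Grid  # ground truth, hidden from the solver

def solved(task: ArcTask, guesses: List[Grid], budget: int = 2) -> bool:
    """Exact-match scoring: the task counts as solved only if some guess
    within the attempt budget matches the ground truth cell for cell."""
    return any(g == task.test_output for g in guesses[:budget])
```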
ARC-AGI-2 introduces greater complexity with multi-rule, multi-step, and context-conditioned tasks. Human testing shows that typical participants solve a majority of tasks, with no strong demographic predictors of success. AI system performance on ARC-AGI-2 is starkly lower: top systems (e.g., o3-mini, the ARChitects, GPT-4-class models) consistently score below 5% (Chollet et al., 17 May 2025).
2. Symbolic and Neuro-symbolic Program Synthesis Approaches
Early state-of-the-art solvers have relied on developer-defined Domain Specific Languages (DSLs) and various forms of symbolic search, from brute-force enumeration to constraint-aware program synthesis. Notable innovations include:
- Inductive Logic Programming (ILP): ARC grids are parsed into objects (points, lines, rectangles) with primitive relations (copy, translate, point_straight_path_to); a minimal sketch of this object-extraction step follows the list. Given a few demonstrations, ILP (using FOIL-style information gain) induces Horn-clause logic programs that generalize across input–output pairs. Human-level abstraction is achieved when learned clauses rely only on object-centric properties rather than pixel coordinates. Empirically, well-engineered ILP systems solve nontrivial tasks in seconds, with full interpretability (Rocha et al., 10 May 2024).
- Graph-based Relational DSLs (ARGA): Images are converted to graphs (nodes = object abstractions; edges = spatial or color relations). Relational filters, parametric bindings, and transformations are synthesized via search. Constraint acquisition, hashing, Tabu abstraction switching, and best-first heuristics prune the exponential search space. On 160 ARC tasks, ARGA achieves competitive test accuracy while expanding three orders of magnitude fewer search nodes than top brute-force solvers (Xu et al., 2022).
- Generalized Planning (GPAR, PDDL-based): ARC problems are lifted to the Planning Domain Definition Language, with problem variables and external object-centric functions. Program synthesis proceeds using pointer-based looping, stringent domain-knowledge pruning, and novelty constraints. GPAR demonstrates improved train–test generalization on 160 object-centric tasks over previous symbolic frameworks (Lei et al., 15 Jan 2024).
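All three pipelines above begin by parsing raw grids into objects. A minimal sketch of that shared first step, assuming 4-connected components of same-colored, non-background cells (real systems add shape classification and richer relations):

```python
from collections import deque

def extract_objects(grid, background=0):
    """Breadth-first extraction of 4-connected, same-colored components."""
    h, w = len(grid), len(grid[0])
    seen, objects = set(), []
    for r in range(h):
        for c in range(w):
            if (r, c) in seen or grid[r][c] == background:
                continue
            color, cells, queue = grid[r][c], [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and (ny, nx) not in seen and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            objects.append({"color": color, "cells": cells})
    return objects
```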
3. Visuo-linguistic and LLM Paradigms
Recent studies have shifted from pure program enumeration to exploiting large-scale neural priors via vision-to-language (V2L), neuro-symbolic, and LLM-centric pipelines:
- Vision-to-Language (V2L) Reasoning: ARC grids are encoded into natural-language descriptions via a deterministic encoder combining CNN-based background-color detection, connected-component analysis, shape-type classification, and object-attribute extraction. Pre-trained LLMs, applied zero-shot (e.g., GPT-3, Bloom), receive these descriptions as prompts and generate output-grid descriptions, which deterministic language-to-vision decoders parse back into grids. Accuracies grow monotonically with model size, with the largest GPT-3 variants notably solving certain ARC tasks unsolved by top program-search solvers. This paradigm demonstrates that learned linguistic priors can drive non-trivial visual reasoning and opens the possibility of unified visuo-linguistic AGI systems (a sketch of such an encoder follows this list) (Camposampiero et al., 2023).
- Product-of-Experts with LLMs: Here, each "perspective" (a data augmentation—rotation, reflection, color permutation, input reordering) forms an LLM scoring expert. Candidate solutions are generated via probabilistic depth-first search (DFS) and scored using the geometric mean of their likelihoods across the $n$ augmented views:

$$\text{score}(y) = \left( \prod_{j=1}^{n} p_{\text{LLM}}\big(y \mid a_j(x)\big) \right)^{1/n},$$

where $a_j$ denotes the $j$-th augmentation, yielding ensemble-based selection that penalizes candidates that any perspective finds implausible (see the scoring sketch after this list). Task-aware augmentations, with the LLM acting as both generator and scorer, push accuracy to 71.6% (286.5/400 tasks), establishing a new state of the art at extremely low computational cost (Franzen et al., 8 May 2025).
- Neuro-symbolic Solvers: Transformer-based proposal generators focus and prune DSL search directions, dramatically shrinking combinatorial complexity. Test-time adaptation with generated synthetic tasks further boosts the probability of proposal hits. These models demonstrate 27% gains over prior state-of-the-art LLM-DSL hybrids (Batorski et al., 8 Jan 2025).
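A hedged sketch of the deterministic grid-to-text step, reusing the `extract_objects` parser from the Section 2 sketch; the description format is an illustrative assumption, not the paper's exact encoder:

```python
def grid_to_text(grid, background=0):
    """Render a grid as a clause-per-object natural-language description."""
    parts = [f"a grid of {len(grid)} rows and {len(grid[0])} columns"]
    for obj in extract_objects(grid, background):
        rows = [r for r, _ in obj["cells"]]
        cols = [c for _, c in obj["cells"]]
        parts.append(
            f"an object of color {obj['color']} spanning rows "
            f"{min(rows)}-{max(rows)} and columns {min(cols)}-{max(cols)}"
        )
    return "; ".join(parts)
```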
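And a sketch of the product-of-experts scoring rule itself, where `loglik_fn` is a hypothetical stand-in for any LLM log-likelihood interface:

```python
import math

def poe_score(loglik_fn, candidate, task, augmentations):
    """Geometric mean of candidate likelihoods across augmented views:
    exp of the mean log-probability equals the n-th root of the product."""
    logps = [loglik_fn(candidate, aug(task)) for aug in augmentations]
    return math.exp(sum(logps) / len(logps))

def select(candidates, loglik_fn, task, augmentations):
    # A candidate that any single perspective finds implausible scores low.
    return max(candidates,
               key=lambda y: poe_score(loglik_fn, y, task, augmentations))
```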
4. Human-centered and Imitation Learning Perspectives
Recent empirical work has focused on understanding and replicating human reasoning in ARC-AGI:
- ARCTraj Dataset: Over 10,000 human reasoning trajectories are recorded and abstracted into symbolic action schemas (selection, coloring, object-oriented manipulations) and Markov Decision Process (MDP) states; a behavioral-cloning sketch follows this list. Downstream integration with behavioral cloning, PPO, world models, GFlowNets, and Decision Transformers enables explainable, aligned, and generalizable policy learning, achieving 38–100% task success depending on the method (Kim et al., 14 Nov 2025).
- Object-Centric Decision Transformers: Imitation learning on expert traces, coupled with object-centric clustering (the Push-and-Pull algorithm), allows Transformer models to sequence actions in a human-like fashion for composite ARC tasks. Object-centric inputs significantly boost performance and transfer across unseen instance grids (Park et al., 2023).
- Language-Complete ARC (LARC): Human participants communicate ARC solutions through natural language "programs," revealing meta-communicative scaffolding (framing, validation, analogy) absent from classical DSLs. These natural programs include domain-specific primitives and non-executable instructions, presenting challenges for current synthesis engines and motivating future neuro-symbolic systems with adaptive DSL induction and constraint-based multi-channel modeling (Acquaviva et al., 2021).
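As referenced above, a minimal behavioral-cloning sketch over recorded trajectories; the `Action` schema loosely follows ARCTraj's described categories, and all field names are assumptions:

```python
import copy
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Action:
    kind: str                      # e.g. "select", "color", "move_object"
    cells: List[Tuple[int, int]]   # affected grid coordinates
    color: Optional[int] = None    # target color for coloring actions

def apply(grid: List[List[int]], act: Action) -> List[List[int]]:
    """Minimal MDP transition; only coloring is implemented in this sketch."""
    nxt = copy.deepcopy(grid)
    if act.kind == "color" and act.color is not None:
        for r, c in act.cells:
            nxt[r][c] = act.color
    return nxt

def to_bc_pairs(start, actions):
    """Unroll a human trajectory into (state, action) supervision pairs."""
    pairs, state = [], start
    for act in actions:
        pairs.append((state, act))
        state = apply(state, act)
    return pairs
```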
5. Neural, Developmental, and VSA-based Models
Emerging paradigms exploit the self-organizing capacity of neural and neurosymbolic models:
- Neural Cellular Automata (NCA, ARC-NCA): Differentiable cellular automata, with local convolutional update rules and memory augmentations (EngramNCA), are trained per task to grow grid transformations (a minimal update-rule sketch follows this list). With 50-channel local states plus hidden (memory) states, these models achieve meaningful union solve rates at several orders of magnitude lower cost than LLMs, but generalization is fragile and limited to tasks amenable to local interactions. Developmental principles—iterative refinement, memory separation, pattern emergence—underpin new forms of reasoning (Guichard et al., 13 May 2025; Xu et al., 18 Jun 2025).
- Vector Symbolic Algebra (VSA): VSAs, relying on holographic reduced representations and slot-filler binding, encode abstract objects and guide object-centric program synthesis; a binding sketch follows this list. By integrating fast, associative heuristics ("System 1") with deliberate rule induction ("System 2"), sample-efficient and interpretable solutions are achieved on ARC-AGI-1-Train and the simplified Sort-of-ARC benchmark, with cognitive plausibility and extensibility to broader cognitive tasks (Joffe et al., 11 Nov 2025).
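A minimal differentiable cellular-automaton update in PyTorch; the channel and layer sizes are assumptions loosely echoing the 50-channel setup above, not ARC-NCA's exact architecture:

```python
import torch
import torch.nn as nn

class MinimalNCA(nn.Module):
    """Local 3x3 perception followed by a per-cell residual state update."""
    def __init__(self, channels: int = 50, hidden: int = 128):
        super().__init__()
        self.perceive = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        self.update = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, state: torch.Tensor, steps: int = 20) -> torch.Tensor:
        # Iterating the local rule lets global structure emerge gradually.
        for _ in range(steps):
            state = state + self.update(torch.relu(self.perceive(state)))
        return state
```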
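And the slot-filler binding at the heart of holographic reduced representations, shown directly: circular convolution binds a role to a filler, and the involution-based approximate inverse recovers it (a numpy sketch under standard HRR assumptions):

```python
import numpy as np

def bind(a, b):
    # Circular convolution: the classic HRR binding operator.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):
    # Approximate inverse: bind with the involution of a (noisy recovery).
    a_inv = np.concatenate(([a[0]], a[1:][::-1]))
    return bind(c, a_inv)

d = 2048
rng = np.random.default_rng(0)
role, filler = rng.normal(0.0, 1.0 / np.sqrt(d), (2, d))
trace = bind(role, filler)        # slot-filler pair, e.g. color := blue
recovered = unbind(trace, role)   # approximately equals filler
cos = recovered @ filler / (np.linalg.norm(recovered) * np.linalg.norm(filler))
print(f"cosine similarity to filler: {cos:.2f}")  # well above chance
```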
6. Scaling, Evaluation, and Future Directions
- ARC-AGI-2: This modern extension introduces difficulty-calibrated, compositional, multi-context, and symbol-definition tasks, verified by large-scale human trials with over 400 participants. No current AI system solves more than a few percent of ARC-AGI-2 tasks. Aggregate human and AI test results demonstrate profound gaps in abstraction, contextualization, dynamic chaining, and non-literal symbol inference (Chollet et al., 17 May 2025).
- Scaling Deep Learning Methods: Fully neural models, especially with aggressive test-time fine-tuning and voting over spatial augmentations (AIRV, TTFT), set new records, provided both the optimizer and network are allowed to adapt dynamically during inference (a voting sketch follows this list). Pre-training on ARC-like data and dynamic support-set construction are central to robust generalization (Cole et al., 17 Jun 2025).
- Practical ensemble strategies: LLMs, DreamCoder, and hand-coded DSLs solve disjoint subsets of tasks; simple voting ensembles—e.g., combining DreamCoder, GPT-4, and Icecuber on ARC-Easy—consistently outperform any single method (Bober-Irizar et al., 5 Feb 2024).
- Ontological Knowledge Augmentation: Stage-wise expansion of core knowledge priors—objectness, geometry/topology, counting, and goal-directedness—substantially improves LLM generalization, driving large relative gains with repeated-sampling, planning-aided code generation. Dynamic selection and integration of priors, and deeper alignment between LLM reasoning steps and ontological stages, remain open challenges (Lei et al., 23 May 2025).
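As referenced above, a sketch of the augmentation-and-vote inference loop; `predict_fn` is a hypothetical grid-to-grid model interface, and the eight views are the dihedral symmetries of the square:

```python
from collections import Counter
import numpy as np

def vote_over_augmentations(predict_fn, grid: np.ndarray) -> np.ndarray:
    """Predict under each of the 8 square symmetries, map every prediction
    back to the canonical frame, and return the majority-vote grid."""
    candidates = []
    for flip in (False, True):
        g = np.fliplr(grid) if flip else grid
        for k in range(4):
            pred = predict_fn(np.rot90(g, k))  # predict on transformed input
            pred = np.rot90(pred, -k)          # undo the rotation
            if flip:
                pred = np.fliplr(pred)         # undo the reflection
            candidates.append(pred)
    counts = Counter(tuple(map(tuple, c)) for c in candidates)
    best, _ = counts.most_common(1)[0]
    return np.array(best)
```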
7. Interpretability, Limitations, and AGI Implications
ARC-AGI exposes distinct strengths and trade-offs among solution paradigms:
- Interpretability: Symbolic and graph-centric methods naturally express solution logic and enable stepwise explanations, matching human expectations for verifiable mid-process reasoning (Ferré, 2021; Lim et al., 27 Nov 2024).
- Data Efficiency and Generalization: Methods that leverage explicit object-centric backgrounds, symbolic priors, and few-shot reasoning demonstrate scaling advantages over end-to-end neural approaches, but are challenged by task heterogeneity and non-local dependencies.
- Modularity and Future Integration: Modular architectures combining vision, language, reasoning, and planning modules, with plug-and-play improvements in encoding, augmentation, and neuro-symbolic proposal generation, suggest routes toward scalable, unified AGI (Camposampiero et al., 2023, Batorski et al., 8 Jan 2025).
- Limits and Prospects: Brute-force search and extensive hand-coded priors are bottlenecked by combinatorial explosion and limited developer scalability. Deep learning methods are susceptible to overfitting and inference failures on novel compositions unless dynamically optimized. Developmental and VSA models highlight the promise and sample efficiency of neurosymbolic cognition, but require improved generalization to carry forward into broad AGI reasoning.
ARC-AGI and its modern extension (ARC-AGI-2) remain uniquely valuable for benchmarking abstract visual reasoning, compositional cognitive architecture design, and measurement of progress towards general, developer-aware machine intelligence. Continued research will require synergistic integration of symbolic, neural, and human-inspired imitation learning approaches, underpinned by large-scale public datasets and transparent evaluation.