ARC-AGI-1 Benchmark for AGI Evaluation
- ARC-AGI-1 Benchmark is a testbed for AGI systems focusing on few-shot rule induction and abstract reasoning using hand-crafted grid tasks.
- It employs both textual and visual encodings with metrics like Pass@1 to assess performance across 1,000 diverse tasks.
- The benchmark drives advances in program synthesis, vision-language synergy, and cognitive modeling by challenging out-of-distribution generalization.
The ARC-AGI-1 Benchmark is a rigorous testbed for evaluating artificial general intelligence (AGI) systems’ ability to perform abstract reasoning and rule induction from minimal examples. It comprises diverse, hand-crafted grid-based tasks designed to mirror “learning how to learn” in humans, specifically focusing on few-shot conceptual rule induction and transfer to novel problems. ARC-AGI-1 tasks have become central to measuring out-of-distribution generalization, compositional reasoning, and cognitive plausibility for frontier foundation models (Zhang et al., 19 Nov 2025).
1. Formal Definition and Objectives
ARC-AGI-1 presents each solver with a set of tasks $\{T_n\}_{n=1}^{N}$, each defined by input–output training pairs and a hidden test input:
- Each grid $x \in \mathcal{G}$ has size $h \times w$ with $1 \le h, w \le 30$ and cell values $x_{ij} \in \{0, 1, \dots, 9\}$.
- Task $T_n$ consists of demonstration pairs $\{(x_i, y_i)\}_{i=1}^{k}$ and one test input $x_{\text{test}}$.
- The underlying transformation is always a deterministic rule $f: \mathcal{G} \to \mathcal{G}$ with $y_i = f(x_i)$, which must be inferred from a handful of examples.
- Modalities:
  - Textual encoding: $\tau: \mathcal{G} \to \Sigma^{*}$, serializing a grid as a token sequence (e.g., row-wise digit strings with row delimiters).
  - Visual encoding: $\nu: \mathcal{G} \to \mathcal{I}$, rendering a grid as a color-coded image.
- Both $\tau$ and $\nu$ are invertible, so predicted encodings can be decoded back to grids.
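As a concrete illustration, the sketch below shows one plausible invertible textual serialization (row-wise digit strings joined by a delimiter); the exact prompt format varies across papers, so this particular encoding is an illustrative assumption rather than any paper's specification.

```python
def encode_grid(grid):
    """Serialize a grid (list of rows of ints 0-9) as a row-delimited digit string."""
    return "|".join("".join(str(c) for c in row) for row in grid)

def decode_grid(text):
    """Invert encode_grid, recovering the integer grid from its serialization."""
    return [[int(c) for c in row] for row in text.split("|")]

# Round-trip check: the encoding is lossless (invertible).
g = [[0, 1, 2], [3, 4, 5]]
assert decode_grid(encode_grid(g)) == g
```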
The principal evaluation metric is exact-match (Pass@1) accuracy over a holdout set:

$$\mathrm{Pass@1} = \frac{1}{N}\sum_{n=1}^{N}\mathbb{1}\!\left[\hat{y}_{\text{test}}^{(n)} = y_{\text{test}}^{(n)}\right]$$

This assesses whether the model's predicted output grid exactly matches the ground truth for each test input, averaged over the $N$ tasks.
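A minimal sketch of exact-match scoring over the public JSON task format (each task file holds "train" and "test" lists of {"input", "output"} grids); `solver` is a hypothetical callable that returns a single predicted grid.

```python
import json

def pass_at_1(task_paths, solver):
    """Fraction of tasks whose first predicted test output exactly matches ground truth."""
    solved = 0
    for path in task_paths:
        with open(path) as f:
            task = json.load(f)
        train_pairs = task["train"]        # list of {"input": grid, "output": grid}
        test_case = task["test"][0]        # first test pair (some tasks have more than one)
        prediction = solver(train_pairs, test_case["input"])
        solved += int(prediction == test_case["output"])
    return solved / len(task_paths)
```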
2. Benchmark Structure, Modalities, and Metrics
ARC-AGI-1 contains 1,000 total tasks: 400 public training tasks, 400 public evaluation tasks, and 200 private evaluation tasks (split into semi-private and fully private sets), each requiring solvers to generalize across transformations such as symmetries, counting, object manipulation, and spatial relations (Chollet et al., 5 Dec 2024).
- Tasks share only high-level structure; no two tasks use the same transformation logic.
- Input/output grids are either serialized as text sequences or rendered as color-coded images.
- Base metrics:
- Pass@1 (single guess per test)
- Two-guess accuracy (for competitions): proportion of tasks solved in up to two guesses.
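Under the competition's two-guess rule, a task counts as solved if either of two submitted attempts matches the ground truth exactly. A minimal sketch, assuming a hypothetical `solver` that returns up to two candidate grids:

```python
def two_guess_accuracy(tasks, solver):
    """Proportion of tasks solved within at most two guesses per test input."""
    solved = 0
    for train_pairs, test_input, test_output in tasks:
        guesses = solver(train_pairs, test_input)[:2]   # keep at most two candidates
        solved += int(any(g == test_output for g in guesses))
    return solved / len(tasks)
```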
Table: Task Split and Metrics
| Split | # Tasks | Main Metric |
|---|---|---|
| Train | 400 | Pass@1 |
| Public Eval | 400 | Pass@1, 2-guess |
| Private Eval | 200 | Pass@1 |
Significance: ARC-AGI-1 is specifically designed to require generalization on previously unseen rules, precluding brute-force memorization and stressing abstract reasoning (Pfister et al., 13 Jan 2025).
3. Algorithmic Paradigms and Synergistic Strategies
ARC-AGI-1 has catalyzed the development of several advanced solver paradigms.
Program Synthesis and Evolutionary Search
- Deep learning-guided program synthesis uses code LLMs to generate candidate programs in small DSLs, which are executed and filtered against the training examples, as sketched after this list (Chollet et al., 5 Dec 2024, Pourcel et al., 10 Jul 2025).
- SOAR (Self-Improving Operators for Automated program Refinements) iteratively fine-tunes LLMs via hindsight learning from search traces, enabling dramatic gains in solve rates (up to 52% with ensembles) (Pourcel et al., 10 Jul 2025).
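The generate-execute-filter loop common to these systems can be sketched as follows: sample candidate programs from a code model, run each on the training inputs, and keep only those that reproduce every training output. Function names here (e.g., `propose_programs`) are illustrative assumptions, not the papers' APIs.

```python
def synthesize(train_pairs, propose_programs, n_candidates=128):
    """Return candidate programs (callables) consistent with all training pairs."""
    consistent = []
    for program in propose_programs(train_pairs, n_candidates):  # e.g., LLM-sampled DSL/Python functions
        try:
            if all(program(x) == y for x, y in train_pairs):
                consistent.append(program)
        except Exception:
            continue  # discard programs that crash on any training input
    return consistent

# A solver then applies surviving programs to the hidden test input;
# disagreements among survivors can be resolved by majority vote.
```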
Vision-Language Synergy and Self-Correction
Vision-Language Synergy Reasoning (VLSR) and Modality-Switch Self-Correction (MSSC) are two synergistic inference strategies (Zhang et al., 19 Nov 2025):
- VLSR: Decomposes reasoning into Phase 1 (visual rule summarization) and Phase 2 (textual rule application), using vision for holistic pattern abstraction and language for element-wise rule execution.
- MSSC: Cross-modal error correction loop, critiquing text outputs via vision to overcome confirmation bias, yielding monotonic accuracy improvements.
Pipeline pseudocode:
```python
# Phase 1 (VLSR): summarize the transformation rule from the visual modality
r_pred = LVLM_vision("Summarize rule", images_in, images_out)
# Phase 2 (VLSR): apply the summarized rule in the textual modality
t_pred = LVLM_text("Apply rule", r_pred, text_grids, test_input_text)

# MSSC loop: re-render the textual prediction as an image and critique it visually
for r in range(N_max):
    predicted_image = render(t_pred)  # render() assumed: rasterize the predicted grid
    is_valid = LVLM_vision("Judge output", test_input_image, predicted_image)
    if is_valid:
        break
    feedback = "Your output breaks [spatial pattern X]. Revise."
    t_pred = LVLM_text("Apply rule with feedback", feedback, ...)
```
- VLSR + MSSC: Average gain of 4.33% over text-only baselines across major models (Zhang et al., 19 Nov 2025).
4. Empirical Results and Comparative Performance
Extensive benchmarking has established state-of-the-art and baseline scores:
- Frontier LLMs (GPT-4o, Gemini, Qwen3-VL, o4-mini): Text-only accuracy ranges from 8.25% to 42.25% (Zhang et al., 19 Nov 2025).
- VLSR: +3.02% improvement on average.
- MSSC: +1.82% improvement.
- Combined VLSR+MSSC: Up to 4.33% over baseline.
- SOAR ensemble: 52% tasks solved (Pourcel et al., 10 Jul 2025).
- Product-of-Experts with LLMs and systematic data augmentations reaches 71.6% at roughly $0.02 per task, surpassing the average human baseline (Franzen et al., 8 May 2025); a schematic sketch follows the table below.
- Best private eval scores (ARC Prize 2024): ~55.5% (Chollet et al., 5 Dec 2024).
Table: Pass@1 Accuracy for Main Approaches
| Model | Text-Only | VLSR+MSSC | SOAR Ensemble | PoE+DFS |
|---|---|---|---|---|
| GPT-4o | 8.25% | 14.50% | — | — |
| Gemini-2.5-Pro | 35.00% | 42.25% | — | — |
| o4-mini | 42.25% | 46.75% | — | — |
| Qwen3-VL | 20.25% | 22.25% | — | — |
| SOAR Ensemble | — | — | 52.00% | — |
| PoE+DFS | — | — | — | 71.6% |
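The product-of-experts idea can be sketched as scoring each candidate answer under several augmented "views" of the task (e.g., rotations) and combining the per-view model log-probabilities; the highest-scoring candidate wins. The sketch below is a schematic under those assumptions, with a hypothetical `log_prob` scoring function and task layout, not the paper's exact pipeline.

```python
import numpy as np

def rotate(grid, k):
    """Rotate a grid (list of rows) by k * 90 degrees."""
    return np.rot90(np.array(grid), k).tolist()

def poe_select(candidates, task, log_prob, rotations=(0, 1, 2, 3)):
    """Pick the candidate with the highest summed log-probability across augmented views."""
    def score(candidate):
        total = 0.0
        for k in rotations:
            aug_task = {  # apply the same rotation to every grid in the task (hypothetical layout)
                "train": [{"input": rotate(p["input"], k), "output": rotate(p["output"], k)}
                          for p in task["train"]],
                "test_input": rotate(task["test_input"], k),
            }
            total += log_prob(aug_task, rotate(candidate, k))  # one "expert" per view
        return total
    return max(candidates, key=score)
```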
5. Cognitive Modelling and Neurosymbolic Methods
Solutions informed by human cognition and symbolic reasoning have been explored:
- Vector Symbolic Algebra-based systems combine System 1 perceptual heuristics (object segmentation, fast similarity) and System 2 symbolic program inference (minimum-hitting set search, parameter induction) (Joffe et al., 11 Nov 2025).
- Task composition leverages symbolic operations (e.g., Extract, Recolour, Grow), employing Holographic Reduced Representations (HRRs) and Spatial Semantic Pointers (SSPs) for object and position representations; a minimal binding sketch follows this list.
- Sample efficiency and interpretability are high, but benchmark-level generalization remains limited (e.g., 10.8% on ARC-AGI-1 Train, 3.0% on Eval).
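Holographic Reduced Representations bind symbols via circular convolution and unbind via convolution with an approximate inverse. The sketch below illustrates binding an object identity to a position vector and recovering it; it is a generic VSA example rather than the specific system of Joffe et al.

```python
import numpy as np

def bind(a, b):
    """Circular convolution: bind two HRR vectors."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def inverse(a):
    """Approximate HRR inverse (involution: reverse all elements except the first)."""
    return np.concatenate(([a[0]], a[:0:-1]))

d = 1024
rng = np.random.default_rng(0)
obj = rng.normal(0, 1 / np.sqrt(d), d)   # vector for an object identity (e.g., "red square")
pos = rng.normal(0, 1 / np.sqrt(d), d)   # vector for a grid position
bound = bind(obj, pos)                   # object-at-position representation
recovered = bind(bound, inverse(pos))    # unbinding recovers a noisy copy of obj
print(np.dot(recovered, obj) / (np.linalg.norm(recovered) * np.linalg.norm(obj)))  # high cosine similarity
```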
Neural Cellular Automata (NCA) variants (including EngramNCA) constitute developmental, self-organizing meta-models. These iterate local update rules across grids, showing promise in generalizing to novel grid sizes and patterns at low compute cost (Guichard et al., 13 May 2025, Xu et al., 18 Jun 2025).
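Conceptually, an NCA iterates a learned local update: each cell perceives a small neighborhood through fixed or learned 3x3 filters, passes the result through a tiny per-cell network, and adds the output back to its state. The sketch below is a generic, untrained illustration of that loop with random weights, not the EngramNCA architecture.

```python
import numpy as np

def nca_step(state, w1, w2, kernel):
    """One NCA update: 3x3 neighborhood perception -> per-cell MLP -> residual update."""
    h, w, c = state.shape
    padded = np.pad(state, ((1, 1), (1, 1), (0, 0)), mode="constant")
    perceived = np.zeros((h, w, c * kernel.shape[0]))
    for k in range(kernel.shape[0]):            # one perception channel group per 3x3 filter
        for i in range(h):
            for j in range(w):
                patch = padded[i:i + 3, j:j + 3, :]
                perceived[i, j, k * c:(k + 1) * c] = np.tensordot(kernel[k], patch, axes=([0, 1], [0, 1]))
    hidden = np.maximum(perceived @ w1, 0)      # per-cell MLP with ReLU
    return state + hidden @ w2                  # residual update of cell states

rng = np.random.default_rng(0)
c, n_filters, hidden_dim = 8, 3, 32
state = rng.normal(size=(10, 10, c))            # 10x10 grid with c-channel cell states
kernel = rng.normal(size=(n_filters, 3, 3))     # would be identity/Sobel-like filters when trained
w1 = rng.normal(size=(c * n_filters, hidden_dim)) * 0.1
w2 = rng.normal(size=(hidden_dim, c)) * 0.1
for _ in range(5):
    state = nca_step(state, w1, w2, kernel)
```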
6. Human Trajectories and Alignment
ARCTraj augments ARC-AGI-1 by recording and analyzing human object-level action trajectories on 400 training tasks (Kim et al., 14 Nov 2025):
- Formalizes ARC as a finite-horizon Markov Decision Process with states as grid-object compositions and actions as symbolic transformation triplets (see the sketch after this list).
- Provides a human baseline for explainability and aligns reinforcement learning, sequence modeling, and generative methods to real human reasoning.
- Shows that common selection biases (compact regions), color-attribution logic, and strategy grammars provide rich inductive signals for AGI model alignment.
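A hedged sketch of this MDP framing, with states as grid-object compositions and actions as (operation, selected object, parameters) triplets; the field names below are illustrative assumptions, not ARCTraj's released schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class GridObject:
    color: int                            # cell value 0-9
    cells: Tuple[Tuple[int, int], ...]    # (row, col) coordinates forming the object

@dataclass
class State:
    grid_shape: Tuple[int, int]
    objects: Tuple[GridObject, ...]       # object-level decomposition of the current grid

@dataclass
class Action:
    operation: str                        # e.g., "recolor", "translate", "reflect"
    target: int                           # index of the selected object in State.objects
    params: Dict[str, int]                # operation-specific parameters, e.g., {"dy": 1, "dx": 0}

# A human trajectory is then a finite sequence (s_0, a_0, s_1, ..., s_T)
# that terminates when the working grid matches the intended output.
```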
7. Limitations, Open Questions, and Future Directions
Several limitations persist:
- Modality-switching overhead in vision–language pipelines and image-resolution bottlenecks when rendering large grids.
- ARC-AGI-1’s tractability for brute-force program search with fixed DSL primitives diminishes its capacity to measure true abstraction (skills vs. intelligence) (Pfister et al., 13 Jan 2025).
- Dataset overfitting risk due to the limited private evaluation set size; approximately 49% of tasks are solvable by brute-force search (Chollet et al., 5 Dec 2024).
- ARC-AGI-2 proposals: expand the task pools, rebalance difficulty, introduce procedural novelty metrics, and track concept-specific progress.
Open future directions include:
- Joint multimodal training for vision–language models beyond zero-shot inference (Zhang et al., 19 Nov 2025).
- LLM–NCA hybrid pipelines, adaptive search via neural priors, enhanced representation learning for symbolic compositionality (Guichard et al., 13 May 2025).
- Universal benchmarks based on diversity-weighted, skill-generation metrics rather than fixed puzzle sets (Pfister et al., 13 Jan 2025).
- Human-aligned learning strategies and intention prediction modules derived from ARCTraj (Kim et al., 14 Nov 2025).
In sum, the ARC-AGI-1 Benchmark remains the canonical AGI testbed for few-shot abstraction, requiring systems that synergize vision and language, symbol and neural computation, and human-like reasoning trajectories. Recent advances in evolutionary program synthesis, vision–language decomposition, and cognitive modelling have significantly shifted the frontier, though full AGI-level abstraction and transfer remain unsolved (Zhang et al., 19 Nov 2025, Franzen et al., 8 May 2025, Joffe et al., 11 Nov 2025, Guichard et al., 13 May 2025, Pourcel et al., 10 Jul 2025, Chollet et al., 5 Dec 2024, Pfister et al., 13 Jan 2025).