
ARC-AGI-1 Benchmark for AGI Evaluation

Updated 21 November 2025
  • ARC-AGI-1 Benchmark is a testbed for AGI systems focusing on few-shot rule induction and abstract reasoning using hand-crafted grid tasks.
  • It employs both textual and visual encodings with metrics like Pass@1 to assess performance across 1,000 diverse tasks.
  • The benchmark drives advances in program synthesis, vision-language synergy, and cognitive modeling by challenging out-of-distribution generalization.

The ARC-AGI-1 Benchmark is a rigorous testbed for evaluating artificial general intelligence (AGI) systems’ ability to perform abstract reasoning and rule induction from minimal examples. It comprises diverse, hand-crafted grid-based tasks designed to mirror “learning how to learn” in humans, specifically focusing on few-shot conceptual rule induction and transfer to novel problems. ARC-AGI-1 tasks have become central to measuring out-of-distribution generalization, compositional reasoning, and cognitive plausibility for frontier foundation models (Zhang et al., 19 Nov 2025).

1. Formal Definition and Objectives

ARC-AGI-1 presents each solver with a set of tasks $\{\tau_j\}$, each defined by $K$ input–output training pairs and a hidden test input:

  • Each grid $m \in \mathbb{Z}^{H \times W}$ with $H, W \leq 30$ and cell values in $\{0, \ldots, 9\}$.
  • Task $\tau$ consists of $\{(m_i^{\mathrm{in}}, m_i^{\mathrm{out}})\}_{i=1}^{K}$ together with one test input $m_{\mathrm{test}}^{\mathrm{in}}$.
  • The underlying transformation is always a deterministic rule $r : \mathbb{Z}^{H \times W} \to \mathbb{Z}^{H \times W}$, which must be inferred from a handful of examples.
  • Modalities:
    • Textual encoding: $t = \mathcal{T}(m)$
    • Visual encoding: $i = \mathcal{V}(m)$
    • Both $\mathcal{T}$ and $\mathcal{V}$ are invertible.

The principal evaluation metric is exact match (Pass@1) accuracy over a holdout set:

$$\mathrm{Acc} = \frac{1}{N} \sum_{j=1}^{N} \mathbf{1}\left[ m_{\mathrm{pred}}^{(j)} = m_{\mathrm{gt}}^{(j)} \right]$$

This assesses whether the model's predicted output grid $m_{\mathrm{pred}}$ matches the ground truth $m_{\mathrm{gt}}$ for each test input, over $N$ tasks.
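
The following minimal Python sketch (not the official evaluation harness) illustrates the task format and the exact-match metric; the dictionary layout mirrors the public ARC-AGI-1 JSON files, and the helper names are illustrative.

from typing import Dict, List

Grid = List[List[int]]

def exact_match(pred: Grid, gt: Grid) -> bool:
    # An output counts only if the shape and every cell value match exactly.
    return pred == gt

def pass_at_1(predictions: List[Grid], ground_truths: List[Grid]) -> float:
    # Acc = (1/N) * sum over N held-out tasks of the exact-match indicator.
    hits = sum(exact_match(p, g) for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# One task tau: K = 2 training pairs plus a test input whose output is withheld.
task: Dict = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]},
        {"input": [[2, 2], [0, 0]], "output": [[0, 0], [2, 2]]},
    ],
    "test": [{"input": [[3, 0], [0, 3]]}],
}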

2. Benchmark Structure, Modalities, and Metrics

ARC-AGI-1 contains 1,000 total tasks with 400 public “train,” 400 public “eval,” and 200 private tasks, each requiring solvers to generalize across transformations such as symmetries, counting, object manipulation, and spatial relations (Chollet et al., 5 Dec 2024).

  • Tasks share only the high-level input–output format; no two tasks use the same transformation logic.
  • Input/output grids are either serialized as text sequences or rendered as color-coded images.
  • Base metrics:
    • Pass@1 (single guess per test)
    • Two-guess accuracy (for competitions): proportion of tasks solved in up to two guesses.

Table: Task Split and Metrics

Split          # Tasks   Main Metric
Train          400       Pass@1
Public Eval    400       Pass@1, 2-guess
Private Eval   200       Pass@1
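
As a hedged illustration of the two metrics above, the snippet below assumes each solver returns an ordered list of candidate output grids per task; the function name is illustrative.

from typing import List

Grid = List[List[int]]

def two_guess_accuracy(guesses: List[List[Grid]], ground_truths: List[Grid]) -> float:
    # Competition scoring: a task counts as solved if either of the first two
    # candidates is an exact match; Pass@1 is the special case task_guesses[:1].
    solved = sum(
        any(candidate == gt for candidate in task_guesses[:2])
        for task_guesses, gt in zip(guesses, ground_truths)
    )
    return solved / len(ground_truths)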

Significance: ARC-AGI-1 is specifically designed to require generalization on previously unseen rules, precluding brute-force memorization and stressing abstract reasoning (Pfister et al., 13 Jan 2025).

3. Algorithmic Paradigms and Synergistic Strategies

ARC-AGI-1 has catalyzed the development of several advanced solver paradigms.

  • Deep learning-guided program synthesis uses code LLMs to generate candidate programs in small domain-specific languages (DSLs), which are then executed and filtered against the training examples (Chollet et al., 5 Dec 2024, Pourcel et al., 10 Jul 2025); a minimal sketch of this generate-and-filter loop follows the list.
  • SOAR (Self-Improving Operators for Automated program Refinements) iteratively fine-tunes LLMs via hindsight learning from search traces, enabling dramatic gains in solve rates (up to 52% with ensembles) (Pourcel et al., 10 Jul 2025).
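
A minimal, hedged sketch of the generate-and-filter loop referenced above: candidate programs (here plain Python functions over grids) are sampled, kept only if they reproduce every training pair, and a surviving candidate is applied to the test input. The sampler is a stub; actual systems rely on code LLMs, richer DSLs, and large-scale sampling.

from typing import Callable, List, Optional

Grid = List[List[int]]
Program = Callable[[Grid], Grid]

def sample_candidate_programs(task, n: int) -> List[Program]:
    # Stub: in practice an LLM proposes n programs conditioned on the training pairs.
    return [lambda g: g, lambda g: [row[::-1] for row in g]]  # identity, horizontal flip

def solve(task, n_candidates: int = 64) -> Optional[Grid]:
    for program in sample_candidate_programs(task, n_candidates):
        try:
            if all(program(pair["input"]) == pair["output"] for pair in task["train"]):
                return program(task["test"][0]["input"])  # first program consistent with all pairs
        except Exception:
            continue  # discard candidates that crash on a training grid
    return None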

Vision-Language Synergy and Self-Correction

Vision-Language Synergy Reasoning (VLSR) and Modality-Switch Self-Correction (MSSC) are two synergistic inference strategies (Zhang et al., 19 Nov 2025):

  • VLSR: Decomposes reasoning into Phase 1 (visual rule summarization) and Phase 2 (textual rule application), using vision for holistic pattern abstraction and language for element-wise rule execution.
  • MSSC: Cross-modal error correction loop, critiquing text outputs via vision to overcome confirmation bias, yielding monotonic accuracy improvements.

Pipeline pseudocode:

# Phase 1 (VLSR): summarize the candidate rule from rendered images of the train pairs.
r_pred = LVLM_vision("Summarize rule", images_in, images_out)

# Phase 2 (VLSR): apply the summarized rule element-wise in the text modality.
t_pred = LVLM_text("Apply rule", r_pred, text_grids, test_input_text)

# MSSC loop: re-render the textual prediction, critique it visually, revise it textually.
for attempt in range(N_max):
    predicted_image = V(t_pred)  # current prediction re-rendered via the visual encoding V
    is_valid = LVLM_vision("Judge output", test_input_image, predicted_image)
    if is_valid:
        break
    feedback = "Your output breaks [spatial pattern X]. Revise."  # illustrative vision-side critique
    t_pred = LVLM_text("Apply rule with feedback", feedback, ...)

4. Empirical Results and Comparative Performance

Extensive benchmarking has established state-of-the-art and baseline scores:

Table: Pass@1 Accuracy for Main Approaches

Model            Text-Only Pass@1   VLSR+MSSC Pass@1
GPT-4o           8.25%              14.50%
Gemini-2.5-Pro   35.00%             42.25%
o4-mini          42.25%             46.75%
Qwen3-VL         20.25%             22.25%

Reference points (overall Pass@1): SOAR Ensemble 52.00% (Pourcel et al., 10 Jul 2025); PoE+DFS 71.6%.

5. Cognitive Modelling and Neurosymbolic Methods

Solutions informed by human cognition and symbolic reasoning have been explored:

  • Vector Symbolic Algebra-based systems combine System 1 perceptual heuristics (object segmentation, fast similarity) and System 2 symbolic program inference (minimum-hitting set search, parameter induction) (Joffe et al., 11 Nov 2025).
  • Task composition leverages symbolic operations (e.g., Extract, Recolour, Grow), employing Holographic Reduced Representations (HRRs) and Spatial Semantic Pointers (SSPs) for object and position representations (a minimal binding sketch follows this list).
  • Sample efficiency and interpretability are high, but benchmark-level generalization remains limited (e.g., 10.8% on ARC-AGI-1 Train, 3.0% on Eval).
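
As a concrete, hedged illustration of HRR binding (not the cited system's actual implementation), the snippet below binds an object vector to a position vector via circular convolution and approximately recovers the position via circular correlation; the vector names are illustrative.

import numpy as np

def bind(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Circular convolution: composes two vectors into one of the same dimensionality.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(trace: np.ndarray, a: np.ndarray) -> np.ndarray:
    # Circular correlation: approximately recovers b from bind(a, b) given a.
    return np.real(np.fft.ifft(np.fft.fft(trace) * np.conj(np.fft.fft(a))))

d = 1024
rng = np.random.default_rng(0)
obj = rng.normal(0.0, 1.0 / np.sqrt(d), d)    # identity vector for an object
pos = rng.normal(0.0, 1.0 / np.sqrt(d), d)    # vector for a grid position
trace = bind(obj, pos)                         # "object at position" as one fixed-width vector
recovered = unbind(trace, obj)                 # noisy estimate of pos
similarity = recovered @ pos / (np.linalg.norm(recovered) * np.linalg.norm(pos))
print(round(similarity, 2))                    # well above chance, so the pairing is recoverable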

Neural Cellular Automata (NCA) variants (including EngramNCA) constitute developmental, self-organizing meta-models. These iterate local update rules across grids, showing promise in generalizing to novel grid sizes and patterns at low compute cost (Guichard et al., 13 May 2025, Xu et al., 18 Jun 2025).
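
A generic sketch of one NCA update step follows; it only illustrates the iterated local-rule idea and does not reproduce EngramNCA or the cited architectures. The channel count, nonlinearity, and weight shapes are assumptions.

import numpy as np

def nca_step(state: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    # state: (H, W, C) cell states; every cell reads its 3x3 neighbourhood and applies
    # the same small linear-plus-tanh update, so the rule is grid-size agnostic.
    H, W, C = state.shape
    padded = np.pad(state, ((1, 1), (1, 1), (0, 0)))
    new_state = np.empty_like(state)
    for y in range(H):
        for x in range(W):
            neighbourhood = padded[y:y + 3, x:x + 3, :].reshape(-1)  # 9 * C local features
            new_state[y, x] = np.tanh(neighbourhood @ weights + bias)
    return new_state

C = 8                                             # hidden channels; one channel can hold the visible colour
rng = np.random.default_rng(0)
state = rng.normal(size=(5, 5, C))
weights = rng.normal(scale=0.1, size=(9 * C, C))
bias = np.zeros(C)
for _ in range(10):                               # iterate the local rule; training would fit the weights
    state = nca_step(state, weights, bias)
print(state.shape)                                # (5, 5, 8): the same rule applies to any H x W grid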

6. Human Trajectories and Alignment

ARCTraj augments ARC-AGI-1 by recording and analyzing human object-level action trajectories on 400 training tasks (Kim et al., 14 Nov 2025):

  • Formalizes ARC as a finite-horizon Markov Decision Process (MDP) with states as grid-object compositions and actions as symbolic transformation triplets (a minimal sketch follows this list).
  • Provides a human baseline for explainability and aligns reinforcement learning, sequence modeling, and generative methods to real human reasoning.
  • Shows that common selection biases (compact regions), color-attribution logic, and strategy grammars provide rich inductive signals for AGI model alignment.
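
A hedged sketch of this MDP view is given below; the field names are illustrative and are not ARCTraj's actual schema.

from dataclasses import dataclass
from typing import Dict, List, Tuple

Grid = List[List[int]]
Cells = Tuple[Tuple[int, int], ...]           # an object as its set of (row, col) cells

@dataclass
class State:
    grid: Grid                                # current working grid
    objects: Tuple[Cells, ...]                # object decomposition of the grid

@dataclass
class Action:
    object_index: int                         # which object is manipulated
    operation: str                            # e.g. "translate", "recolor", "reflect"
    arguments: Dict[str, int]                 # e.g. {"dx": 1, "dy": 0} or {"color": 4}

# A human trajectory is a finite-horizon sequence s_0, a_0, s_1, ..., s_T ending at the submitted answer.
Trajectory = List[Tuple[State, Action]]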

7. Limitations, Open Questions, and Future Directions

Several limitations persist:

  • Modal switching overhead for vision–language pipelines; visualization resolution bottlenecks for large grids.
  • ARC-AGI-1’s tractability for brute-force program search with fixed DSL primitives diminishes its capacity to measure true abstraction (skills vs. intelligence) (Pfister et al., 13 Jan 2025).
  • Dataset overfitting risk due to the limited size of the private evaluation set; an estimated 49% of tasks are solvable by brute-force search (Chollet et al., 5 Dec 2024).
  • ARC-AGI-2 proposals: expand pools, rebalance difficulty, introduce procedural novelty metrics, track concept-specific progress.

Open future directions include expanding and rebalancing the task pool as proposed for ARC-AGI-2, developing procedural novelty metrics, and aligning solvers with human reasoning trajectories.

In sum, the ARC-AGI-1 Benchmark remains the canonical AGI testbed for few-shot abstraction, requiring systems that synergize vision and language, symbol and neural computation, and human-like reasoning trajectories. Recent advances in evolutionary program synthesis, vision–language decomposition, and cognitive modelling have significantly shifted the frontier, though full AGI-level abstraction and transfer remain unsolved (Zhang et al., 19 Nov 2025, Franzen et al., 8 May 2025, Joffe et al., 11 Nov 2025, Guichard et al., 13 May 2025, Pourcel et al., 10 Jul 2025, Chollet et al., 5 Dec 2024, Pfister et al., 13 Jan 2025).
