ARC-AGI-1 Benchmark for AGI Evaluation
- ARC-AGI-1 Benchmark is a testbed for AGI systems focusing on few-shot rule induction and abstract reasoning using hand-crafted grid tasks.
- It employs both textual and visual encodings with metrics like Pass@1 to assess performance across 1,000 diverse tasks.
- The benchmark drives advances in program synthesis, vision-language synergy, and cognitive modeling by challenging out-of-distribution generalization.
The ARC-AGI-1 Benchmark is a rigorous testbed for evaluating artificial general intelligence (AGI) systems’ ability to perform abstract reasoning and rule induction from minimal examples. It comprises diverse, hand-crafted grid-based tasks designed to mirror “learning how to learn” in humans, specifically focusing on few-shot conceptual rule induction and transfer to novel problems. ARC-AGI-1 tasks have become central to measuring out-of-distribution generalization, compositional reasoning, and cognitive plausibility for frontier foundation models (Zhang et al., 19 Nov 2025).
1. Formal Definition and Objectives
ARC-AGI-1 presents each solver with a set of tasks $\{T_n\}_{n=1}^{N}$, each defined by input–output training pairs and a hidden test input:
- Each grid $x \in \mathcal{G}$ has size $h \times w$ with $1 \le h, w \le 30$ and cell values $x_{ij} \in \{0, 1, \dots, 9\}$.
- Task $T_n$ consists of demonstration pairs $\{(x_i, y_i)\}_{i=1}^{k}$ and one test input $x_{\text{test}}$.
- The underlying transformation is always a deterministic rule $f: \mathcal{G} \to \mathcal{G}$ with $y_i = f(x_i)$, which must be inferred from a handful of examples.
- Modalities:
  - Textual encoding: $\tau: \mathcal{G} \to \Sigma^{*}$, serializing a grid as a token sequence (e.g., row-wise digit strings with row delimiters).
  - Visual encoding: $\nu: \mathcal{G} \to \mathcal{I}$, rendering a grid as a color-coded image.
- Both $\tau$ and $\nu$ are invertible, so predicted encodings can be decoded back to grids.
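As a concrete illustration, the sketch below shows one plausible invertible textual serialization (row-wise digit strings joined by a delimiter); the exact prompt format varies across papers, so this particular encoding is an illustrative assumption rather than any paper's specification.

```python
def encode_grid(grid):
    """Serialize a grid (list of rows of ints 0-9) as a row-delimited digit string."""
    return "|".join("".join(str(c) for c in row) for row in grid)

def decode_grid(text):
    """Invert encode_grid, recovering the integer grid from its serialization."""
    return [[int(c) for c in row] for row in text.split("|")]

# Round-trip check: the encoding is lossless (invertible).
g = [[0, 1, 2], [3, 4, 5]]
assert decode_grid(encode_grid(g)) == g
```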
The principal evaluation metric is exact-match (Pass@1) accuracy over a holdout set:

$$\mathrm{Pass@1} = \frac{1}{N}\sum_{n=1}^{N}\mathbb{1}\!\left[\hat{y}_{\text{test}}^{(n)} = y_{\text{test}}^{(n)}\right]$$

This assesses whether the model's predicted output grid exactly matches the ground truth for each test input, averaged over the $N$ tasks.
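A minimal sketch of exact-match scoring over the public JSON task format (each task file holds "train" and "test" lists of {"input", "output"} grids); `solver` is a hypothetical callable that returns a single predicted grid.

```python
import json

def pass_at_1(task_paths, solver):
    """Fraction of tasks whose first predicted test output exactly matches ground truth."""
    solved = 0
    for path in task_paths:
        with open(path) as f:
            task = json.load(f)
        train_pairs = task["train"]        # list of {"input": grid, "output": grid}
        test_case = task["test"][0]        # first test pair (some tasks have more than one)
        prediction = solver(train_pairs, test_case["input"])
        solved += int(prediction == test_case["output"])
    return solved / len(task_paths)
```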
2. Benchmark Structure, Modalities, and Metrics
ARC-AGI-1 contains 1,000 total tasks: 400 public training tasks, 400 public evaluation tasks, and 200 private evaluation tasks (split into semi-private and fully private sets), each requiring solvers to generalize across transformations such as symmetries, counting, object manipulation, and spatial relations (Chollet et al., 5 Dec 2024).
- Tasks share only high-level structure; no two tasks use the same transformation logic.
- Input/output grids are either serialized as text sequences or rendered as color-coded images.
- Base metrics:
- Pass@1 (single guess per test)
- Two-guess accuracy (for competitions): proportion of tasks solved in up to two guesses.
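Under the competition's two-guess rule, a task counts as solved if either of two submitted attempts matches the ground truth exactly. A minimal sketch, assuming a hypothetical `solver` that returns up to two candidate grids:

```python
def two_guess_accuracy(tasks, solver):
    """Proportion of tasks solved within at most two guesses per test input."""
    solved = 0
    for train_pairs, test_input, test_output in tasks:
        guesses = solver(train_pairs, test_input)[:2]   # keep at most two candidates
        solved += int(any(g == test_output for g in guesses))
    return solved / len(tasks)
```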
Table: Task Split and Metrics
| Split | # Tasks | Main Metric |
|---|---|---|
| Train | 400 | Pass@1 |
| Public Eval | 400 | Pass@1, 2-guess |
| Private Eval | 200 | Pass@1 |
Significance: ARC-AGI-1 is specifically designed to require generalization on previously unseen rules, precluding brute-force memorization and stressing abstract reasoning (Pfister et al., 13 Jan 2025).
3. Algorithmic Paradigms and Synergistic Strategies
ARC-AGI-1 has catalyzed the development of several advanced solver paradigms.
Program Synthesis and Evolutionary Search
- Deep learning-guided program synthesis uses code LLMs to generate candidate programs in small DSLs, which are executed and filtered against the training examples, as sketched after this list (Chollet et al., 5 Dec 2024, Pourcel et al., 10 Jul 2025).
- SOAR (Self-Improving Operators for Automated program Refinements) iteratively fine-tunes LLMs via hindsight learning from search traces, enabling dramatic gains in solve rates (up to 52% with ensembles) (Pourcel et al., 10 Jul 2025).
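The generate-execute-filter loop common to these systems can be sketched as follows: sample candidate programs from a code model, run each on the training inputs, and keep only those that reproduce every training output. Function names here (e.g., `propose_programs`) are illustrative assumptions, not the papers' APIs.

```python
def synthesize(train_pairs, propose_programs, n_candidates=128):
    """Return candidate programs (callables) consistent with all training pairs."""
    consistent = []
    for program in propose_programs(train_pairs, n_candidates):  # e.g., LLM-sampled DSL/Python functions
        try:
            if all(program(x) == y for x, y in train_pairs):
                consistent.append(program)
        except Exception:
            continue  # discard programs that crash on any training input
    return consistent

# A solver then applies surviving programs to the hidden test input;
# disagreements among survivors can be resolved by majority vote.
```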
Vision-Language Synergy and Self-Correction
Vision-Language Synergy Reasoning (VLSR) and Modality-Switch Self-Correction (MSSC) are two synergistic inference strategies (Zhang et al., 19 Nov 2025):
- VLSR: Decomposes reasoning into Phase 1 (visual rule summarization) and Phase 2 (textual rule application), using vision for holistic pattern abstraction and language for element-wise rule execution.
- MSSC: Cross-modal error correction loop, critiquing text outputs via vision to overcome confirmation bias, yielding monotonic accuracy improvements.
Pipeline pseudocode:
```python
# Phase 1 (VLSR): summarize the transformation rule from the visual modality
r_pred = LVLM_vision("Summarize rule", images_in, images_out)
# Phase 2 (VLSR): apply the summarized rule in the textual modality
t_pred = LVLM_text("Apply rule", r_pred, text_grids, test_input_text)

# MSSC loop: re-render the textual prediction as an image and critique it visually
for r in range(N_max):
    predicted_image = render(t_pred)  # render() assumed: rasterize the predicted grid
    is_valid = LVLM_vision("Judge output", test_input_image, predicted_image)
    if is_valid:
        break
    feedback = "Your output breaks [spatial pattern X]. Revise."
    t_pred = LVLM_text("Apply rule with feedback", feedback, ...)
```
- VLSR + MSSC: Average gain of 4.33% over text-only baselines across major models (Zhang et al., 19 Nov 2025).
4. Empirical Results and Comparative Performance
Extensive benchmarking has established state-of-the-art and baseline scores:
- Frontier LLMs (GPT-4o, Gemini, Qwen3-VL, o4-mini): Text-only accuracy ranges from 8.25% to 42.25% (Zhang et al., 19 Nov 2025).
- VLSR: +3.02% improvement on average.
- MSSC: +1.82% improvement.
- Combined VLSR+MSSC: Up to 4.33% over baseline.
- SOAR ensemble: 52% tasks solved (Pourcel et al., 10 Jul 2025).
- Product-of-Experts with LLMs and systematic data augmentations reaches 71.6% at roughly $0.02 per task, surpassing the average human baseline (Franzen et al., 8 May 2025); a schematic sketch follows the table below.
- Best private eval scores (ARC Prize 2024): ~55.5% (Chollet et al., 5 Dec 2024).
Table: Pass@1 Accuracy for Main Approaches
| Model | Text-Only | VLSR+MSSC | SOAR Ensemble | PoE+DFS |
|---|---|---|---|---|
| GPT-4o | 8.25% | 14.50% | — | — |
| Gemini-2.5-Pro | 35.00% | 42.25% | — | — |
| o4-mini | 42.25% | 46.75% | — | — |
| Qwen3-VL | 20.25% | 22.25% | — | — |
| SOAR Ensemble | — | — | 52.00% | — |
| PoE+DFS | — | — | — | 71.6% |
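The product-of-experts idea can be sketched as scoring each candidate answer under several augmented "views" of the task (e.g., rotations) and combining the per-view model log-probabilities; the highest-scoring candidate wins. The sketch below is a schematic under those assumptions, with a hypothetical `log_prob` scoring function and task layout, not the paper's exact pipeline.

```python
import numpy as np

def rotate(grid, k):
    """Rotate a grid (list of rows) by k * 90 degrees."""
    return np.rot90(np.array(grid), k).tolist()

def poe_select(candidates, task, log_prob, rotations=(0, 1, 2, 3)):
    """Pick the candidate with the highest summed log-probability across augmented views."""
    def score(candidate):
        total = 0.0
        for k in rotations:
            aug_task = {  # apply the same rotation to every grid in the task (hypothetical layout)
                "train": [{"input": rotate(p["input"], k), "output": rotate(p["output"], k)}
                          for p in task["train"]],
                "test_input": rotate(task["test_input"], k),
            }
            total += log_prob(aug_task, rotate(candidate, k))  # one "expert" per view
        return total
    return max(candidates, key=score)
```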
5. Cognitive Modelling and Neurosymbolic Methods
Solutions informed by human cognition and symbolic reasoning have been explored:
- Vector Symbolic Algebra-based systems combine System 1 perceptual heuristics (object segmentation, fast similarity) and System 2 symbolic program inference (minimum-hitting set search, parameter induction) (Joffe et al., 11 Nov 2025).
- Task composition leverages symbolic operations (e.g., Extract, Recolour, Grow), employing Holographic Reduced Representations (HRRs) and Spatial Semantic Pointers (SSPs) for object and position representations; a minimal binding sketch follows this list.
- Sample efficiency and interpretability are high, but benchmark-level generalization remains limited (e.g., 10.8% on ARC-AGI-1 Train, 3.0% on Eval).
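Holographic Reduced Representations bind symbols via circular convolution and unbind via convolution with an approximate inverse. The sketch below illustrates binding an object identity to a position vector and recovering it; it is a generic VSA example rather than the specific system of Joffe et al.

```python
import numpy as np

def bind(a, b):
    """Circular convolution: bind two HRR vectors."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def inverse(a):
    """Approximate HRR inverse (involution: reverse all elements except the first)."""
    return np.concatenate(([a[0]], a[:0:-1]))

d = 1024
rng = np.random.default_rng(0)
obj = rng.normal(0, 1 / np.sqrt(d), d)   # vector for an object identity (e.g., "red square")
pos = rng.normal(0, 1 / np.sqrt(d), d)   # vector for a grid position
bound = bind(obj, pos)                   # object-at-position representation
recovered = bind(bound, inverse(pos))    # unbinding recovers a noisy copy of obj
print(np.dot(recovered, obj) / (np.linalg.norm(recovered) * np.linalg.norm(obj)))  # high cosine similarity
```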
Neural Cellular Automata (NCA) variants (including EngramNCA) constitute developmental, self-organizing meta-models. These iterate local update rules across grids, showing promise in generalizing to novel grid sizes and patterns at low compute cost (Guichard et al., 13 May 2025, Xu et al., 18 Jun 2025).
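Conceptually, an NCA iterates a learned local update: each cell perceives a small neighborhood through fixed or learned 3x3 filters, passes the result through a tiny per-cell network, and adds the output back to its state. The sketch below is a generic, untrained illustration of that loop with random weights, not the EngramNCA architecture.

```python
import numpy as np

def nca_step(state, w1, w2, kernel):
    """One NCA update: 3x3 neighborhood perception -> per-cell MLP -> residual update."""
    h, w, c = state.shape
    padded = np.pad(state, ((1, 1), (1, 1), (0, 0)), mode="constant")
    perceived = np.zeros((h, w, c * kernel.shape[0]))
    for k in range(kernel.shape[0]):            # one perception channel group per 3x3 filter
        for i in range(h):
            for j in range(w):
                patch = padded[i:i + 3, j:j + 3, :]
                perceived[i, j, k * c:(k + 1) * c] = np.tensordot(kernel[k], patch, axes=([0, 1], [0, 1]))
    hidden = np.maximum(perceived @ w1, 0)      # per-cell MLP with ReLU
    return state + hidden @ w2                  # residual update of cell states

rng = np.random.default_rng(0)
c, n_filters, hidden_dim = 8, 3, 32
state = rng.normal(size=(10, 10, c))            # 10x10 grid with c-channel cell states
kernel = rng.normal(size=(n_filters, 3, 3))     # would be identity/Sobel-like filters when trained
w1 = rng.normal(size=(c * n_filters, hidden_dim)) * 0.1
w2 = rng.normal(size=(hidden_dim, c)) * 0.1
for _ in range(5):
    state = nca_step(state, w1, w2, kernel)
```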
6. Human Trajectories and Alignment
ARCTraj augments ARC-AGI-1 by recording and analyzing human object-level action trajectories on 400 training tasks (Kim et al., 14 Nov 2025):
- Formalizes ARC as a finite-horizon Markov Decision Process with states as grid-object compositions and actions as symbolic transformation triplets (see the sketch after this list).
- Provides a human baseline for explainability and aligns reinforcement learning, sequence modeling, and generative methods to real human reasoning.
- Shows that common selection biases (compact regions), color-attribution logic, and strategy grammars provide rich inductive signals for AGI model alignment.
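A hedged sketch of this MDP framing, with states as grid-object compositions and actions as (operation, selected object, parameters) triplets; the field names below are illustrative assumptions, not ARCTraj's released schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class GridObject:
    color: int                            # cell value 0-9
    cells: Tuple[Tuple[int, int], ...]    # (row, col) coordinates forming the object

@dataclass
class State:
    grid_shape: Tuple[int, int]
    objects: Tuple[GridObject, ...]       # object-level decomposition of the current grid

@dataclass
class Action:
    operation: str                        # e.g., "recolor", "translate", "reflect"
    target: int                           # index of the selected object in State.objects
    params: Dict[str, int]                # operation-specific parameters, e.g., {"dy": 1, "dx": 0}

# A human trajectory is then a finite sequence (s_0, a_0, s_1, ..., s_T)
# that terminates when the working grid matches the intended output.
```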
7. Limitations, Open Questions, and Future Directions
Several limitations persist:
- Modality-switching overhead in vision–language pipelines and image-resolution bottlenecks when rendering large grids.
- ARC-AGI-1’s tractability for brute-force program search with fixed DSL primitives diminishes its capacity to measure true abstraction (skills vs. intelligence) (Pfister et al., 13 Jan 2025).
- Dataset overfitting risk due to the limited private evaluation set size; approximately 49% of tasks are solvable by brute-force search (Chollet et al., 5 Dec 2024).
- ARC-AGI-2 proposals: expand the task pools, rebalance difficulty, introduce procedural novelty metrics, and track concept-specific progress.
Open future directions include:
- Joint multimodal training for vision–language models beyond zero-shot inference (Zhang et al., 19 Nov 2025).
- LLM–NCA hybrid pipelines, adaptive search via neural priors, enhanced representation learning for symbolic compositionality (Guichard et al., 13 May 2025).
- Universal benchmarks based on diversity-weighted, skill-generation metrics rather than fixed puzzle sets (Pfister et al., 13 Jan 2025).
- Human-aligned learning strategies and intention prediction modules derived from ARCTraj (Kim et al., 14 Nov 2025).
In sum, the ARC-AGI-1 Benchmark remains the canonical AGI testbed for few-shot abstraction, requiring systems that synergize vision and language, symbol and neural computation, and human-like reasoning trajectories. Recent advances in evolutionary program synthesis, vision–language decomposition, and cognitive modelling have significantly shifted the frontier, though full AGI-level abstraction and transfer remain unsolved (Zhang et al., 19 Nov 2025, Franzen et al., 8 May 2025, Joffe et al., 11 Nov 2025, Guichard et al., 13 May 2025, Pourcel et al., 10 Jul 2025, Chollet et al., 5 Dec 2024, Pfister et al., 13 Jan 2025).