ARC-AGI-2 Dataset Benchmark
- The ARC-AGI-2 dataset is a benchmark for evaluating abstract reasoning in 2-D grid tasks, advancing the measurement of human-like intelligence.
- It employs a demonstration-based input–output format and calibrated evaluation splits to ensure consistent difficulty and fair performance comparison.
- Empirical metrics from human and AI baselines validate its capacity to probe object persistence, number sense, and compositional reasoning.
The ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence 2) dataset constitutes an upgraded benchmark designed to rigorously evaluate the general fluid intelligence of artificial systems. Building on the foundational ARC-AGI introduced in 2019, ARC-AGI-2 delivers a more granular, calibrated, and human-accessible set of tasks targeting higher-order abstract reasoning and problem-solving abilities within 2-D grid environments. Task design, validation, and partitioning are explicitly engineered to measure progress toward general, human-like intelligence under tightly controlled evaluation conditions (Chollet et al., 17 May 2025).
1. Task Format and Data Specification
ARC-AGI-2 adheres to a demonstration-based input–output format, where each task consists of a minimal set of demonstration input–output pairs and one or more test inputs. Each grid, whether demonstration or test, input or output, is a 2-dimensional array represented as a row-major nested JSON array of integer color codes drawn from {0, 1, …, 9}. Grid heights and widths range from 1 to 30 cells.
Key data conventions:
- No global color semantics: palette usage is per-task, and semantic interpretation must be deduced locally.
- Test-pair outputs are withheld from agents and reserved for scoring; test pairs are never presented among the demonstration examples.
- All cell encoding adheres to the row-major JSON format; demonstration and test pairs follow the same structural convention.
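These conventions translate directly into a small loader. The following is a minimal sketch, assuming the standard ARC task-file layout in which a JSON object holds a "train" list of demonstration pairs and a "test" list of test pairs (the file path in the usage comment is hypothetical):

```python
import json

def validate_grid(grid):
    """Check that a grid is a rectangular array of color codes 0-9, at most 30x30."""
    assert 1 <= len(grid) <= 30, "grid height out of range"
    width = len(grid[0])
    assert 1 <= width <= 30, "grid width out of range"
    for row in grid:
        assert len(row) == width, "grid is not rectangular"
        assert all(isinstance(c, int) and 0 <= c <= 9 for c in row), "invalid color code"

def load_task(path):
    """Load one task file and validate every grid it contains."""
    with open(path) as f:
        task = json.load(f)
    for pair in task["train"]:
        validate_grid(pair["input"])
        validate_grid(pair["output"])
    for pair in task["test"]:
        validate_grid(pair["input"])  # test outputs are withheld at evaluation time
    return task

# Hypothetical usage:
# task = load_task("data/evaluation/0a1d4ef5.json")
# print(len(task["train"]), "demonstration pairs,", len(task["test"]), "test pairs")
```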
2. Task Inventory, Partitioning, and Evaluation Splits
ARC-AGI-2 partitions tasks into three primary evaluation sets, all calibrated such that their mean human accuracies differ by at most 1 percentage point:
- Public Evaluation set
- Semi-Private Evaluation set (120 tasks)
- Private Evaluation set (120 tasks)
Additionally, a Public Training set comprising validated tasks for pre-training and demonstration is provided. This training set is not difficulty-calibrated, incorporates a broad range of problem complexities, and expands as new tasks are validated. During events such as ARC Prize 2025, systems must solve 240 previously unseen tasks (120 semi-private + 120 private) under fixed compute constraints.
3. Difficulty Metrics and Targeted Cognitive Skills
Task difficulty is empirically measured. For each task t, let n_t denote the number of unique human participants who attempted at least one of its test pairs and s_t the number who solved all test pairs correctly within their first two attempts; the human accuracy is p_t = s_t / n_t and the nominal task difficulty is d_t = 1 − p_t. Evaluation-set calibration ensures |p̄_pub − p̄_semi| ≤ 0.01, |p̄_pub − p̄_priv| ≤ 0.01, and |p̄_semi − p̄_priv| ≤ 0.01, where p̄_pub, p̄_semi, and p̄_priv are the mean human accuracies of the respective partitions.
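The per-task metric and the calibration constraint compose as in the minimal sketch below, which assumes per-task counts of attempting and solving participants are available; all task IDs and counts shown are hypothetical:

```python
from statistics import mean

def task_accuracy(solved: int, attempted: int) -> float:
    """Empirical human accuracy p_t = s_t / n_t; nominal difficulty is d_t = 1 - p_t."""
    return solved / attempted

def split_means(splits: dict) -> dict:
    """Mean human accuracy per evaluation split; each split maps task IDs to (s_t, n_t)."""
    return {name: mean(task_accuracy(s, n) for s, n in tasks.values())
            for name, tasks in splits.items()}

def is_calibrated(means: dict, tol: float = 0.01) -> bool:
    """Calibration requires split means to differ pairwise by at most one percentage point."""
    values = list(means.values())
    return max(values) - min(values) <= tol

# Hypothetical per-task counts: {task_id: (solved, attempted)}
splits = {
    "public":       {"t001": (9, 12), "t002": (5, 11)},
    "semi_private": {"t101": (8, 12), "t102": (6, 11)},
    "private":      {"t201": (7, 11), "t202": (7, 12)},
}
means = split_means(splits)
print(means, "calibrated:", is_calibrated(means))
```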
The dataset is designed to probe:
- Object persistence and tracking
- Elementary number sense (e.g., counting, parity)
- Geometric/topological reasoning (connectivity, symmetry)
- Compositional generalization, including:
- Multi-rule and multi-step sequential reasoning
- Contextual (conditional) rule application (see the toy sketch below)
- In-context symbol definition
No fixed discrete difficulty levels are provided; rather, the dataset constitutes a continuum of difficulty empirically anchored to human performance data.
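As a toy illustration of the contextual (conditional) rule application category listed above, the sketch below encodes a made-up transformation whose behavior depends on a marker cell. It is not an actual ARC-AGI-2 task, only an indication of the kind of rule such tasks require solvers to infer from demonstrations:

```python
def apply_contextual_rule(grid: list) -> list:
    """Toy rule: a marker in the top-left cell selects the transformation.
    Marker 1 -> reflect left-right, marker 2 -> reflect top-bottom; the marker
    cell itself is cleared to 0 before reflecting."""
    marker = grid[0][0]
    body = [row[:] for row in grid]
    body[0][0] = 0
    if marker == 1:
        return [row[::-1] for row in body]
    if marker == 2:
        return body[::-1]
    return body

# One made-up demonstration pair: marker 1 means "flip horizontally".
demo_input = [
    [1, 0, 3],
    [0, 3, 0],
    [3, 0, 0],
]
print(apply_contextual_rule(demo_input))  # [[3, 0, 0], [0, 3, 0], [0, 0, 3]]
```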
4. Task Generation, Filtering, and Validation Protocol
The task curation pipeline comprises the following stages:
- a. Over-generation: Candidate tasks originate from new authoring or are adapted from previously validated ARC-AGI-1 tasks and unvalidated ARC-AGI-2 drafts; ARC-AGI-1 public-training tasks are excluded.
- b. Human screening: Each candidate advances only if at least two independent human participants solve at least one test pair within their first two attempts (a filter sketched in code below).
- c. Difficulty calibration: Tasks are grouped into partitions such that mean human accuracies differ by at most 1 percentage point, with fresh tasks preferentially allocated to the Private set.
- d. Redundancy detection: A panel inspects surviving tasks and removes any pair of tasks solvable by a single general solution, ensuring non-redundancy and coverage diversity.
- e. Final validation: Every evaluation-set task is solved by two independent external reviewers and an additional internal reviewer distinct from the author. Cell-level artifacts in training examples are corrected if present, but corrections must not affect test-pair solvability.
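The screening criterion in stage (b) reduces to a simple filter over per-participant attempt records. The sketch below assumes a hypothetical log format of (test-pair index, attempt number, solved) tuples per participant:

```python
def passes_screening(attempts: dict) -> bool:
    """A candidate task survives only if at least two independent participants
    solved at least one of its test pairs within their first two attempts."""
    qualifying = {
        pid
        for pid, records in attempts.items()
        if any(solved and attempt_no <= 2 for _pair, attempt_no, solved in records)
    }
    return len(qualifying) >= 2

# Hypothetical attempt log for one candidate task:
# participant -> list of (test_pair_index, attempt_number, solved) records.
log = {
    "p01": [(0, 1, False), (0, 2, True)],   # solved pair 0 on the second attempt
    "p02": [(0, 1, True)],                  # solved pair 0 on the first attempt
    "p03": [(0, 1, False), (0, 2, False)],  # never solved
}
print(passes_screening(log))  # True: two participants qualify
```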
5. Empirical Dataset Statistics and Human Baselines
From the human calibration experiments:
- Test-pair counts per task: 68% have 1 pair, 29% have 2, 3% have 3, and 1% have 4
- Total human attempts: 13,405; successful solves: 8,277 (aggregate 62% success rate)
- Mean attempts per test pair:
- Median time per attempted test pair: 2.3 minutes (successful: 2.2 minutes)
- Post-selection mean human accuracy:
- Median per-participant solve rate: per attempt
- Grid size: ; color palette size: per task
Human accuracy over an evaluation subset is defined as the mean of the per-task accuracies p_t within that subset. The observed median solution time is on the order of a few minutes per task (2.3 minutes per attempted test pair, as above).
6. AI Baseline Performance and Scoring Protocol
AI systems are evaluated on held-out test pairs using the same accuracy metric as human baselines. On the Semi-Private Evaluation set (120 tasks), current system performance (as of May 14, 2025) is as follows:
| Model | Compute Regime | Accuracy (%) |
|---|---|---|
| o3-mini | High | 3.0 |
| o3 | Medium | 3.0 |
| ARChitects (2024) | — | 2.5 |
| o4-mini | Medium | 2.4 |
| Icecuber (2020) | — | 1.6 |
| o1-pro | Low | 0.9 |
| Claude 3.7 (8K) | — | 0.9 |
Models scoring below a small accuracy threshold are considered to operate at "noise level"; performance is judged distinguishable from noise only above that threshold.
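A minimal sketch of the two-attempt exact-match scoring convention described for the human baselines, applied here to machine predictions; the official leaderboard aggregation may differ in detail, and all grids below are hypothetical:

```python
Grid = list  # a grid is a row-major nested list of integer color codes

def pair_solved(predictions: list, target: Grid) -> bool:
    """A test pair counts as solved if either of (at most) two predicted grids
    matches the target exactly."""
    return any(pred == target for pred in predictions[:2])

def task_solved(per_pair_predictions: list, targets: list) -> bool:
    """A task counts as solved only if every one of its test pairs is solved."""
    return all(pair_solved(preds, tgt)
               for preds, tgt in zip(per_pair_predictions, targets))

def set_accuracy(task_results: list) -> float:
    """Fraction of solved tasks across an evaluation set."""
    return sum(task_results) / len(task_results)

# Hypothetical example: one task, one test pair, two attempts (second is correct).
target = [[0, 1], [1, 0]]
attempts = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
print(task_solved([attempts], [target]), set_accuracy([True, False, True, False]))
```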
7. Licensing, Distribution, and Research Usage
The Public Training set and baseline implementations are freely available in the ARC Prize GitHub repository. Results on the Semi-Private Evaluation set are reported on the public leaderboard, while the Private set is withheld until code submissions are open-sourced. Competition participation requires that code run fully offline (no internet access), that the full code be released, and that hardware limits be respected (4 NVIDIA L4 GPUs, 12-hour wall time). Dataset usage rules, model eligibility, and detailed policy are documented at https://arcprize.org/policy. All tasks are released under the ARC Prize Foundation's standard benchmark license, which expressly prohibits prior memorization or any form of external leakage. Proper citation is required for publications using ARC-AGI-2 (Chollet et al., 17 May 2025).