ARC-AGI-3 Benchmark

Updated 29 March 2026

The ARC-AGI-3 Benchmark is an evaluation suite that uses interactive, turn-based environments built from core knowledge primitives to measure agentic intelligence.
It employs formal environment specifications and an efficiency-based scoring framework, notably using Relative Human Action Efficiency to compare agents against human baselines.
Empirical results reveal a dramatic performance gap, with current AI systems achieving less than 1% of human-level action efficiency, underscoring challenges in adaptive reasoning.

The ARC-AGI-3 benchmark is an interactive, efficiency-based evaluation suite designed to measure agentic intelligence in artificial agents. It radically extends the Abstraction and Reasoning Corpus (ARC) paradigm—which originally assessed pattern induction over static grid transformations—by introducing novel, turn-based environments that explicitly require exploration, hypothesis formation, planning, memory, autonomous goal acquisition, and robustness to underspecified or shifting goals. ARC-AGI-3 environments are constructed solely from “Core Knowledge” primitives, eschewing language and cultural knowledge, to measure fluid intelligence grounded in general reasoning rather than memorized skill repertoires. Human participants consistently solve all environments, whereas as of early 2026, leading AI systems achieve less than 1% of human-level action efficiency, thereby characterizing the unsolved gap in open-ended interactive reasoning (Foundation, 24 Mar 2026, Vahdati et al., 9 Mar 2026).

1. Formal Environment Specification and Design Principles

ARC-AGI-3 environments are defined as tuples $e = (S, A, O, T, s_0, \mathcal{W})$:

State Space (S): All possible environment configurations, minimally including a 64×64 grid of discrete cell values ($\in \{0,\dots,15\}^{64 \times 64}$) and any latent variables required for local dynamics or object properties.
Action Space (A): Unified and environment-agnostic; consists of five canonical “key” actions (e.g., move, jump, pick up, interact, advance), an Undo operation, and a grid-selection primitive $\mathit{select}(i,j)$ for $1 \leq i,j \leq 64$. Each step by the agent corresponds to one counted action.
Observation Model (O): Maps each state $s$ to a percept $o$, generally the visible frame (grid), optionally with short non-interactive animations to signal environment progression.
Transition Function (T): Deterministic step function $T : S \times A \rightarrow S$, which updates the state according to the agent's action.
Initial State ( $s_0$ ): The reset state for the first level.
Win Condition ( $\mathcal{W}$ ): Terminal predicate over S associated with each level, triggering level completion.
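The tuple above can be mirrored in code to make the roles of its components concrete. The following is a minimal sketch, not the benchmark's actual interface: the `Environment` class, its field names, and the toy single-marker environment are all hypothetical, and the state space $S$ is left implicit as the set of values the transition function can produce.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Tuple

Grid = Tuple[Tuple[int, ...], ...]  # immutable 2D frame, cell values in 0..15
SIZE = 64


@dataclass(frozen=True)
class Environment:
    """Hypothetical container mirroring e = (S, A, O, T, s_0, W)."""
    actions: Tuple[Hashable, ...]                   # A: finite action set
    observe: Callable[[Hashable], Grid]             # O: state -> visible frame
    step: Callable[[Hashable, Hashable], Hashable]  # T: deterministic transition
    initial_state: Hashable                         # s_0
    is_won: Callable[[Hashable], bool]              # W: terminal predicate


# Toy environment (an assumption for illustration): the state is the
# (row, col) position of a single marker on the grid.
_MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def toy_step(state, action):
    r, c = state
    dr, dc = _MOVES[action]
    # Clamp to the grid so the transition is total and deterministic.
    return (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))


def toy_observe(state):
    # Render the marker as value 1 on an otherwise empty 64x64 frame.
    return tuple(
        tuple(1 if (i, j) == state else 0 for j in range(SIZE))
        for i in range(SIZE)
    )


toy_env = Environment(
    actions=tuple(_MOVES),
    observe=toy_observe,
    step=toy_step,
    initial_state=(0, 0),
    is_won=lambda s: s == (0, 3),
)
```

In this toy case, three `"right"` actions from the initial state satisfy the win predicate. Keeping states hashable and frames immutable is a design choice of the sketch that makes states usable as dictionary keys during search.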

Each environment comprises five sequential levels, each more conceptually demanding than the last and built as variations on the core mechanics. Crucially, all environmental mechanisms (e.g., objects, physics, topology, interaction affordances) are restricted to variations on Core Knowledge primitives: object individuation and persistence, relational geometry/topology, qualitative physics (gravity, collision), simple agentness, but no language, numerals, or culturally-dependent tokens (Foundation, 24 Mar 2026). This ensures that success depends exclusively on fluid abstract reasoning and exploration.

2. Evaluation Metrics and Efficiency-Based Scoring Framework

ARC-AGI-3 introduces Relative Human Action Efficiency (RHAE) as its primary scoring axis, measuring the normalized efficiency of an agent's learning and problem solving relative to human baselines. For each environment, level, and the entire benchmark, scores are power-law-normalized and weighted for conceptual difficulty:

Level Efficiency Score ( $S_{l,e}$ ):

$$S_{l,e} = \min\!\left(1, \frac{h_{l,e}}{a_{l,e}}\right)^{2}$$

Here $a_{l,e}$ is the agent’s action count on level $l$ of environment $e$, and $h_{l,e}$, the second-best human baseline count on first exposure, quantifies efficient human exploration; $S_{l,e} = 0$ if the level is not completed. The squaring imposes a strong penalty for inefficiency: using twice the baseline number of actions cuts the score to one quarter.

Environment Score ( $E_e$ ):

$$E_e = \frac{\sum_{l} w_l \, S_{l,e}}{\sum_{l} w_l}, \qquad w_1 < w_2 < \dots$$

Later levels weigh more heavily, amplifying the penalty for inefficient or unsuccessful reasoning as environments progress.

Overall Benchmark Score (T):

$$T = \frac{1}{|\mathcal{E}|} \sum_{e \in \mathcal{E}} E_e$$

Here, $\mathcal{E}$ is the set of private environments constituting the official test set. $T$, which in this section denotes the benchmark score rather than the transition function of Section 1, ranges from 0 (no progress beyond random actions) to 1 (matching or exceeding human efficiency everywhere).

No partial credit is awarded for incomplete problem-solving: all scoring is relative to full solution under human-efficient resource budgets. This metric strictly penalizes brute-force approaches and enforces a formal comparison between agentic and human fluid intelligence (Foundation, 24 Mar 2026, Vahdati et al., 9 Mar 2026).
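The scoring scheme can be illustrated with a short sketch. It rests on several assumptions that the text does not fix: the level score is taken as the squared human-to-agent action ratio capped at 1, an uncompleted level scores 0, the default level weights grow linearly with the level index, and the benchmark score is an unweighted mean over environments. The function names are hypothetical.

```python
from typing import Optional, Sequence


def level_score(agent_actions: Optional[int], human_baseline: int) -> float:
    """Squared, capped efficiency ratio; None means the level was not completed."""
    if agent_actions is None:
        return 0.0  # no partial credit for an unfinished level
    if agent_actions <= 0 or human_baseline <= 0:
        raise ValueError("action counts must be positive")
    return min(1.0, human_baseline / agent_actions) ** 2


def environment_score(
    level_scores: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted mean of level scores, with later levels weighted more heavily."""
    if not level_scores:
        raise ValueError("at least one level score is required")
    if weights is None:
        # ASSUMPTION: weights 1, 2, 3, ... (the exact weighting is not given here).
        weights = range(1, len(level_scores) + 1)
    if len(weights) != len(level_scores):
        raise ValueError("weights and level scores must have equal length")
    return sum(w * s for w, s in zip(weights, level_scores)) / sum(weights)


def benchmark_score(environment_scores: Sequence[float]) -> float:
    """Unweighted mean over environments (an assumption of this sketch)."""
    if not environment_scores:
        raise ValueError("at least one environment score is required")
    return sum(environment_scores) / len(environment_scores)
```

Under these assumptions, `level_score(200, 100)` evaluates to 0.25, and an agent that completes only the first of five levels at human efficiency receives an environment score of 1/15, which shows how quickly the score collapses when later levels are missed.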

3. Human Calibration and Environment Validation

Difficulty calibration is achieved through extensive human testing and controlled experimental protocols:

Each candidate environment is attempted by at least ten untrained human participants in a single-shot session (20–30 min limit per environment), with at least two participants required to solve all levels for the environment to be admitted.
Baseline action counts are set as the second-best among all successful attempts, capturing not just optimal play but also the exploratory cost on first exposure.
Statistical analyses (per-level completion rates, video review of bottlenecks) guide iterative environment revision. Environments giving rise to ambiguous or intractable mechanics are revised or discarded.
Automated graph analysis ensures that no non-tutorial level is solvable by random play within 1,000,000 steps, and that deterministic execution, state coverage, and mechanical diversity are maintained.

Only environments passing both the automated and human calibration phases are eligible for the public and private benchmark splits. This dual-pronged procedure simultaneously ensures tractability, difficulty, and novelty (Foundation, 24 Mar 2026).
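The admission rule and the baseline selection described above can be sketched as follows. The function names and constants are hypothetical, and the handling of ties (two equal action counts are treated as separate attempts) is an assumption the text does not address.

```python
from typing import Optional, Sequence

MIN_PARTICIPANTS = 10  # at least ten untrained participants per environment
MIN_FULL_SOLVERS = 2   # at least two must solve all levels


def is_admitted(num_participants: int, num_full_solvers: int) -> bool:
    """Check the human-calibration admission rule for one candidate environment."""
    return (
        num_participants >= MIN_PARTICIPANTS
        and num_full_solvers >= MIN_FULL_SOLVERS
    )


def second_best_baseline(successful_action_counts: Sequence[int]) -> Optional[int]:
    """Return the second-lowest action count among successful attempts.

    Returns None when fewer than two successful attempts exist, since the
    second-best value is then undefined.
    """
    if len(successful_action_counts) < 2:
        return None
    return sorted(successful_action_counts)[1]
```

For example, with successful attempts of 120, 85, 97, and 140 actions, the baseline is 97: the single best run (85) is skipped, so one unusually lucky attempt does not set the bar.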

4. Evolution from Previous ARC-AGI Versions

ARC-AGI-3 marks a decisive departure from the static, demonstration-based input–output grid transformations of ARC-AGI-1 and ARC-AGI-2. The historical progression is summarized as follows:

Aspect	ARC-AGI-1/2	ARC-AGI-3
Task Format	Static grid I/O pairs	Turn-based interactive environments
Human Effort	~30s (V1), ~300s (V2)	~8 min per environment
Cognitive Demands	Analogy, abstraction	Exploration, planning, memory, alignment
Interactivity	None (one-shot inference)	Agent ↔ env. action–percept loop
Benchmark Metric	Accuracy over test outputs	Action efficiency relative to human
AI/Human Gap	Shrinking (1.0%→84.6%)	Expanding (<1% vs human 100%)

The expansion to interactive, sequential, and partially observable environments fundamentally changes the compositional and inductive demands: agents must now learn to form world-models, update latent hypotheses, plan across uncertain dynamics, and adapt on the fly, all without recourse to language, prior world knowledge, or static plan libraries (Foundation, 24 Mar 2026, Vahdati et al., 9 Mar 2026, Chollet et al., 15 Jan 2026).
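The agent-environment action-percept loop can be sketched in a few lines. This is an illustrative harness, not the official evaluation code: `run_episode`, the one-dimensional corridor environment, and the fixed policies are all hypothetical. The policy sees only the observed frame, never the underlying state, and every call to the transition function counts as one action.

```python
from typing import Callable, Hashable, Optional, Sequence


def run_episode(
    initial_state: Hashable,
    actions: Sequence[Hashable],
    observe: Callable[[Hashable], object],
    step: Callable[[Hashable, Hashable], Hashable],
    is_won: Callable[[Hashable], bool],
    policy: Callable[[object, Sequence[Hashable]], Hashable],
    max_actions: int,
) -> Optional[int]:
    """Run one level; return the number of actions used, or None on failure."""
    state = initial_state
    used = 0
    while not is_won(state):
        if used >= max_actions:
            return None  # budget exhausted without reaching the win condition
        action = policy(observe(state), actions)  # percept in, action out
        state = step(state, action)
        used += 1  # each step is one counted action
    return used


# Toy corridor (an assumption for illustration): positions 0..9, goal at 5.
def corridor_step(pos, action):
    return max(0, min(9, pos + (1 if action == "right" else -1)))


def corridor_observe(pos):
    return tuple(1 if i == pos else 0 for i in range(10))


def always_right(frame, actions):
    return "right"


def always_left(frame, actions):
    return "left"
```

With the `always_right` policy the corridor is solved in 5 actions, while `always_left` never reaches the goal and the harness returns `None` once the action budget is spent. The returned action count is the quantity that an efficiency-based score compares against the human baseline.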

5. Task Spectrum and Cognitive Demands

The ARC-AGI-3 benchmark specifically targets five axes of core agentic cognition:

Exploration: Inferring hidden environment mechanics by acting efficiently under uncertainty.
Planning: Computing and executing multi-step action sequences to reach implicit goal states with minimal cost.
Memory: Retaining and utilizing information acquired across episodes or temporally extended trials.
Goal Acquisition: Inferring task objectives from sparse or delayed feedback, as no explicit task descriptions are given.
Alignment: Maintaining intended behavior in the presence of ambiguous instructions or shifting, partially adversarial environment conditions.

This broader cognitive spectrum is enforced by environment design and empirical validation, and is absent from legacy benchmarks relying purely on pattern taxonomy or fixed symbolic inference (Foundation, 24 Mar 2026, Chollet et al., 15 Jan 2026).

6. Empirical Results and Comparative Analysis

Empirical findings across ARC-AGI-3 and its preview releases underscore the step-change in difficulty:

Human Baseline: All 1,000+ levels, spanning 150+ mini-game environments, are solvable with median success and action efficiency of 100%. The median successful human attempt lasts 8.1 minutes (Foundation, 24 Mar 2026, Vahdati et al., 9 Mar 2026).
Frontier AI Performance: As of March 2026:
- Gemini 3.1 Pro Preview: 0.37%
- GPT-5.4 (High): 0.26%
- Opus 4.6 (Max): 0.25%
- StochasticGoose (best pre-2026 RL): 12.58% (preview, public demo set)
- LLMs, program synthesis, and graph-based search remain below 1% on the official private dataset (Foundation, 24 Mar 2026, Vahdati et al., 9 Mar 2026, Rudakov et al., 30 Dec 2025).
Magnitude of the Gap: The leading ARC-AGI-2 system reached 84.6% accuracy; the best ARC-AGI-3 systems are over 6x less effective relative to human efficiency, with the absolute gap expanding as the environment complexity increases (Vahdati et al., 9 Mar 2026).
Graph Exploration Baseline: Training-free, deterministic graph-based state-action exploration solves a median of 30 of 52 levels (preview set), outperforming both random and LLM-based solvers through systematic, memory-enhanced exploration but remaining far below human proficiency (Rudakov et al., 30 Dec 2025).
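To give a sense of what graph-based state-action exploration involves, the sketch below runs a plain breadth-first search over the state graph of a deterministic environment. It is a generic textbook technique, not the method of Rudakov et al., and it makes a simplifying assumption that the benchmark does not grant: the transition function is queried as a free simulator, whereas a real agent pays one counted action for every edge it explores. All names are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Tuple


def bfs_solve(
    initial_state: Hashable,
    actions: Sequence[Hashable],
    step: Callable[[Hashable, Hashable], Hashable],
    is_won: Callable[[Hashable], bool],
    max_states: int = 100_000,
) -> Optional[List[Hashable]]:
    """Return a shortest winning action sequence, or None if none is found."""
    if is_won(initial_state):
        return []
    # Memory of visited states: state -> (previous state, action taken).
    parent: Dict[Hashable, Optional[Tuple[Hashable, Hashable]]] = {
        initial_state: None
    }
    queue = deque([initial_state])
    while queue and len(parent) <= max_states:
        state = queue.popleft()
        for action in actions:
            nxt = step(state, action)
            if nxt in parent:
                continue  # already reached by a path that is no longer
            parent[nxt] = (state, action)
            if is_won(nxt):
                # Walk the parent links back to the start to recover the plan.
                plan: List[Hashable] = []
                cur = nxt
                while parent[cur] is not None:
                    prev, act = parent[cur]
                    plan.append(act)
                    cur = prev
                return plan[::-1]
            queue.append(nxt)
    return None


# Toy 4x4 grid (an assumption for illustration): reach cell (2, 3) from (0, 0).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def grid_step(state, action):
    r, c = state
    dr, dc = MOVES[action]
    return (min(max(r + dr, 0), 3), min(max(c + dc, 0), 3))
```

On the toy grid the search returns a five-action plan, the shortest possible. The visited-state dictionary is what keeps exploration systematic rather than random, although the number of states grows quickly in realistic environments.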

The failure of “brute-force” or language-exploiting pipelines—previously effective due to exhaustive searching over compositional libraries—directly demonstrates the necessity of environment modelling, hypothesis-driven action, and compositional generalization in ARC-AGI-3 (Pfister et al., 13 Jan 2025, Foundation, 24 Mar 2026).

7. Significance, Limitations, and Open Research Challenges

ARC-AGI-3 constitutes a new gold standard for measuring fluid, generalizable intelligence under genuine environmental uncertainty and interaction:

Significance: The benchmark’s design—based purely on core, language-free priors—renders overfitting via linguistic pretraining infeasible, sharply distinguishing adaptive reasoning from rote skill.
Open Bottlenecks:
- Lack of compositional hierarchical abstraction prevents agents from compressing and transferring strategies across tasks; most AI solutions remain at the low-level action or reactive policy stratum (Vahdati et al., 9 Mar 2026).
- Sparse feedback complicates credit assignment and world-model induction, necessitating advances in meta-reasoning, episodic memory, and exploration-exploitation trade-off strategies.
- Symbol grounding in interactive contexts and the automatic formation of world-mechanic grammars remain unsolved.
- Action efficiency as a metric foregrounds energy, time, and sample usage—emphasizing resource-constrained, human-like reasoning instead of unlimited search.
Future Directions: Prospective advances are hypothesized in adaptive computation, hybrid neural-symbolic modeling, explicit memory mechanisms, attention-augmented cell-based computation, and meta-controller architectures. The expected evolution involves reporting not only action efficiency but also sample and level-transfer metrics, with human-aligned difficulty calibration maintained as the core baseline (Foundation, 24 Mar 2026, Vahdati et al., 9 Mar 2026, Xu et al., 18 Jun 2025).

ARC-AGI-3 is thus central to contemporary AGI benchmarking, providing a rigorously validated probe of adaptive reasoning in a knowledge-constrained, interaction-first domain where human fluid intelligence remains the reference standard.