ARC-AGI-3: Interactive AGI Benchmark
- ARC-AGI-3 is a benchmark that tests agentic fluid intelligence via interactive POMDP environments, demanding exploration, planning, memory, and alignment.
- The benchmark extends static grid tasks into multi-turn reasoning problems, requiring agents to operate under partial observability and dynamic feedback.
- ARC-AGI-3 evaluates performance using metrics like success rate and action efficiency, providing a human-calibrated measure of agent adaptation in complex tasks.
The ARC-AGI-3 benchmark is the third generation of the Abstraction and Reasoning Corpus for Artificial General Intelligence, designed to evaluate agentic “fluid intelligence” through interactive reasoning tasks under conditions of novelty and exploration. Unlike its predecessors, which focused on static grid transformations, ARC-AGI-3 introduces partially observable, multi-step environments formulated as POMDPs, compelling agents to engage in exploration, planning, memory management, goal inference, and alignment with implicit constraints. This shift positions ARC-AGI-3 as a litmus test for general, adaptive reasoning that goes beyond coverage of known symbolic transformations (Chollet et al., 15 Jan 2026, Foundation, 24 Mar 2026, Vahdati et al., 9 Mar 2026).
1. Formalism: Task Model and Evaluation Metrics
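ARC-AGI-3 defines each “task” as a finite-horizon, partially observable Markov decision process (POMDP) $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, Z, R, B \rangle$, where:
- $\mathcal{S}$: finite set of latent states (grid, objects, agent, etc.)
- $\mathcal{A}$: finite action set (e.g., movement, manipulation, subroutine queries)
- $\mathcal{O}$: set of agent observations (partial views, demonstrations, text hints)
- $T(s' \mid s, a)$: transition kernel
- $Z(o \mid s', a)$: observation kernel
- $R \in \{0, 1\}$: binary success signal (task solved)
- $B$: global episode action budget
Interaction proceeds via $a_t \sim \pi(\cdot \mid o_{1:t})$, $s_{t+1} \sim T(\cdot \mid s_t, a_t)$, where $\pi$ is the agent's policy, with the environment returning $o_{t+1} \sim Z(\cdot \mid s_{t+1}, a_t)$ at each step. The core evaluation metrics are:
- Success Rate (“SR”): $\mathrm{SR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\text{task } i \text{ is a success}]$ over the $N$ evaluated tasks
- Action Efficiency (“AE”): a per-task measure of action economy that compares the agent's action count with human performance, averaged over tasks
where “success” means the task is solved within the action budget $B$.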
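As a minimal sketch of the success-rate metric described above, the snippet below counts an episode as a success only if it was solved without exceeding the action budget. The names `EpisodeResult` and `success_rate` are hypothetical and chosen for illustration; they are not taken from the benchmark's tooling.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one episode (hypothetical record layout)."""
    task_id: str
    solved: bool        # binary success signal reported by the environment
    actions_used: int   # number of actions the agent spent


def success_rate(results: Iterable[EpisodeResult], budget: int) -> float:
    """Fraction of episodes solved within the action budget."""
    results = list(results)
    if not results:
        raise ValueError("need at least one episode result")
    successes = sum(1 for r in results if r.solved and r.actions_used <= budget)
    return successes / len(results)


episodes = [
    EpisodeResult("t1", solved=True, actions_used=40),
    EpisodeResult("t2", solved=True, actions_used=120),   # solved, but over budget
    EpisodeResult("t3", solved=False, actions_used=100),
    EpisodeResult("t4", solved=True, actions_used=100),
]
print(success_rate(episodes, budget=100))
```

In a real harness the environment would normally stop an episode once the budget is spent, so the over-budget case serves here only to make the definition explicit.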
The evaluation protocol utilizes a three-way data split:
- Public training set (∼400 tasks: static and interactive)
- Semi-private evaluation set (∼120 tasks; interim leaderboard)
- Private evaluation set (∼120 tasks; final scoring)
Agents are evaluated in zero-shot (no demonstrations) and few-shot (a small number of demonstrations) regimes. Reporting includes average success rates and average action efficiency scores for each split, with an overall score computed for the private leaderboard (Chollet et al., 15 Jan 2026).
2. Interactive Reasoning Categories and Benchmark Structure
ARC-AGI-3 tasks are grouped into five archetypal “interactive reasoning” categories:
- Exploration: Agents operate in procedurally generated mazes, mapping connectivity and discovering affordances (e.g., which doors open), emphasizing efficient mapping under partial observability.
- Planning: Multi-step puzzles such as Sokoban-style block rearrangement, testing lookahead, constraint satisfaction, and deadlock avoidance.
- Memory: Tasks require agents to maintain and retrieve information acquired in earlier states (e.g., recalling in a later room a symbol seen in the first room), highlighting long-range credit assignment and working/external memory use.
- Goal Acquisition (Inference): Few-shot demonstrations imply the underlying reward structure; agents generalize to new instances without explicit instruction.
- Alignment: Agents must model implicit human preferences or safety constraints not directly specified by success signals (e.g., avoiding unsafe chemical mixing), targeting safe RL and meta-preference inference (Chollet et al., 15 Jan 2026).
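To make the Exploration category concrete, the following sketch shows one common strategy, frontier-based exploration, on a toy maze: the agent senses only the open cells next to its current position and repeatedly walks to the nearest cell it has discovered but not yet visited. The maze layout, the function names, and the choice of strategy are all illustrative assumptions; none of them come from the benchmark or from any reported agent.

```python
from collections import deque

# Hypothetical toy maze: '#' is a wall, anything else is open floor.
MAZE = [
    "#######",
    "#S..#.#",
    "#.#.#.#",
    "#.#...#",
    "#######",
]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def free_neighbors(cell):
    """Local observation: open cells adjacent to `cell`."""
    r, c = cell
    return [(r + dr, c + dc) for dr, dc in MOVES if MAZE[r + dr][c + dc] != "#"]


def path_to_nearest_frontier(start, known_edges, visited):
    """BFS over the map known so far to the closest discovered-but-unvisited cell."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] not in visited:
            return path
        for nxt in known_edges[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # every discovered cell has been visited


def explore(start):
    """Visit every reachable cell; return the visited set and the moves spent."""
    visited, known_edges = set(), {}
    pos, steps = start, 0
    while True:
        visited.add(pos)
        known_edges[pos] = free_neighbors(pos)  # the map grows only from local sensing
        path = path_to_nearest_frontier(pos, known_edges, visited)
        if path is None:
            return visited, steps
        steps += len(path) - 1
        pos = path[-1]


visited, steps = explore((1, 1))
print(len(visited), "cells mapped in", steps, "moves")
```

The move count is the quantity an efficiency-based metric would penalize, which is why the order in which frontiers are visited matters even when full coverage is guaranteed.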
The dataset comprises hundreds of interactive environments generated procedurally and by hand, with each task accessed via standard HTTP/JSON or gRPC protocols offering Reset/Step/Render APIs. Feedback is strictly binary (success/failure), without any shaping rewards.
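The Reset/Step/Render pattern with binary-only feedback can be illustrated with a small in-memory environment. This is a sketch under stated assumptions: `ToyCorridorEnv`, `StepResult`, the action names, and the one-cell view radius are hypothetical, and the actual ARC-AGI-3 HTTP/JSON or gRPC interface is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: str  # partial view: cells within one step of the agent
    success: bool     # binary signal only, no shaped reward
    done: bool        # True on success or when the action budget runs out


class ToyCorridorEnv:
    """Hypothetical 1-D corridor: reach the rightmost cell within the budget."""

    def __init__(self, length: int = 6, budget: int = 10):
        self.length, self.budget = length, budget
        self.reset()

    def _observe(self) -> str:
        lo, hi = max(0, self.pos - 1), min(self.length, self.pos + 2)
        return self.render()[lo:hi]

    def reset(self) -> str:
        self.pos, self.used = 0, 0
        return self._observe()

    def step(self, action: str) -> StepResult:
        if self.used >= self.budget:
            raise RuntimeError("action budget exhausted; call reset()")
        self.used += 1
        delta = {"left": -1, "right": 1}[action]
        self.pos = min(self.length - 1, max(0, self.pos + delta))
        success = self.pos == self.length - 1
        return StepResult(self._observe(), success, success or self.used >= self.budget)

    def render(self) -> str:
        """Full state as text ('A' agent, 'G' goal); the agent never sees all of it."""
        cells = ["."] * self.length
        cells[-1] = "G"
        cells[self.pos] = "A"
        return "".join(cells)


env = ToyCorridorEnv()
obs = env.reset()
result = None
while result is None or not result.done:
    result = env.step("right")  # a fixed policy, just to exercise the loop
print(result.success, env.used)
```

Because the only feedback is the final success flag, an agent in this setting has to decide where to go from its partial observations alone rather than from intermediate rewards.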
3. Comparison with Predecessor Benchmarks
The transition from ARC-AGI-2 to ARC-AGI-3 introduces several major differences:
- Task Interactivity: ARC-AGI-2 features static, single-step grid transformations, whereas ARC-AGI-3 requires multi-turn, sequential interaction within environments.
- Cognitive Demand: Beyond abstraction and symbolic rule induction, ARC-AGI-3 targets new axes: exploration, memory, hypothesis management, and alignment.
- Efficiency Metrics: ARC-AGI-3 introduces action efficiency as a critical measure, directly comparing human and agent performance not only on completion, but on action economy.
- Evaluation Protocol: Unlike two-try protocols in ARC-AGI-2, ARC-AGI-3 enforces single fixed-length episodes, emphasizing anytime and in-situ reasoning (Chollet et al., 15 Jan 2026, Vahdati et al., 9 Mar 2026, Foundation, 24 Mar 2026).
Table: Key differences
| Property | ARC-AGI-2 | ARC-AGI-3 |
|---|---|---|
| Task Format | Static grid mapping | Multi-step, interactive POMDP |
| Cognitive Facets | Abstraction, rules | + Exploration, memory, planning |
| Metric | Success rate | Success rate, action efficiency |
| Attempts | 2 per test input | 1 fixed-length interactive episode |
| Agent feedback | Grid I/O, static | Partial observation, sparse reward |
4. Methodological Advances and Agent Architectures
ARC-AGI-3 catalyzes development of hybrid architectures integrating reinforcement learning, program synthesis, and symbolic/planning components. The benchmark report highlights critical emergent methods:
- Refinement loops: Iterative agentic cycles, whether program-level evolutionary synthesis or application-layer controller refinement, are essential. Under partial observability they face combinatorial explosion, which calls for RL-driven exploration rather than pure symbolic backtracking.
- Neural mechanisms: Zero-pretraining models (e.g., tiny recursive or compression models) achieve competitive static performance, but fall short on memory and dynamic reasoning without explicit architectural extensions (e.g., attention, working memory buffers).
- Chain-of-Thought integration: LLMs need to be tightly coupled to environment rollouts and verifiers so that their agentic reasoning is experimental (testing hypotheses in the environment) rather than example-driven.
- Meta-preference and Alignment: Explicit instruction is now insufficient; agents must learn to infer implicit human values and safety invariants from minimal feedback (Chollet et al., 15 Jan 2026, Vahdati et al., 9 Mar 2026).
5. Empirical Performance and the Human–AI Gap
Empirical studies confirm a severe degradation in AI performance when extending from static reasoning to ARC-AGI-3’s interactive regime:
- State-of-the-art RL agents achieve ~12.6% action efficiency on the ARC-AGI-3 private hold-out set; humans are near 100% across >1000 level attempts (Vahdati et al., 9 Mar 2026).
- Performance gap: The best AI systems use far more actions than humans, leaving an absolute efficiency gap of more than 87 percentage points.
- Generation gap: The ratio of AI performance on ARC-AGI-3 to that on ARC-AGI-2 is ~0.18, an ~82% relative collapse between successive benchmark generations, while human performance remains at ceiling (Vahdati et al., 9 Mar 2026).
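The two headline gaps follow directly from the reported figures, taking human efficiency as approximately 100%:

$$
\text{efficiency gap} \approx 100\% - 12.6\% = 87.4 \text{ percentage points}, \qquad \text{relative collapse} = 1 - 0.18 = 0.82 \;(\approx 82\%).
$$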
Contributing factors:
- Compositional generalization limits: AI pipelines tuned to static, compositional rule application have no means to hypothesize or generalize from latent, emergent environment dynamics.
- Inadequate test-time adaptation: The "refinement loop" for static program synthesis is inadequate for exploration- and feedback-driven RL settings.
- Knowledge-bound reasoning: Existing models are constrained by coverage—either from training on synthetic corpora or libraries—rather than open-ended concept induction (Vahdati et al., 9 Mar 2026, Chollet et al., 15 Jan 2026).
6. Benchmark Construction, Validation, and Human Anchoring
ARC-AGI-3 environments are hand-authored and procedurally generated to avoid trivial variation and ensure compositional difficulty. Key validation features:
- Design constraints: Multiple mechanics per environment; later levels require integrated reasoning beyond basic primitives.
- Automated safeguards: Random play and fuzzing must not solve non-tutorial levels at a meaningful rate, enforced via graph-based state-space exploration (Foundation, 24 Mar 2026).
- Human calibration: Each candidate environment requires at least two independent first-solve completions out of ten untrained participants to ensure universal human solvability.
- Scoring framework: Relative Human Action Efficiency (RHAE) underpins all evaluation, using the second-best first-run human completion as the benchmark for each level. Scores are quadratic in inefficiency (e.g., using 10× more actions yields 1% credit, since $(1/10)^2 = 0.01$), with all levels capped at 5× human action count (Foundation, 24 Mar 2026).
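The scoring description above can be sketched as follows. Several details are assumptions made for illustration rather than the published rule: that "second-best" means the second-lowest action count, that credit is the squared ratio of human to agent actions clipped at 1, and that a run exceeding the 5× cap earns no credit. All function names are hypothetical.

```python
from typing import Optional, Sequence


def human_baseline(first_run_action_counts: Sequence[int]) -> int:
    """Second-lowest action count among first-run human completions (assumed reading)."""
    if len(first_run_action_counts) < 2:
        raise ValueError("need at least two human completions")
    return sorted(first_run_action_counts)[1]


def quadratic_credit(human_actions: int, agent_actions: int) -> float:
    """Credit that falls with the square of the action ratio, clipped at 1 (assumed form)."""
    if human_actions <= 0 or agent_actions <= 0:
        raise ValueError("action counts must be positive")
    return min(1.0, (human_actions / agent_actions) ** 2)


def level_score(human_actions: int, agent_actions: Optional[int],
                cap_multiple: float = 5.0) -> float:
    """Per-level score; `agent_actions=None` marks an unsolved level."""
    if agent_actions is None or agent_actions > cap_multiple * human_actions:
        return 0.0  # assumption: unsolved or over-cap runs earn nothing
    return quadratic_credit(human_actions, agent_actions)


baseline = human_baseline([30, 12, 18])  # three hypothetical human runs
print(baseline, level_score(baseline, 36))
```

Under the squared ratio, doubling the human action count already cuts credit to one quarter, which is what makes the metric much harsher on wasteful exploration than a linear ratio would be.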
7. Implications and Future Directions
ARC-AGI-3 establishes a rigorous environment for isolating the mechanisms of general, agentic reasoning under novelty. Key implications:
- Unsolved dimensions: Compositional abstraction, exploration-planning synergies, and meta-preference reasoning remain unsolved in current models; the benchmark remains unsaturated as of March 2026 (Vahdati et al., 9 Mar 2026, Foundation, 24 Mar 2026).
- Agent design: Progress requires genuinely hybrid systems integrating RL exploration/reasoning, robust memory architectures, and program synthesis in closed feedback loops.
- Benchmarking strategy: Avoidance of synthetic overfitting, enforcement of human-anchored efficiency metrics, and systematic separation of knowledge-bound versus skill-acquisition efficiency are essential for tracking AGI progress.
ARC-AGI-3 provides a human-calibrated test of fluid intelligence that compels agents to learn, plan, and align with environment dynamics on the fly, establishing a reference point for claims of progress towards artificial general intelligence (Chollet et al., 15 Jan 2026, Foundation, 24 Mar 2026, Vahdati et al., 9 Mar 2026).