ARC Prize 2025: AI Innovation Competition

Updated 20 May 2026
  • ARC Prize 2025 is an international competition that advances AI systems through rigorous few-shot generalization, compositional reasoning, and fluid intelligence tests.
  • The competition introduced the ARC-AGI-2 benchmark with complex grid tasks (up to 30x30) that require both symbolic and neural adaptation methods.
  • Key innovations center on three kinds of refinement loop (evolutionary, application-layer, and weight-space), which drove methodological advances and significant cost reductions.

ARC Prize 2025 is an international competition designed to evaluate and advance the state of artificial intelligence systems on the ARC-AGI-2 benchmark, a grid-based task suite probing few-shot generalization, compositional reasoning, and fluid intelligence. Building on the original ARC-AGI-1, ARC-AGI-2 introduces greater task complexity, necessitating both symbolic and neural methods capable of per-task adaptation. The 2025 event catalyzed methodological innovations around the concept of the refinement loop, saw the benchmark adopted as a standard reporting target across industry labs, and highlighted persistent limitations in compositional and interactive generalization.

1. ARC-AGI-2 Benchmark and Competition Structure

ARC-AGI-2 extends the grid transformation paradigm of ARC-AGI-1, featuring input/output grids up to $30 \times 30$ with $10$ discrete cell values, and retains the limit of two attempts per test input. Task design increases abstract reasoning complexity by incorporating more intricate shape and color transformations, multi-step logical dependencies, and a normalized difficulty distribution that removes the prior skew toward easier instances. Only "Core Knowledge" priors (objectness, topology, integer arithmetic) are retained to rigorously test compositional abstraction.
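The grid constraints above can be expressed as a small validity check. This is a minimal sketch, assuming grids are stored as nested lists of integers and that the ten cell values are encoded as 0 to 9; the names `Grid`, `MAX_SIDE`, `NUM_VALUES`, and `is_valid_grid` are illustrative and not part of any official tooling.

```python
from typing import List

Grid = List[List[int]]

MAX_SIDE = 30    # maximum grid height and width stated in the text
NUM_VALUES = 10  # number of discrete cell values (assumed encoded as 0..9)


def is_valid_grid(grid: Grid) -> bool:
    """Return True if `grid` is a non-empty rectangle within the stated limits."""
    if not grid or not grid[0]:
        return False
    height, width = len(grid), len(grid[0])
    if height > MAX_SIDE or width > MAX_SIDE:
        return False
    for row in grid:
        # Every row must have the same width (rectangular grid).
        if len(row) != width:
            return False
        # Every cell must be one of the allowed discrete values.
        if any(not (0 <= cell < NUM_VALUES) for cell in row):
            return False
    return True
```

A check like this is typically run on every predicted grid before it is compared against a target, so that malformed outputs are rejected early.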

Each task presents $2$–$5$ demonstration input/output pairs for few-shot inference, with $1$–$3$ held-out test inputs. The dataset comprises three non-overlapping partitions (totaling $640$ tasks): $400$ public train tasks (from ARC-AGI-1), $120$ semi-private evaluation tasks (used for API-accessible, partially exposed leaderboards), and $120$ private held-out tasks for final scoring. The competition ran March 26 – November 3, 2025, attracting $10$0 teams and $10$1 submissions, with $10$2 paper entries—a near doubling year-over-year. Final evaluation used the private split with exact-match accuracy:

$10$3

The highest score achieved on this set was $24.03\%$ (NVARC, $10$5 tasks solved) (Chollet et al., 15 Jan 2026, Vahdati et al., 9 Mar 2026).
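Exact-match scoring with two attempts can be sketched as follows. This is an illustration under stated assumptions, not the official scorer: it assumes a test input counts as solved if either of the first two attempts equals the target grid exactly, that a task's score is the fraction of its test inputs solved, and that the benchmark score is the mean task score expressed as a percentage. All function names are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[int]]


def input_solved(attempts: Sequence[Grid], target: Grid, max_attempts: int = 2) -> bool:
    """True if any of the first `max_attempts` attempts matches the target exactly."""
    return any(attempt == target for attempt in attempts[:max_attempts])


def task_score(attempts_per_input: Sequence[Sequence[Grid]], targets: Sequence[Grid]) -> float:
    """Fraction of a task's test inputs that were solved (assumed partial credit)."""
    if len(attempts_per_input) != len(targets):
        raise ValueError("need one attempt list per test input")
    solved = sum(input_solved(a, t) for a, t in zip(attempts_per_input, targets))
    return solved / len(targets)


def benchmark_accuracy(task_scores: Sequence[float]) -> float:
    """Mean task score as a percentage."""
    return 100.0 * sum(task_scores) / len(task_scores)
```

Because a match requires every cell and both dimensions to agree, a prediction that is wrong in a single cell earns no credit for that test input.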

2. Emergence and Taxonomy of Refinement Loops

The central innovation of ARC Prize 2025 is the per-task "refinement loop": a closed-loop, feedback-driven process that incrementally transforms a candidate solver (program or model) into a superior one based on verifier signals, loss, or direct reward. Distinct refinement loop variations include:

  • Evolutionary Program Synthesis: Operates in program space (symbolic or natural language), with population-based exploration, candidate verification, and selection via correctness scores. Breeding and mutation guide optimization.
  • Application-Layer Refinements ("Harnesses"): Encompasses Chain-of-Thought (CoT) prompting with intermediate verification, error-driven re-prompting, and iterative grid prediction. This flavor dominates the CoT harnesses built around industry models.
  • Weight-Space Loops (Deep Learning): Includes test-time fine-tuning of pretrained weights on per-task demonstrations, and zero-pretraining methods, which initialize and adapt small neural architectures from scratch via gradient descent on the task alone.

All top-performing Kaggle entries in 2025 adopted refinement loops, whether in symbolic program space, in weight space, or as application-layer iterative wrappers (Chollet et al., 15 Jan 2026, Vahdati et al., 9 Mar 2026).
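A program-space refinement loop of the kind described in this section can be sketched with a toy example. Everything here is an assumption made for illustration: the three-primitive DSL, the names `run`, `fitness`, and `evolve`, and the choice to enumerate every one-step extension of each survivor instead of sampling random mutations (a simplification that keeps the example deterministic). Competition entries used far richer primitive sets and search strategies.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Grid = List[List[int]]
Program = Tuple[str, ...]
Demo = Tuple[Grid, Grid]

# Hypothetical toy DSL of grid transformations.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def run(program: Program, grid: Grid) -> Grid:
    """Apply the program's primitives to the grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def fitness(program: Program, demos: Sequence[Demo]) -> float:
    """Verifier signal: fraction of demonstration pairs reproduced exactly."""
    hits = sum(run(program, x) == y for x, y in demos)
    return hits / len(demos)


def evolve(demos: Sequence[Demo], population_size: int = 6,
           generations: int = 4) -> Tuple[Program, float]:
    """Select the fittest programs, extend them by one primitive, and repeat."""
    population: List[Program] = [(name,) for name in PRIMITIVES]
    for _ in range(generations):
        # Selection: prefer higher fitness, then shorter programs.
        ranked = sorted(population, key=lambda p: (-fitness(p, demos), len(p)))
        survivors = ranked[:population_size]
        if fitness(survivors[0], demos) == 1.0:
            return survivors[0], 1.0
        # "Mutation": every one-step extension of every survivor.
        children = [p + (name,) for p in survivors for name in PRIMITIVES]
        population = survivors + children
    best = max(population, key=lambda p: fitness(p, demos))
    return best, fitness(best, demos)
```

For a task whose demonstrations all show a 180-degree rotation, the loop finds a two-step program (a horizontal flip followed by a vertical flip) in its second generation, and `run(best, test_input)` then produces the prediction for a held-out input.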

3. Winning Methods and Implementation Details

Top solutions spanned diverse architectures, ensembling strategies, and synthetic data pipelines. The leaders, summarized below, all incorporated refinement mechanisms and test-time or per-task adaptation, and relied heavily on synthetic data augmentation for coverage.

| Place | Team | Score (%) | Refinement loop flavor |
| --- | --- | --- | --- |
| 1st | NVARC | 24.03 | Weight-space + data synthesis |
| 2nd | The ARChitects | 16.53 | Weight-space + recursive CoT |
| 3rd | MindsAI | 12.64 | Weight-space test-time training |
| 4th | Lonnie | 6.67 | Evolutionary program synthesis |
| 5th | G. Barbadillo | 6.53 | Hybrid search + learning |

4. Industry Benchmarks, Model Standardization, and Cost Analyses

ARC-AGI-2 catalyzed industry-wide benchmarking, with Anthropic (Opus 4.5), Google DeepMind (Gemini 3), OpenAI (“ChatGPT thinking mode”), and xAI all reporting model card results. Application-layer refinement via CoT harnesses and verifiers enabled leading models to raise scores from $2$0 to $2$1 (e.g., Gemini 3 Pro baseline $2$2, $2$3 with Poetiq harness; Opus 4.5 baseline $2$4, $2$5 with application-layer refinement).

Cost per task decreased precipitously over one year: OpenAI o3 ($2$6 at \$2790.5%790.5\%2$811.64/task), corresponding to a $2$9 cost reduction. On ARC-AGI-2, private leaderboard costs for the top Kaggle methods were $5$00.20$5$1>\$5$2/task. However, scoring efficiency (points per dollar) remained much higher for Kaggle-constrained methods (Vahdati et al., 9 Mar 2026, Chollet et al., 15 Jan 2026).

5. The Role and Limits of Tiny and Zero-Pretrained Networks

ARC Prize 2025 recognized the performance of zero-pretraining few-shot learners:

  • Tiny Recursive Model (TRM, Jolicoeur-Martineau, 1st Paper Award): A 7M-parameter architecture that iteratively updates its latent state and output over 16 refinement steps per test. It achieved $5$3 on ARC-AGI-1, $5$4 on ARC-AGI-2.
  • CompressARC (Liao & Gu, 3rd Paper Award): A $5$5K parameter VAE/MDL-driven network, trained from scratch per task, solving ARC-AGI-1 puzzles in $5$6 minutes per instance at $5$7 accuracy.

These approaches demonstrated that task-specific gradient-based adaptation without pretraining is competitive on certain instances, but flagged the increasing difficulty and generalization gap from ARC-AGI-1 to ARC-AGI-2, particularly for compositional tasks (Chollet et al., 15 Jan 2026).
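The idea of adapting a small model from scratch on a single task can be illustrated with a deliberately tiny example. The sketch below is neither TRM nor CompressARC: it is a hypothetical 10×10 logit table, initialized to zero and trained by gradient descent on cross-entropy, that can only learn per-cell color substitutions from one task's demonstration pairs. It assumes input and output grids have the same shape; `fit_color_map` and `predict` are illustrative names.

```python
import math
from typing import List, Sequence, Tuple

Grid = List[List[int]]
NUM_COLORS = 10


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fit_color_map(demos: Sequence[Tuple[Grid, Grid]], steps: int = 200,
                  lr: float = 1.0) -> List[List[float]]:
    """Train logits[src][dst] from zero initialization on one task's demos."""
    logits = [[0.0] * NUM_COLORS for _ in range(NUM_COLORS)]
    # Each aligned (input cell, output cell) pair is one training example.
    cells = [(a, b) for x, y in demos
             for row_x, row_y in zip(x, y)
             for a, b in zip(row_x, row_y)]
    for _ in range(steps):
        grad = [[0.0] * NUM_COLORS for _ in range(NUM_COLORS)]
        for src, dst in cells:
            probs = softmax(logits[src])
            for c in range(NUM_COLORS):
                # Cross-entropy gradient: predicted probability minus one-hot target.
                grad[src][c] += (probs[c] - (1.0 if c == dst else 0.0)) / len(cells)
        for s in range(NUM_COLORS):
            for c in range(NUM_COLORS):
                logits[s][c] -= lr * grad[s][c]
    return logits


def predict(logits: List[List[float]], grid: Grid) -> Grid:
    """Map each cell to its highest-scoring output color."""
    return [[max(range(NUM_COLORS), key=lambda c: logits[v][c]) for v in row]
            for row in grid]
```

The model has no knowledge beyond the demonstrations it was fitted on, so colors that never appear in a demonstration input keep uniform logits and are not predicted reliably. That limitation mirrors, in miniature, why purely per-task learners struggle once tasks require composing several concepts.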

6. Generalization, Benchmark Contamination, and Open Challenges

Despite procedural and architectural progress, performance on compositional generalization is constrained. The observed failure ratio from ARC-AGI-1 to ARC-AGI-2 for frontier models is $5$8–$5$9, with a drop from $1$0 (ARC-AGI-1, Opus 4.6) to $1$1 (ARC-AGI-2), and a sharper cliff for Kaggle-constrained solutions ($1$2–$1$3). Human expert panels maintain $1$4 accuracy, with average individual accuracy at $1$5 (ARC-AGI-2).

A new contamination vector was identified: large language models exhibited knowledge-dependent overfitting caused by memorization of ARC-related artifacts (e.g., color codes, JSON grid formats) present in public repositories, not merely by train-test leakage. LLM outputs mapped color codes correctly without explicit ARC task cues, suggesting knowledge coverage rather than genuine abstraction-driven generalization.

Key unresolved challenges include:

  • Extending compositional generalization beyond 2–3 reasoning steps (intractable search space growth, lack of hierarchical decomposition).
  • Symbol grounding, with models failing to acquire abstract, transferable primitive concepts (e.g., “rotation”).
  • Transfer from static puzzle solving to interactive task settings, which static refinements and per-task loops do not address (Vahdati et al., 9 Mar 2026).
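The search space growth mentioned in the first challenge can be made concrete with a short calculation. This sketch assumes the simplest possible program space, flat sequences of argument-free primitives, so the count of candidate programs of length $d$ over $b$ primitives is $b^d$; real program spaces with arguments and branching grow faster. The function name and the figure of 50 primitives are illustrative.

```python
def programs_up_to_depth(num_primitives: int, depth: int) -> int:
    """Count primitive sequences of length 1..depth (flat programs, no arguments)."""
    return sum(num_primitives ** d for d in range(1, depth + 1))


if __name__ == "__main__":
    # Illustrative only: 50 primitives is an arbitrary library size.
    for depth in (2, 3, 5, 8):
        print(depth, programs_up_to_depth(50, depth))
```

Each additional reasoning step multiplies the number of candidates by the size of the primitive library, which is why exhaustive search stops being practical beyond a few steps without some form of hierarchical decomposition or learned guidance.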

7. Transition to Interactive ARC-AGI-3 and Future Directions

ARC-AGI-3, previewed late 2025 and to be released in early 2026, introduces interactive “mini-worlds” to test exploration, planning, persistent memory, goal acquisition, and alignment. The new metric will jointly evaluate task success rate and action efficiency, enabling formal comparison of human and AI learning efficiency. Early results on ARC-AGI-3 revealed a substantial gap: the top AI achieved $1$6 action-efficiency versus $1$7 for humans at minimal effort.
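One way to evaluate success rate and action efficiency jointly is sketched below. The official ARC-AGI-3 metric is not specified in this text, so the definitions here are assumptions: success rate is the fraction of episodes solved, and action efficiency is the mean, over solved episodes, of the human baseline action count divided by the agent's action count, capped at 1. The `Episode` record and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    solved: bool
    agent_actions: int           # actions the agent took in this episode
    human_baseline_actions: int  # assumed reference count from human play


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes the agent solved."""
    return sum(e.solved for e in episodes) / len(episodes)


def action_efficiency(episodes: Sequence[Episode]) -> float:
    """Mean capped ratio of human to agent actions over solved episodes."""
    ratios = [
        min(1.0, e.human_baseline_actions / e.agent_actions)
        for e in episodes
        if e.solved and e.agent_actions > 0
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Under these assumed definitions, an agent that solves a game in twice as many actions as the human baseline scores 0.5 on that episode, while one that matches or beats the baseline scores 1.0.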

Moving beyond per-task refinement loops, ARC-AGI-3 is expected to require “true agentic reasoning,” encompassing online goal inference and hypothesis-driven exploration. This shift addresses both benchmark contamination and the limitations of static, knowledge-bounded approaches (Chollet et al., 15 Jan 2026, Vahdati et al., 9 Mar 2026).
