
ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems (2505.11831v1)

Published 17 May 2025 in cs.AI

Abstract: The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), introduced in 2019, established a challenging benchmark for evaluating the general fluid intelligence of artificial systems via a set of unique, novel tasks only requiring minimal prior knowledge. While ARC-AGI has spurred significant research activity over the past five years, recent AI progress calls for benchmarks capable of finer-grained evaluation at higher levels of cognitive complexity. We introduce ARC-AGI-2, an upgraded version of the benchmark. ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to provide a more granular signal to assess abstract reasoning and problem-solving abilities at higher levels of fluid intelligence. To contextualize the difficulty and characteristics of ARC-AGI-2, we present extensive results from human testing, providing a robust baseline that highlights the benchmark's accessibility to human intelligence, yet difficulty for current AI systems. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.

Overview of ARC-AGI-2: Assessing Frontier AI Reasoning

The paper "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems" introduces an updated benchmark called ARC-AGI-2, an evolution of the original Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI-1), initiated in 2019. As AI technology rapidly advances, the need for robust measures of AI capabilities becomes paramount, and ARC-AGI-2 addresses this by presenting tasks of higher cognitive complexity. The benchmark is crafted to evaluate and drive development of artificial systems toward general fluid intelligence akin to that exhibited by humans.

Historical Context and Development

ARC-AGI-1 was conceived to evaluate fluid intelligence: the ability to solve novel problems independently of domain-specific knowledge. The benchmark consists of grid-based reasoning tasks that require deducing a transformation rule from a handful of example input-output pairs and applying it to new test inputs. Since its inception, ARC-AGI has been the subject of several notable competitions, yet AI scores have consistently lagged behind human baseline performance, underscoring the benchmark's difficulty for current AI models.
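For concreteness, ARC-style tasks are typically distributed as JSON files containing "train" demonstration pairs and "test" pairs, with grids encoded as lists of rows of integer colour codes (0-9). The snippet below is a minimal sketch of that format using an invented toy task whose transformation swaps the two columns:

```python
import json

# A toy ARC-style task in the public ARC-AGI JSON format: "train" holds
# demonstration pairs, "test" holds held-out pairs, and each grid is a
# list of rows whose cells are integer colour codes from 0 to 9.
task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
"""

task = json.loads(task_json)
for pair in task["train"]:
    print("input:", pair["input"], "-> output:", pair["output"])
# A solver must infer the transformation (here: swap the two columns)
# from the demonstrations alone, then apply it to each test input.
```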

Core Features and Limitations of ARC-AGI-1

ARC-AGI-1 was built around three fundamental characteristics: resistance to overfitting, minimal prior-knowledge requirements, and feasibility for human solvers. Despite these strengths, several limitations were identified:

  • Susceptibility to Brute Force: Some tasks can be solved by brute-force program search rather than genuine reasoning, undermining the benchmark's intent to assess abstraction (a sketch of this weakness follows the list).
  • Lack of Reliable Human Baselines: The absence of standardized human testing data limits the ability to gauge AI performance relative to human competence.
  • Approaching Saturation: The benchmark's difficulty ceiling is too low to capture the upper range of fluid intelligence, so top scores approach saturation.
  • Uneven Difficulty Distribution: Task complexity is unevenly distributed across evaluation subsets, complicating the interpretation of AI performance.
  • Information Leakage Risk: Continued reuse of private evaluation tasks risks AI systems tuning to these specific tasks rather than developing general reasoning abilities.
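To make the brute-force concern concrete, the sketch below enumerates compositions of a handful of grid primitives and returns the first program consistent with all demonstration pairs. The primitive set and search depth are invented for this illustration; real brute-force solvers search far larger domain-specific languages, which is precisely the loophole ARC-AGI-2 aims to close.

```python
from itertools import product
import numpy as np

# Hypothetical mini-DSL of grid transformations (names invented here).
PRIMITIVES = {
    "identity": lambda g: g,
    "flip_lr": lambda g: np.fliplr(g),
    "flip_ud": lambda g: np.flipud(g),
    "rot90": lambda g: np.rot90(g),
    "transpose": lambda g: g.T,
}

def brute_force_solve(train_pairs, max_depth=3):
    """Enumerate all primitive compositions up to max_depth and return
    the first program that matches every demonstration pair."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(g, names=names):
                for name in names:
                    g = PRIMITIVES[name](g)
                return g
            if all(np.array_equal(program(np.array(i)), np.array(o))
                   for i, o in train_pairs):
                return names  # solved without any abstraction or reasoning
    return None

pairs = [([[0, 1], [1, 0]], [[1, 0], [0, 1]])]
print(brute_force_solve(pairs))  # e.g. ('flip_lr',)
```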

Advancements with ARC-AGI-2

ARC-AGI-2 addresses the previous benchmark’s limitations by maintaining its foundational design principles while incorporating tasks of increased complexity. Several improvements include:

  • Human Testing Calibration: Extensive first-party human testing establishes robust baseline performance metrics.
  • Reduction of Brute-Forcible Tasks: Task design has been refined to discourage reliance on computationally intensive brute-force techniques.
  • Wider Signal Bandwidth: The task set encompasses a broader spectrum of difficulty, facilitating differentiation among AI systems with varying reasoning capabilities.
  • Consistency Across Subsets: Calibration of task difficulty aims to ensure that performance metrics are predictive across different evaluation subsets.

Evaluation and Task Design Paradigms

ARC-AGI-2 tasks demand higher levels of compositional generalization than their predecessors. They are characterized by:

  • Multi-Rule and Multi-Step Reasoning: Tasks often require applying several interacting rules at once, with sequential dependencies between steps.
  • Contextual Rule Application: Which rule applies may depend on specific contextual elements within the task grid.
  • In-Context Symbol Definition: Symbols whose meanings are established only within the task itself pose substantial challenges for AI systems (a toy illustration of contextual, symbol-driven rules follows this list).
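The sketch below is a hypothetical illustration of contextual rule application combined with an in-context symbol: the marker convention and the rules themselves are invented, but they show why a solver must infer rule *selection*, not just the rules.

```python
import numpy as np

def apply_contextual_rule(grid: np.ndarray) -> np.ndarray:
    """Toy example: the top-left cell acts as an in-context symbol that
    selects which transformation applies to the rest of the grid."""
    marker = grid[0, 0]           # symbol defined only within this task
    body = grid[1:, :]            # the payload to be transformed
    if marker == 1:
        return np.fliplr(body)    # symbol 1 means "mirror horizontally"
    if marker == 2:
        return np.rot90(body, 2)  # symbol 2 means "rotate 180 degrees"
    return body                   # unknown symbol: identity

g = np.array([[1, 0, 0],
              [3, 4, 5],
              [6, 7, 8]])
print(apply_contextual_rule(g))   # mirrored payload, because marker == 1
```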

The updated scores reflect a marked increase in task difficulty for AI models: top models score below 3% on the ARC-AGI-2 Semi-Private Evaluation set, in stark contrast to their performance on ARC-AGI-1.
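For reference, ARC-style evaluation is exact-match with no partial credit: a test input counts as solved only if one of the allowed attempts (typically two under ARC Prize rules) reproduces the expected output grid cell for cell. A minimal sketch of that metric:

```python
import numpy as np

def score_task(attempts_per_test, expected_outputs):
    """Fraction of test inputs where any attempt exactly matches the
    expected grid; a single wrong cell scores zero for that attempt."""
    solved = sum(
        any(np.array_equal(np.array(a), np.array(exp)) for a in attempts)
        for attempts, exp in zip(attempts_per_test, expected_outputs)
    )
    return solved / len(expected_outputs)

expected = [[[0, 3], [3, 0]]]
attempts = [[[[3, 0], [0, 3]], [[0, 3], [3, 0]]]]  # second attempt correct
print(score_task(attempts, expected))  # 1.0
```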

Implications and Future Directions

The introduction of ARC-AGI-2 underscores the commitment to advancing AI systems toward more generalized reasoning abilities, with potential implications for developing models that can emulate human-like problem-solving across diverse domains. As AI models continue to evolve, future research will likely explore new paradigms and methodologies to meet the challenges posed by ARC-AGI-2, fostering advancements towards genuine artificial general intelligence. Further developments may include more nuanced task designs, collaboration across research communities, and real-time adaptation techniques to enhance model flexibility and performance under constrained conditions.

Authors (5)
  1. Francois Chollet (7 papers)
  2. Mike Knoop (2 papers)
  3. Gregory Kamradt (2 papers)
  4. Bryan Landers (2 papers)
  5. Henry Pinkard (6 papers)

Reddit

  1. ARC-AGI-2 Reasoning Benchmark Released (34 points, 4 comments)