- The paper presents a comprehensive analysis of the ARC-AGI benchmark, revealing its resistance to traditional AI techniques and highlighting the gap in achieving AGI.
- The paper details emerging methodologies like deep learning-guided program synthesis and test-time training, with top approaches reaching scores up to 55.5%.
- The paper outlines significant implications for future AGI research, promoting open science, transparent collaboration, and enhanced adaptability in AI models.
Overview of the ARC Prize 2024 Technical Report
The ARC Prize 2024 technical report offers a comprehensive analysis of the progress and challenges associated with ARC-AGI, a crucial yet unsolved benchmark aimed at evaluating artificial general intelligence (AGI) systems. Established five years prior, ARC-AGI has proven resistant to advances in AI, including the rise of large language models (LLMs). This report explores the outcomes of the ARC Prize 2024, a competition designed to spur innovation and open scientific discourse toward achieving AGI, particularly by incentivizing the development of models capable of attaining a benchmark score of 85% on ARC-AGI tasks.
Benchmark Overview and Historical Context
ARC-AGI, originally introduced by François Chollet in 2019, is characterized by tasks that require only human core knowledge, making them broadly accessible without specialized world knowledge or language skills. The dataset consists of 1,000 tasks divided into a public training set and public, semi-private, and private evaluation sets. Despite being accessible to humans, who typically achieve nearly perfect scores, AI systems have struggled significantly with ARC-AGI due to its design, which emphasizes generalization and adaptability beyond training data. Previous attempts, such as the 2020 Kaggle competition and subsequent ARCathons, highlighted the inadequacy of traditional deep learning models, which achieved success rates of no more than roughly 1%.
ARC Prize 2024 Results
The ARC Prize 2024, conducted between June and November 2024, attracted 1,430 teams and featured multiple prize categories, yet the Grand Prize for achieving an 85% score remained unclaimed. Nevertheless, significant progress was made: the leading team, MindsAI, reached a score of 55.5% but opted not to open-source their solution, thereby forfeiting eligibility for a prize. The competition provided insights into emerging methodologies, particularly the fusion of deep learning-guided program synthesis and test-time training (TTT) strategies. The competition's Kaggle leaderboard required submissions to run without internet access, ensuring standalone assessment, while the ARC-AGI-Pub leaderboard allowed a more relaxed setup to evaluate LLM capabilities.
Emerging Methodologies
Notable contributions from the ARC Prize 2024 have underscored advancements in several key areas:
- Deep Learning-Guided Program Synthesis: This approach leverages LLMs to generate code or guide program search processes within domain-specific languages (DSLs), aiming to mitigate the combinatorial explosion problem of brute-force searches. Ryan Greenblatt's work exemplifies the potential of this strategy, as it achieved a 42% success rate on ARC-AGI-Pub using GPT-4o to synthesize Python programs.
- Test-Time Training (TTT): TTT has emerged as an effective approach to dynamically adapt models to specific task requirements during the inference phase. It entails fine-tuning pre-trained models on demonstration pairs to enhance task-specific performance. Notable implementations from the ARC Prize include the ARChitects’ TTT model, which achieved a 53.5% score on the private evaluation set.
- Combining Induction and Transduction: This hybrid approach addresses the complementary strengths of program synthesis (induction) and direct output prediction (transduction), enabling better performance across diverse task types.
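The core loop of the program-synthesis (induction) approach described above can be sketched in a few lines: sample many candidate programs, then keep only those that reproduce every demonstration pair. The sketch below is illustrative, not code from any competitor's solution; the candidates here are plain Python functions standing in for code that would, in practice, be sampled from an LLM.

```python
# Minimal sketch of the verification step in LLM-guided program synthesis.
# Candidate programs are kept only if they reproduce all demonstration pairs;
# all names below are illustrative, not from the ARC Prize codebase.

def is_consistent(program, demos):
    """Check a candidate program against every demonstration (input, output) pair."""
    try:
        return all(program(inp) == out for inp, out in demos)
    except Exception:
        return False  # crashing candidates are simply discarded

def select_programs(candidates, demos):
    """Filter sampled candidates down to those matching the demos."""
    return [p for p in candidates if is_consistent(p, demos)]

# Toy task: flip each row of a grid horizontally.
demos = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[5, 6, 7]], [[7, 6, 5]]),
]
candidates = [
    lambda g: g,                          # identity (wrong)
    lambda g: [row[::-1] for row in g],   # horizontal flip (correct)
    lambda g: g[::-1],                    # vertical flip (wrong)
]
survivors = select_programs(candidates, demos)
# Surviving programs can then be applied to the task's test input;
# a transductive model can serve as a fallback when no candidate survives.
```

In a real pipeline the expensive part is sampling thousands of candidates; the verification step shown here is what makes induction self-checking, which is the property the hybrid induction/transduction approaches exploit.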
The report indicates that the strongest entries increasingly combine these strategies, with LLM-based models benefiting significantly from test-time adaptability enhancements.
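The test-time adaptation idea can be illustrated with a stdlib-only toy: clone a shared set of "pretrained" parameters for each task, run a few gradient steps on that task's demonstration pairs, and only then predict on the test input. The linear model and numbers below are illustrative stand-ins for the fine-tuned LLMs used by competitors; nothing here comes from an actual submission.

```python
# Stdlib-only sketch of test-time training (TTT): per-task adaptation of a
# copy of shared pretrained parameters, using the task's demonstration pairs.
# The linear model is a deliberately tiny stand-in for a fine-tuned LLM.

def predict(params, x):
    w, b = params
    return w * x + b

def tt_train(pretrained, demos, lr=0.05, steps=500):
    """Adapt a copy of the pretrained parameters to one task's demos."""
    w, b = pretrained  # floats are copied; the shared model stays untouched
    for _ in range(steps):
        # gradient of mean squared error over the demonstration pairs
        gw = sum(2 * (w * x + b - y) * x for x, y in demos) / len(demos)
        gb = sum(2 * (w * x + b - y) for x, y in demos) / len(demos)
        w -= lr * gw
        b -= lr * gb
    return (w, b)

# Generic pretrained model: roughly the identity mapping.
pretrained = (1.0, 0.0)
# Task-specific rule revealed only by the demos: y = 3x + 1.
demos = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0)]
adapted = tt_train(pretrained, demos)
# `adapted` now predicts close to 10.0 on test input x = 3.0, while
# `pretrained` is unchanged and ready for the next task.
```

The design point TTT exploits is the same one this toy shows: each ARC-AGI task supplies a handful of demonstration pairs at inference time, which is exactly the supervision needed for a short, task-local fine-tune.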
Future Directions and Implications
The evolution of these techniques suggests a trajectory where deep learning-augmented program synthesis and TTT will become prevalent in tackling AGI-oriented challenges, potentially influencing broader AI system design practices. The ARC Prize has fostered significant open-source contributions, encouraging collaboration and transparency in AGI research. Looking forward, the organizers are considering an updated ARC-AGI-2 benchmark to mitigate overfitting risks and improve task diversity, alongside modifications to the competition framework to broaden participation across research groups.
In conclusion, while notable advancements have been made in the ARC Prize 2024, achieving AGI remains an elusive goal. The active engagement and novel methodologies surfacing from this competition form a pivotal foundation for continued research and exploration into achieving true artificial general intelligence.