H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark
In this empirical paper, Solim LeGris, Wai Keen Vong, Brenden M. Lake, and Todd M. Gureckis investigate human performance on the Abstraction and Reasoning Corpus (ARC) benchmark, a challenging dataset designed to test out-of-distribution generalization in both humans and machines. Unlike previous studies that evaluated humans on only subsets or variants of the ARC tasks, this work comprehensively measures human performance on the full set of 400 training and 400 evaluation tasks.
Study Design and Methodology
The study involved 1729 participants recruited via Amazon Mechanical Turk (MTurk), each asked to solve five randomly selected tasks drawn from either the training or the evaluation set. This yielded a dataset of 15,744 attempts on ARC tasks, enriched with detailed action traces and written solution descriptions.
The experimental setup followed the evaluation procedure outlined in the original ARC paper: participants had up to three attempts per task and received only minimal feedback after each attempt. This minimal-feedback regime mirrors many real-world problem-solving settings, where iterative refinement plays a critical role.
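To make this scoring rule concrete, the sketch below shows how an attempt log could be scored under the three-attempt, exact-match criterion described above. It is a minimal illustration, assuming grids are represented as lists of rows of integer color codes; the function names and data layout are not the authors' actual code.

```python
from typing import List

Grid = List[List[int]]  # assumed ARC grid representation: rows of integer color codes


def grids_equal(a: Grid, b: Grid) -> bool:
    """Exact match: identical dimensions and identical color in every cell."""
    return a == b


def task_solved(attempts: List[Grid], target: Grid, max_attempts: int = 3) -> bool:
    """A task counts as solved if any of the first `max_attempts` submitted grids
    exactly matches the hidden target output grid."""
    return any(grids_equal(attempt, target) for attempt in attempts[:max_attempts])


# Example: a participant who succeeds on their second attempt solves the task.
target = [[0, 1], [1, 0]]
submissions = [[[0, 0], [0, 0]], [[0, 1], [1, 0]]]
assert task_solved(submissions, target)
```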
Key Findings
The paper finds that:
- The average human performance on the ARC training set was between 73.3% and 77.2% correct, with an empirical average of 76.2%.
- On the evaluation set, performance was somewhat lower, ranging between 55.9% and 68.9% correct, with an empirical average of 64.2% (one way to put an interval around such task-level averages is sketched after this list).
- Notably, 98.8% of the ARC tasks were solved by at least one participant within three attempts, indicating that nearly all tasks are humanly solvable.
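The bullets above pair empirical averages with lower and upper bounds. The authors' exact estimation procedure is not reproduced here; as one hedged illustration, a bootstrap over tasks is a common way to put an interval around an average task-level solve rate. The per-task rates below are made up for illustration, so the printed numbers will not match the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task solve rates: the fraction of participants who solved each
# of 400 tasks within three attempts (a stand-in for the real H-ARC data).
task_solve_rates = rng.beta(a=4, b=2, size=400)


def bootstrap_mean_ci(values: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile-bootstrap confidence interval for the mean solve rate,
    resampling tasks with replacement."""
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lower, upper


mean, lo, hi = bootstrap_mean_ci(task_solve_rates)
print(f"average solve rate: {mean:.1%} (95% CI: {lo:.1%}-{hi:.1%})")
```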
Performance Comparison with AI Systems
Humans substantially outperform current AI systems on the ARC benchmark. For instance, among the best-performing AI systems reported, Claude-3.5-N and GPT-4o-NS achieved 19.3% and 42.0% accuracy respectively on the evaluation set, well below the 64.2% human average. This stark contrast highlights how challenging ARC tasks remain for current AI systems.
Implications and Future Directions
The findings have several critical implications:
- Cognitive Science: The H-ARC dataset provides rich behavioral data on human problem-solving strategies, which could be valuable for cognitive scientists. Understanding how humans solve ARC tasks could yield insights into the nature of abstract reasoning and cognitive processes.
- AI Development: The gap between human and AI performance on ARC tasks underscores the limitations of current models in abstract reasoning and out-of-distribution generalization. Future AI research could focus on mimicking human problem-solving strategies, particularly the iterative refinement approach observed in human participants.
- Benchmark Evaluation: The comprehensive evaluation of human performance on the full ARC benchmark sets a more accurate baseline for future AI models. AI systems need to demonstrate significant improvements to surpass this established human baseline.
Detailed Analyses
The authors perform several analyses to understand the difficulty of ARC tasks, including:
- Grid Dimension Errors: They observe that humans make fewer errors related to grid dimensions compared to AI models.
- Edit Distance: Both humans and AI show similar edit distances from correct outputs, though the types of errors differ (a minimal sketch of a grid edit-distance metric follows this list).
- Error Divergence and Copy Errors: Humans exhibit a wider range of unique errors, indicating diverse problem-solving approaches, whereas AI errors tend to be less varied.
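As a hedged illustration of the edit-distance analysis referenced above, the sketch below computes a normalized Levenshtein distance over grids serialized row by row. This is one plausible metric choice, not necessarily the one used in the paper.

```python
def grid_to_string(grid) -> str:
    """Serialize a grid (list of rows of ints) row by row, with row separators."""
    return "|".join("".join(str(cell) for cell in row) for row in grid)


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def grid_edit_distance(predicted, target) -> float:
    """Normalized edit distance between a predicted grid and the target grid."""
    p, t = grid_to_string(predicted), grid_to_string(target)
    return levenshtein(p, t) / max(len(p), len(t))


# Example: a prediction that gets one cell wrong is a small edit away from the target.
print(grid_edit_distance([[1, 0], [0, 1]], [[1, 0], [0, 0]]))  # -> 0.2
```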
Conclusion
This paper provides a robust estimate of human performance on the ARC benchmark and highlights the substantial gap between human and AI capabilities on these tasks. The public release of the H-ARC dataset is expected to foster further research, helping the AI community develop models that better approximate human abstract reasoning and problem solving. Through its detailed action traces and solution descriptions, the dataset offers a valuable resource for both cognitive scientists and AI researchers aiming to bridge the gap between human and machine intelligence in complex reasoning tasks.