H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark
In this empirical paper, Solim LeGris, Wai Keen Vong, Brenden M. Lake, and Todd M. Gureckis investigate human performance on the Abstraction and Reasoning Corpus (ARC) benchmark, a challenging dataset designed to test out-of-distribution generalization in both humans and machines. Unlike previous studies that evaluated humans on only subsets or variants of the ARC tasks, this work comprehensively measures human performance on the full set of 400 training and 400 evaluation tasks.
Study Design and Methodology
The study involved 1729 participants recruited via Amazon Mechanical Turk (MTurk), each asked to solve five randomly selected tasks drawn from either the training or the evaluation set. This yielded a dataset of 15,744 attempts on ARC tasks, enriched with detailed action traces and written solution descriptions.
The experimental setup followed the evaluation procedure outlined in the original ARC paper: participants had up to three attempts per task and received only minimal feedback after each attempt. This minimal-feedback regime mirrors many real-world problem-solving settings, where iterative refinement plays a critical role.
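To make this scoring rule concrete, the sketch below shows how an attempt log could be scored under the three-attempt, exact-match criterion described above. It is a minimal illustration, assuming grids are represented as lists of rows of integer color codes; the function names and data layout are not the authors' actual code.

```python
from typing import List

Grid = List[List[int]]  # assumed ARC grid representation: rows of integer color codes


def grids_equal(a: Grid, b: Grid) -> bool:
    """Exact match: identical dimensions and identical color in every cell."""
    return a == b


def task_solved(attempts: List[Grid], target: Grid, max_attempts: int = 3) -> bool:
    """A task counts as solved if any of the first `max_attempts` submitted grids
    exactly matches the hidden target output grid."""
    return any(grids_equal(attempt, target) for attempt in attempts[:max_attempts])


# Example: a participant who succeeds on their second attempt solves the task.
target = [[0, 1], [1, 0]]
submissions = [[[0, 0], [0, 0]], [[0, 1], [1, 0]]]
assert task_solved(submissions, target)
```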
Key Findings
The paper finds that:
- The average human performance on the ARC training set was between 73.3% and 77.2% correct, with an empirical average of 76.2%.
- On the evaluation set, performance was somewhat lower, ranging between 55.9% and 68.9% correct, with an empirical average of 64.2% (one way to put an interval around such task-level averages is sketched after this list).
- Notably, 98.8% of the ARC tasks were solved by at least one participant within three attempts, indicating that nearly all tasks are humanly solvable.
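The bullets above pair empirical averages with lower and upper bounds. The authors' exact estimation procedure is not reproduced here; as one hedged illustration, a bootstrap over tasks is a common way to put an interval around an average task-level solve rate. The per-task rates below are made up for illustration, so the printed numbers will not match the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task solve rates: the fraction of participants who solved each
# of 400 tasks within three attempts (a stand-in for the real H-ARC data).
task_solve_rates = rng.beta(a=4, b=2, size=400)


def bootstrap_mean_ci(values: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile-bootstrap confidence interval for the mean solve rate,
    resampling tasks with replacement."""
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lower, upper


mean, lo, hi = bootstrap_mean_ci(task_solve_rates)
print(f"average solve rate: {mean:.1%} (95% CI: {lo:.1%}-{hi:.1%})")
```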
Performance Comparison with AI Systems
Humans substantially outperform current AI systems on the ARC benchmark. For instance, among the best-performing AI systems reported, Claude-3.5-N and GPT-4o-NS achieved 19.3% and 42.0% accuracy respectively on the evaluation set, well below the 64.2% human average. This stark contrast highlights how challenging ARC tasks remain for current AI systems.
Implications and Future Directions
The findings have several critical implications:
- Cognitive Science: The H-ARC dataset provides rich behavioral data on human problem-solving strategies, which could be valuable for cognitive scientists. Understanding how humans solve ARC tasks could yield insights into the nature of abstract reasoning and cognitive processes.
- AI Development: The gap between human and AI performance on ARC tasks underscores the limitations of current models in abstract reasoning and out-of-distribution generalization. Future AI research could focus on mimicking human problem-solving strategies, particularly the iterative refinement approach observed in human participants.
- Benchmark Evaluation: The comprehensive evaluation of human performance on the full ARC benchmark sets a more accurate baseline for future AI models. AI systems need to demonstrate significant improvements to surpass this established human baseline.
Detailed Analyses
The authors perform several analyses to understand the difficulty of ARC tasks, including:
- Grid Dimension Errors: They observe that humans make fewer errors related to grid dimensions compared to AI models.
- Edit Distance: Both humans and AI show similar edit distances from correct outputs, though the types of errors differ (a minimal sketch of a grid edit-distance metric follows this list).
- Error Divergence and Copy Errors: Humans exhibit a wider range of unique errors, indicating diverse problem-solving approaches, whereas AI errors tend to be less varied.
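As a hedged illustration of the edit-distance analysis referenced above, the sketch below computes a normalized Levenshtein distance over grids serialized row by row. This is one plausible metric choice, not necessarily the one used in the paper.

```python
def grid_to_string(grid) -> str:
    """Serialize a grid (list of rows of ints) row by row, with row separators."""
    return "|".join("".join(str(cell) for cell in row) for row in grid)


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def grid_edit_distance(predicted, target) -> float:
    """Normalized edit distance between a predicted grid and the target grid."""
    p, t = grid_to_string(predicted), grid_to_string(target)
    return levenshtein(p, t) / max(len(p), len(t))


# Example: a prediction that gets one cell wrong is a small edit away from the target.
print(grid_edit_distance([[1, 0], [0, 1]], [[1, 0], [0, 0]]))  # -> 0.2
```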
Conclusion
This paper provides a robust estimate of human performance on the ARC benchmark and highlights the substantial gap between human and AI capabilities on these tasks. The public release of the H-ARC dataset is expected to foster further research, helping the AI community develop models that better approximate human abstract reasoning and problem solving. Through its detailed action traces and solution descriptions, the dataset offers a valuable resource for both cognitive scientists and AI researchers aiming to bridge the gap between human and machine intelligence in complex reasoning tasks.