Extent of human-like abstract reasoning achieved by AI on ARC tasks

Determine the extent to which state-of-the-art AI models that attain high accuracy on ARC tasks (e.g., OpenAI’s o3 reasoning model) exhibit human-like abstract reasoning, as opposed to relying on surface-level patterns or shortcuts.

Background

The authors motivate their study by noting that OpenAI’s o3-preview obtained very high accuracy on ARC-like tasks, raising the question of whether these results reflect genuinely human-like abstract reasoning or reliance on superficial correlations.

To address this, the authors evaluate models on ConceptARC and analyze both output accuracy and the natural-language rules the models produce. While these empirical results provide evidence and insight into how models solve the tasks, the authors explicitly state that the broader question remains open.
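As an illustration only (not the authors' actual pipeline), a minimal sketch of the kind of evaluation described above might score each predicted output grid by exact match and collect the model's stated rule for separate inspection. The `ModelResponse` structure and `evaluate` helper below are hypothetical.

```python
# Hypothetical sketch: exact-match output accuracy on ARC-style grids plus
# collection of the model's stated natural-language rule. Names are illustrative.
from dataclasses import dataclass

Grid = list[list[int]]  # ARC-style grid of color indices


@dataclass
class ModelResponse:
    predicted_grid: Grid  # grid the model outputs for the test input
    stated_rule: str      # natural-language rule the model reports


def exact_match(predicted: Grid, target: Grid) -> bool:
    """Output accuracy on ARC-style tasks is typically exact grid equality."""
    return predicted == target


def evaluate(responses: list[ModelResponse], targets: list[Grid]) -> dict:
    correct = sum(exact_match(r.predicted_grid, t) for r, t in zip(responses, targets))
    return {
        "output_accuracy": correct / len(targets),
        # Rules are kept for qualitative analysis of whether they describe the
        # underlying abstraction or a surface-level shortcut.
        "rules": [r.stated_rule for r in responses],
    }


if __name__ == "__main__":
    target = [[0, 1], [1, 0]]
    responses = [ModelResponse(predicted_grid=[[0, 1], [1, 0]],
                               stated_rule="Swap the two colors along the diagonal.")]
    print(evaluate(responses, [target]))
```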

References

Despite the high accuracy of o3 on ARC tasks, it is not clear to what extent AI systems have achieved human-like abstract reasoning abilities.

Do AI Models Perform Human-like Abstract Reasoning Across Modalities? (Beger et al., arXiv:2510.02125, 2 Oct 2025), Section 1, Introduction