Abstraction and Reasoning Corpus (ARC)

Updated 30 June 2025

The Abstraction and Reasoning Corpus (ARC) is an artificial intelligence benchmark designed to evaluate systems on their ability to perform human-level abstraction, reasoning, and generalization. Introduced by François Chollet in 2019, ARC consists of tasks requiring solvers to infer transformation rules from a very limited set of input-output examples, then generalize these rules to novel test inputs. The construction, evaluation, and study of ARC sit at the intersection of program induction, cognitive science, symbolic and neural methods, and human-AI comparative analysis.

1. Foundation and Objectives

ARC is structured to probe the "core knowledge" capacities underlying human intelligence—including concepts of objects, goal-directedness, numbers and counting, and basic geometry and topology. Each ARC task provides a few (often 2–5) input-output grid pairs (each grid a small 2-D array of colored cells encoded as integers), and one or more test inputs that must be mapped to their correct outputs. The underlying rules range from simple (e.g., coloring or moving objects) to highly compositional, demanding analogical or relational reasoning. Unlike most AI benchmarks, ARC provides only minimal supervision per task and is "developer-aware," with its test set curated to require genuine rule induction rather than overfitting or instance memorization.
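
To make the task format concrete, the public ARC repository distributes each task as a JSON file with a "train" list of demonstration pairs and a "test" list, every grid encoded as a 2-D list of integers 0–9 (each integer denoting a color). The minimal sketch below simply loads and inspects such a file; the file name is a placeholder.

```python
import json

# Load one ARC task; "arc_task_example.json" is a placeholder path.
with open("arc_task_example.json") as f:
    task = json.load(f)

# Each demonstration pair holds an "input" and an "output" grid.
for pair in task["train"]:
    inp, out = pair["input"], pair["output"]
    print(f"train: {len(inp)}x{len(inp[0])} -> {len(out)}x{len(out[0])}")

# Test inputs must be mapped to their (held-out) outputs.
for pair in task["test"]:
    inp = pair["input"]
    print(f"test input: {len(inp)}x{len(inp[0])}")
```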

ARC was designed to challenge current machine learning systems, particularly their ability to learn abstract concepts from few examples and to generalize outside the training distribution. While humans routinely solve such inductive reasoning problems, most AI systems perform at or above human level only on benchmarks with large training datasets or narrow domains.

2. Human Performance and Behavioral Insights

Behavioral studies have revealed that humans consistently outperform machines on ARC tasks. In a study sampling 40 ARC tasks, participants achieved an average accuracy of 83.8% per task, with most tasks solved by over 80% of participants and even the most difficult tasks solvable by at least one person (mean solve time ≈3 minutes per task) (Johnson et al., 2021). Participants' action logs revealed both consistent strategic motifs (such as resizing or copying grids) and substantial variability, especially as task complexity increased; diversity in solution paths, as measured by the Levenshtein edit distance between action sequences, negatively correlated with accuracy.
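
The diversity measure mentioned above is the Levenshtein (edit) distance over participants' action sequences. A minimal sketch of that computation follows; the action names and the two example sequences are invented for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two action sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical action logs from two participants solving the same task.
seq_a = ["resize_grid", "copy_object", "recolor", "submit"]
seq_b = ["copy_object", "recolor", "recolor", "submit"]
print(levenshtein(seq_a, seq_b))  # -> 2
```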

Human participants converged on consistent natural language to describe task transformations, with low naming divergence (unique-to-total word ratio) within tasks. These descriptions clustered into categories such as color, object, geometric relations, and transformation types, often using metaphor or abstraction (e.g., "tails" or "flowers"). Longer textual descriptions generally predicted harder tasks and lower accuracy, indicating that tasks expressing "simple" concepts were more readily solved. Human errors tended to preserve structural object properties—contrasting with machine errors, which often violated objectness or task semantics.
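
The naming-divergence measure is simply the ratio of unique words to total words across participants' descriptions of a task. A toy computation, using invented descriptions, is shown below.

```python
# Three hypothetical participant descriptions of the same task.
descriptions = [
    "copy the blue square into each corner",
    "put the blue square in every corner",
    "place a blue square at all four corners",
]
words = " ".join(descriptions).lower().split()
divergence = len(set(words)) / len(words)  # unique-to-total word ratio
print(round(divergence, 2))  # ~0.73 for these three descriptions
```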

These findings indicate that human performance in ARC is characterized by rapid abstraction, compositional reasoning, reliance on robust object-centric priors, and integration of procedural/linguistic representations—offering actionable targets for AI modeling.

3. Language, Communication, and Natural Programs

Language plays a pivotal role in human ARC problem solving. In the Language-complete ARC (LARC), participants solved ARC tasks by communicating only through natural language. Human utterances not only mirrored procedural constructs found in code (such as conditionals, object manipulation, loops), but also went beyond executable instructions: they contained framing, validation, meta-communication, and the use of diverse conceptual primitives that exceeded typical domain-specific language (DSL) coverage (Acquaviva et al., 2021).

Instructions were annotated with a taxonomy of 17 concept categories, including procedure, spatial reference, validation, loop, and object detection. While procedural content was frequent, framing and validation were equally present—contrasting with the sparse commentary seen in code. LARC demonstrated that, for at least 88% of tasks, purely language-based instructions were sufficient for human-to-human transmission of ARC solutions.

This analysis highlights a "language gap": humans use natural, open-ended language to communicate procedural knowledge, whereas conventional AI systems rely on rigid, closed DSLs. LARC data suggest that program induction for ARC will benefit from grounding symbolic synthesis in natural language, enabling procedural and meta-communicative flexibility.

4. Symbolic, Neuro-symbolic, and Statistical Approaches

Multiple research threads address ARC’s challenge by proposing symbolic, neuro-symbolic, or connectionist methods:

  • Symbolic Program Synthesis: Early approaches, including top Kaggle solutions and the ARGA system, model ARC tasks as program synthesis over DSLs with hand-engineered primitives (rotations, coloring, compositional graph abstractions) (Xu et al., 2022). Object- and relation-centric representations, especially graph abstractions, provide efficiency by enabling constraint acquisition, state hashing, and Tabu search. By pruning the search space using constraints extracted from training pairs (e.g., "positionUnchanged" predicates), ARGA outperforms pixel-level or brute-force search, solving 57/160 object-centric tasks with orders of magnitude fewer search nodes than baseline systems. A toy sketch of this DSL-search pattern appears after this list.
  • Minimum Description Length (MDL) Modeling: Descriptive grid models guided by the MDL principle favor solutions that yield the shortest possible description of the data (lossless compression), supporting both explainability and search efficiency (Ferré, 2021). MDL-driven models achieve incremental improvement while producing interpretable, symbolic explanations of input-output relations.
  • Neuro-symbolic Systems: Hybrid approaches such as DreamCoder use neural networks as proposal distributions (e.g., to guide program search) and extend their DSLs by learning new abstractions (compression) (Alford et al., 2021; Bober-Irizar et al., 5 Feb 2024). The NSA (Neuro-symbolic ARC Challenge) system uses a transformer to propose promising DSL primitives, dramatically narrowing the symbolic search space and achieving a 27% gain over prior state of the art on the ARC evaluation set (Batorski et al., 8 Jan 2025).
  • Generalized Planning: GPAR expresses ARC in the Planning Domain Definition Language (PDDL), representing images as graphs with object/attribute predicates, and synthesizes planning programs with pointer variables (Lei et al., 15 Jan 2024). Domain knowledge (object types, abstraction selection, constraint-based pruning) enables tractable planning program induction, yielding leading test accuracy on object-centric ARC subsets.
  • Inductive Logic Programming (ILP): ILP-based solvers synthesize human-readable logic programs from object-centric DSLs, offering sample efficiency and transparency in the induction of generalizing rules, though they are limited by the breadth of their manually defined primitives (Rocha et al., 10 May 2024).
  • Connectionist Methods: VAE-based approaches attempt to solve analogy-based ARC tasks by mapping input/output grids into latent vector spaces, using vector arithmetic to synthesize novel outputs; these methods excel on "easy" analogies but do not capture the compositional reasoning required for more complex ARC items (Thoms et al., 2023).
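
The sketch below illustrates the generic DSL program-search pattern referenced in the first bullet above: enumerate compositions of hand-engineered grid primitives and keep any program consistent with all demonstration pairs. It is a deliberately tiny toy under invented primitives, not the DSL or search procedure of ARGA or any other cited system.

```python
from itertools import product
import numpy as np

# Illustrative, hand-picked primitives; real systems use far richer,
# object- and graph-centric DSLs with constraint-based pruning.
PRIMITIVES = {
    "identity":  lambda g: g,
    "rot90":     lambda g: np.rot90(g),
    "flip_h":    lambda g: np.fliplr(g),
    "flip_v":    lambda g: np.flipud(g),
    "transpose": lambda g: g.T,
}

def search_program(train_pairs, max_depth=2):
    """Brute-force search over primitive compositions consistent with all demos."""
    names = list(PRIMITIVES)
    for depth in range(1, max_depth + 1):
        for combo in product(names, repeat=depth):
            def run(grid, combo=combo):
                for name in combo:
                    grid = PRIMITIVES[name](grid)
                return grid
            if all(np.array_equal(run(np.array(i)), np.array(o))
                   for i, o in train_pairs):
                return combo
    return None

# Toy task: the hidden transformation is a horizontal flip.
pairs = [([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
         ([[5, 5], [0, 7]], [[5, 5], [7, 0]])]
print(search_program(pairs))  # ('flip_h',)
```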

5. LLMs, Representations, and Human Comparisons

ARC exposes persistent limitations of LLMs—including GPT-4 and successors—in abstract visual reasoning and generalization:

  • Direct LLM Prompting: Vanilla LLMs prompted with serialized grid pairs perform modestly, with GPT-4 solving only 13/50 of the simplest ARC problems in standard grid-encoding formats (Xu et al., 2023). Performance is highly sensitive to input format and object representation; a sketch of one such serialization follows this list.
  • Object-based Abstraction: When external object-centric abstractions (e.g., via ARGA) are supplied, LLM task accuracy and reasoning quality almost double. Aligning feature representations with human "core knowledge" concepts also brings model errors and reasoning closer to human patterns.
  • Failure Analysis: Comparative studies with children and adults show that LLMs rely on shallow, often combinatorial strategies (e.g., element-wise matrix operations), mirroring young children's fallback behaviors but not adults' abstract rule-based reasoning (Opiełka et al., 13 Mar 2024).
  • Language Conversion Pipelines: Transforming ARC tasks to natural language descriptions and leveraging zero-shot LLMs to generate output descriptions (subsequently decoded into grids) allows solving some previously intractable problems, albeit still well below the top symbolic baselines (Camposampiero et al., 2023).
  • Analogy-guided Data: Efforts such as GIFARC embed explicit human-intuitive analogies into ARC-style datasets, promoting analogy-driven rather than exhaustive pattern search and aligning LLM reasoning steps more closely with human explanations (Sim et al., 27 May 2025).
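
As referenced in the first bullet above, direct prompting requires serializing grids into text. The sketch below shows one plausible encoding (rows of space-separated color digits); the exact formats studied in the cited work differ, and this example is purely illustrative.

```python
def grid_to_text(grid):
    """Serialize a grid as rows of space-separated color digits."""
    return "\n".join(" ".join(str(c) for c in row) for row in grid)

def task_to_prompt(train_pairs, test_input):
    """Assemble a plain-text prompt from demonstration pairs and a test input."""
    parts = []
    for k, (inp, out) in enumerate(train_pairs, 1):
        parts.append(f"Example {k} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {k} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append("Test output:")
    return "\n\n".join(parts)

# Toy demonstration pair plus a test input.
pairs = [([[0, 1], [1, 0]], [[1, 0], [0, 1]])]
print(task_to_prompt(pairs, [[1, 1], [0, 0]]))
```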

6. Program Search, Planning, and Knowledge Augmentation

Ongoing work targets methods to more efficiently search the open-ended ARC solution space, leveraging LLMs as program generators or planners:

  • Concept-based Guidance: Methods such as ConceptSearch introduce scoring functions that evaluate candidate programs based not on superficial pixel similarity (e.g., Hamming distance) but on concept embeddings—using CNNs or LLM-derived hypotheses—to prioritize transformations that capture the underlying logic of the task. This approach doubles solved task counts and improves efficiency by 30% over pixel-based baselines (Singhal et al., 10 Dec 2024); the two scoring signals are contrasted in the sketch after this list.
  • Planning-aided Solvers and Knowledge Ontologies: The Knowledge Augmentation for Abstract Reasoning (KAAR) method augments LLM context with hierarchically organized core priors (objectness, geometry, goal-directedness), enabling stage-wise reasoning and reducing context interference (Lei et al., 23 May 2025). KAAR achieves up to a 65% relative improvement over prompting-only approaches, especially in tasks demanding object movement or counting.
  • Reinforcement Learning Environments: ARCLE provides a Gymnasium-based environment for RL research on ARC, exposing its vast action space, hard-to-reach sparse rewards, and multi-task structure to RL methods such as proximal policy optimization, meta-RL, and GFlowNets (Lee et al., 30 Jul 2024). Auxiliary losses and non-factorial policies improve learning and adaptability.
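
The sketch below contrasts a pixel-level Hamming score with a concept-level similarity, as referenced in the Concept-based Guidance bullet above. The embedding function here is a crude hand-made stand-in for the learned CNN- or LLM-derived embeddings used in the cited system.

```python
import numpy as np

def hamming_score(pred, target):
    """Pixel-level similarity: fraction of matching cells (same-shape grids)."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float((pred == target).mean())

def embed(grid):
    """Stand-in for a learned concept embedding: color histogram plus grid shape."""
    g = np.asarray(grid)
    colors = np.bincount(g.ravel(), minlength=10)
    return np.concatenate([colors / colors.sum(), [g.shape[0], g.shape[1]]])

def concept_score(pred, target):
    """Cosine similarity between concept embeddings of the two grids."""
    a, b = embed(pred), embed(target)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pred   = [[1, 1], [2, 2]]
target = [[2, 2], [1, 1]]
print(hamming_score(pred, target))  # 0.0: no cell matches despite shared structure
print(concept_score(pred, target))  # 1.0: identical color counts and shape
```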

7. Symbolic Abduction, Interpretability, and Core Knowledge

Recent symbolic frameworks explicitly model the abductive reasoning process: input-output grids are abstracted into multi-level knowledge graphs (pixels, objects, grids, tasks), from which the system extracts core knowledge (invariants, key relations) that sharply constrain the solution search (Lim et al., 27 Nov 2024). Abduction is operationalized by identifying which attributes are consistently relevant in all demonstrations, then using DSL-based symbolic search to realize transformations. This method achieves substantial gains in structural prediction (e.g., color set, size), interpretability, and efficiency compared to direct grid transition methods.
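
As a rough illustration of that abduction step, the sketch below extracts a few candidate attributes from each demonstration pair and keeps only those that hold in all of them. The real framework operates over multi-level knowledge graphs rather than this flat attribute dictionary, and the attribute names are invented.

```python
import numpy as np

def attributes(inp, out):
    """Extract a few candidate input-output attributes from one demonstration."""
    inp, out = np.asarray(inp), np.asarray(out)
    return {
        "same_shape":  inp.shape == out.shape,
        "same_colors": set(inp.ravel()) == set(out.ravel()),
        "same_size":   inp.size == out.size,
    }

def abduce_invariants(train_pairs):
    """Keep only attributes that are true in every demonstration pair."""
    per_pair = [attributes(i, o) for i, o in train_pairs]
    return {k for k in per_pair[0] if all(d[k] for d in per_pair)}

# Toy task: every demonstration swaps the two rows of the grid.
pairs = [([[1, 0], [0, 1]], [[0, 1], [1, 0]]),
         ([[2, 2], [3, 3]], [[3, 3], [2, 2]])]
print(abduce_invariants(pairs))  # {'same_shape', 'same_colors', 'same_size'}
```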

Conclusion

The Abstraction and Reasoning Corpus serves as a rigorous benchmark for evaluating and advancing artificial intelligence toward human-level abstraction, generalization, and reasoning. Studies of human and machine performance on ARC provide insights into compositionality, language grounding, inductive and abductive reasoning, and the value of object-centric and symbolic representations. Despite significant progress through a spectrum of approaches—symbolic, neuro-symbolic, statistical, and hybrid—ARC remains unsolved in its general form. The challenge continues to inspire advances, including improved neural-guided program synthesis, concept-based search, RL-based reasoning, and explicit modeling of core cognitive priors, with each contributing to a deeper understanding and modeling of human-level intelligence in machines.