Language-complete ARC (LARC)

Updated 23 November 2025
  • Language-complete ARC (LARC) is a hybrid benchmark that pairs ARC tasks with natural language instructions, allowing task solutions to be derived from language alone.
  • LARC employs a describe/build game and bandit-driven allocation to efficiently verify instructions, ensuring reliable and diverse human-authored procedures.
  • Empirical evaluations reveal that while natural language boosts sample efficiency in neural models, challenges like ambiguity and generalization persist in synthesis approaches.

Language-complete ARC (LARC) is an extension of the Abstraction and Reasoning Corpus (ARC) benchmark that augments the original visual puzzle tasks with human-authored natural language instructions—termed "natural programs"—which are sufficient for humans (and potentially AI systems) to solve the tasks from language alone. LARC thus serves as a language-vision hybrid benchmark, enabling the study of how natural language can operationalize cognitive reasoning and procedural generalization in the ARC domain. The LARC initiative exposes both the expressivity of human language in communicating abstract task structure and the limitations of current program synthesis and machine reasoning paradigms when confronted with the diversity and ambiguity of natural instructions.

1. Definition and Construction of LARC

LARC is formally defined as a pairing of ARC tasks with language-complete instructions. For the original collection of 400 ARC tasks, a "natural program" is a plain-English description such that a human, given only the instruction and a novel input grid, can reconstruct the correct output with no further supervision. The LARC dataset comprises all such pairs, denoted as

\mathrm{LARC} = \left\{ (t, i) \mid t \in \mathcal{T},\ i \in I(t) \right\}

where $\mathcal{T}$ is the set of ARC tasks and $I(t)$ is the set of verified language instructions for task $t$ (Acquaviva et al., 2021).
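
The set notation above maps directly onto a simple data structure. The sketch below shows one way a single (t, i) pair might be represented in code; the class and field names are illustrative assumptions, not the schema of the released dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple

Grid = List[List[int]]  # an ARC grid: a 2-D array of colour indices 0-9


@dataclass
class ARCTask:
    """An ARC task t: a few training input/output pairs plus held-out test pairs."""
    task_id: str
    train: List[Tuple[Grid, Grid]]
    test: List[Tuple[Grid, Grid]]


@dataclass
class NaturalProgram:
    """A verified instruction i in I(t): free-form English plus its verification outcome."""
    description: str          # the step-by-step procedure, in plain English
    output_size_hint: str     # expected output dimensions, as stated by the describer
    builder_succeeded: bool   # True if an independent builder reproduced the output


@dataclass
class LARCEntry:
    """One element (t, i) of the LARC set."""
    task: ARCTask
    instruction: NaturalProgram
```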

Language-completeness is empirically established: LARC includes at least one successful instruction for 354 out of 400 tasks (88%), verified through a two-player protocol where one human writes and self-checks an instruction, and another human executes it from scratch. Each instruction includes expected input properties, output dimensions, and a step-by-step procedure, encoded as free-form English (Acquaviva et al., 2021).

2. Distinctive Features of Natural Programs in LARC

Natural programs in LARC diverge fundamentally from fixed DSL-based or code-like representations used in classical program synthesis:

  • Primitive Diversity: Human-authored instructions deploy a much wider array of conceptual primitives than typical DSLs: the ARC DSL contains 103 primitives, while generic program synthesis DSLs use fewer than 30. LARC instructions invoke domain-general control structures ("loop until...," "for each..."), high-level object and spatial reasoning, and a long tail of content words encompassing shape, spatial relations, and algorithmic action (Acquaviva et al., 2021).
  • Communicative and Meta-linguistic Strategies: Instructions interleave executable procedure steps (∼41%) with meta-statements: framing (∼26%), validation (∼17%), and clarification (∼13%). For example, "At the end you should have exactly 4 green cells" is a validation meta-instruction. This meta-linguistic content provides soft constraints, pre-conditions, and error-checking that DSL approaches typically ignore (Acquaviva et al., 2021).

These features highlight that natural programs are highly expressive but also inherently ambiguous compared to unambiguous, compositional DSLs. Human communication leverages context, pragmatic inference, and redundancy—modeled inadequately by current synthesis systems.

3. Methodology for Data Collection and Verification

LARC's construction is based on a structured crowd-sourcing protocol:

  • Describe/Build Game: For each ARC task, a "describer" develops an instruction from the training examples, then self-verifies by applying the instruction to a novel input. If successful, a "builder" (unaware of examples, provided only the instruction and held-out input) attempts to reconstruct the solution using a restricted set of grid-editing primitives.
  • Bandit-driven Allocation: An adaptive best-arm identification algorithm allocates verification attempts where they are most informative, reducing annotation costs to ∼$3.7K versus an estimated ∼$10.8K for naive uniform allocation over the full corpus (Acquaviva et al., 2021). A simplified bandit allocator is sketched after this list.
  • Instruction Tags and Statistics: Instructions are annotated with semantic tags (procedure, framing, validation, etc.), and the full corpus contains 642 unique content words with broad coverage of ARC concepts (Acquaviva et al., 2021).
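
To make the bandit-driven allocation concrete, the sketch below uses a generic UCB-style rule to decide which (task, instruction) candidates receive further builder attempts under a fixed budget. It is a simplified illustration of adaptive allocation rather than the paper's exact best-arm identification procedure, and `simulate_build` is a hypothetical callable standing in for a real builder attempt.

```python
import math

def allocate_verifications(arms, budget, simulate_build):
    """UCB-style allocation of builder attempts over candidate (task, instruction)
    pairs. A simplified illustration, not the paper's exact procedure.

    arms: hashable candidates, e.g. (task_id, instruction) tuples
    budget: total number of builder attempts available
    simulate_build: callable(arm) -> bool, True if a builder reproduces the output
    """
    successes = {a: 0 for a in arms}
    pulls = {a: 0 for a in arms}

    # Initialisation: try every candidate once.
    for a in arms:
        successes[a] += simulate_build(a)
        pulls[a] += 1
        budget -= 1

    # Spend the remaining budget where success looks likely but is still uncertain.
    while budget > 0:
        total = sum(pulls.values())
        best = max(arms, key=lambda a: successes[a] / pulls[a]
                   + math.sqrt(2 * math.log(total) / pulls[a]))
        successes[best] += simulate_build(best)
        pulls[best] += 1
        budget -= 1

    # A task counts as language-complete if any of its instructions was verified.
    return {a: successes[a] > 0 for a in arms}
```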

4. Benchmarking and Empirical Analysis

Comparative evaluation on LARC exposes the limitations of state-of-the-art synthesis techniques and deep learning models when confronted with language-complete benchmarks:

  • Program Synthesis Baselines: Standard "generate-and-check" search over a 103-primitive ARC DSL, even when conditioned on both I/O examples and LARC text, solves at best 12% (22/183) of a stratified subset of ARC tasks; synthesis from natural language alone solves 0.5% (1/183). A toy version of the generate-and-check loop is sketched after this list. These failures are attributed to:
    • Scalability: The enlarged primitive space and combinatorial explosion render brute-force or enumeration approaches intractable.
    • Referential Ambiguity: Meta-statements and flexible referencing in human instructions confound NL-to-code models that assume the instruction literally paraphrases the target program.
    • Lack of IO Supervision: Without input-output checking, code generated from natural language is often invalid or incomplete (Acquaviva et al., 2021).
  • Neural Approaches with Natural Language: Methods such as LatFormer, which integrate geometric priors into attention masks and condition on LARC text via a pretrained T5 encoder, demonstrate improved sample efficiency and success rates on geometric reasoning tasks drawn from LARC, outperforming both neural and symbolic program synthesis baselines for categories such as translation and rotation-based tasks (Atzeni et al., 2023).
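
To make the generate-and-check baseline concrete, the sketch below enumerates compositions over a deliberately tiny grid DSL (five geometric primitives rather than the 103 used in the actual baseline) and returns the first program consistent with every training pair. Even at this scale the search space grows exponentially with program depth, which is the scalability failure noted above.

```python
from itertools import product

import numpy as np

# A toy grid DSL: each primitive maps a grid (2-D numpy array) to a grid.
PRIMITIVES = {
    "identity":  lambda g: g,
    "rot90":     lambda g: np.rot90(g),
    "flip_h":    lambda g: np.fliplr(g),
    "flip_v":    lambda g: np.flipud(g),
    "transpose": lambda g: g.T,
}

def generate_and_check(train_pairs, max_depth=3):
    """Enumerate primitive compositions up to max_depth and return the first
    program (a sequence of primitive names) consistent with all training pairs."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(np.array_equal(program(np.array(x)), np.array(y))
                   for x, y in train_pairs):
                return names
    return None

# Example: a task whose rule is "rotate the grid 90 degrees counter-clockwise".
pairs = [([[1, 0], [0, 0]], [[0, 0], [1, 0]])]
print(generate_and_check(pairs))  # ('rot90',)
```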

A key insight is that the natural language modality in LARC, although beneficial for sample efficiency and concept coverage in neural models, is not by itself sufficient for program synthesis approaches that lack mechanisms for handling ambiguity and learning new primitives from language (Acquaviva et al., 2021, Atzeni et al., 2023).

5. Methodological Advances and Architectural Directions

LARC has motivated multiple methodological advancements:

  • Vision→Language→Vision Pipelines: A fully automated pipeline encodes ARC input/output grids into structured English, applies pre-trained LLMs (e.g., GPT-3 or BLOOM) to generate procedural reasoning steps, and decodes the results back to grids; a toy grid-to-text encoder in this spirit is sketched after this list. This approach can solve previously unsolved ARC tasks but remains below the accuracy of state-of-the-art DSL-based approaches in aggregate, highlighting the gap between learned priors (in LLMs) and hand-crafted task-specific knowledge (Camposampiero et al., 2023).
  • Lattice-Symmetry Attention Mechanisms: Geometric priors can be encoded directly in neural architectures via binary or soft attention masks that implement group actions (translation, rotation, reflection, scaling) on hypercubic lattice domains. The LatFormer model demonstrates that such architectural biases yield 2–3 orders-of-magnitude improvements in sample efficiency on LARC geometric tasks relative to standard Transformers (Atzeni et al., 2023); a simplified mask construction is sketched after this list.
  • Concept Induction and Hybrid Neuro-Symbolic Systems: Proposals for next-generation synthesizers emphasize learning new primitives dynamically from language and examples ("concept induction"), modeling communicative structure in instructions explicitly (e.g., extracting postconditions from validation statements), and combining LLMs with symbolic search and grounding modules (Acquaviva et al., 2021).
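
As a toy illustration of the vision→language encoding step in the first bullet, the sketch below serialises an ARC grid into plain English that could be placed in an LLM prompt. The phrasing and the colour-name mapping are illustrative assumptions, not the exact serialisation used by Camposampiero et al.

```python
# Approximate colour names for ARC's palette indices (an assumption for illustration).
ARC_COLOURS = {0: "black", 1: "blue", 2: "red", 3: "green", 4: "yellow",
               5: "grey", 6: "magenta", 7: "orange", 8: "cyan", 9: "maroon"}

def grid_to_english(grid):
    """Serialise an ARC grid (list of lists of colour indices) into plain English."""
    rows, cols = len(grid), len(grid[0])
    parts = [f"The grid has {rows} rows and {cols} columns."]
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell != 0:  # describe only non-background cells to keep prompts short
                parts.append(f"The cell at row {r}, column {c} is {ARC_COLOURS[cell]}.")
    return " ".join(parts)

print(grid_to_english([[0, 2], [3, 0]]))
# The grid has 2 rows and 2 columns. The cell at row 0, column 1 is red.
# The cell at row 1, column 0 is green.
```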
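
The lattice-symmetry idea in the second bullet can be illustrated with a minimal mask construction: attention over a flattened H×W grid is restricted so that each cell attends only to its image under a fixed translation. This is a simplified sketch of the general mechanism, not LatFormer's actual parameterisation, which covers further group actions and learned soft masks.

```python
import numpy as np

def translation_mask(h, w, dy, dx):
    """Binary attention mask over a flattened h*w grid: position (r, c) may attend
    only to position (r + dy, c + dx), i.e. the mask encodes a fixed translation."""
    n = h * w
    mask = np.zeros((n, n), dtype=bool)
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < h and 0 <= c2 < w:
                mask[r * w + c, r2 * w + c2] = True
    return mask

def masked_attention(scores, mask):
    """Apply the mask to raw attention scores before the softmax. Rows with no
    admissible target (cells translated off the grid) fall back to uniform
    attention in this toy version."""
    masked = np.where(mask, scores, -1e9)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Each cell of a 3x3 grid attends to the cell one step to its right.
mask = translation_mask(3, 3, dy=0, dx=1)
attention = masked_attention(np.random.randn(9, 9), mask)
```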

6. Current Limitations and Outstanding Challenges

Despite the progress enabled by LARC, significant challenges persist:

  • Ambiguity and Grounding: Natural programs, while intuitive for humans, are ambiguous for machines absent shared context or explicit grounding. Meta-linguistic constructs such as validations and clarifications cannot be translated into code with current NL-to-DSL pipelines, and example-based supervision is often necessary to resolve underspecification (Acquaviva et al., 2021).
  • Primitive Scope and Expansion: The need to invent or induce new high-level operations dynamically from language input remains a critical bottleneck; fixed DSLs are incapable of accommodating the full richness of LARC instructions.
  • Generalization Beyond Geometric Priors: Architectures like LatFormer excel on tasks with lattice-symmetry priors, but do not generalize straightforwardly to ARC problems involving non-geometric priors such as color reasoning, object permanence, or complex algorithmic manipulation (Atzeni et al., 2023).
  • Vision-Language Alignment: In language-centric pipelines, hand-crafted or heuristic vision modules are brittle. LLMs, in zero-shot settings, may hallucinate objects, misparse spatial descriptors, or fail at arithmetic reasoning beyond simple counts (Camposampiero et al., 2023).

7. Future Directions

The empirical and conceptual findings in LARC research delineate concrete priorities for advancing AI reasoning with natural language:

  • Joint Vision-Language Training: Replace hand-built vision pipelines with learned multimodal models (e.g., OFA, Flamingo) to enable robust, end-to-end mapping from pixels to structured linguistic description and back (Camposampiero et al., 2023).
  • Meta-Linguistic and Interactive Modules: Develop frameworks that interpret meta-statements as soft constraints (e.g., validations as test assertions) and enable systems to request clarification or verification dynamically, thereby reducing ambiguity; a minimal example of converting a validation statement into an executable check follows this list.
  • Dynamic DSL Growth: Embrace language-complete, open-domain benchmarks where systems must grow their primitive set in response to language, reflecting the natural learning and instruction process observed in humans (Acquaviva et al., 2021).
  • Broader Knowledge Priors: Extend architectural mechanisms (e.g., soft attention masks) beyond geometric transformations to encompass other "core knowledge" priors implicit in ARC and LARC tasks—counting, recursive grouping, object attribute manipulation, etc. (Atzeni et al., 2023).
  • Fine-tuned LLMs and Prompt Engineering: Investigate the impact of fine-tuning LLMs on corpora of human-annotated ARC textual descriptions, along with adaptive prompt-tuning and retrieval of analogous task explanations (Camposampiero et al., 2023).
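
As a concrete instance of treating validations as test assertions, the sketch below converts the validation statement quoted in Section 2 into an executable post-condition on a candidate output grid. The regex-based parsing and the colour-index mapping are deliberately naive and purely illustrative.

```python
import re

import numpy as np

# Approximate mapping from colour words to ARC palette indices (an illustrative assumption).
COLOUR_IDS = {"black": 0, "blue": 1, "red": 2, "green": 3, "yellow": 4,
              "grey": 5, "magenta": 6, "orange": 7, "cyan": 8, "maroon": 9}

def validation_to_assertion(statement):
    """Turn a validation meta-statement such as
    'At the end you should have exactly 4 green cells'
    into a callable post-condition over the output grid."""
    m = re.search(r"exactly (\d+) (\w+) cells", statement)
    if m is None or m.group(2) not in COLOUR_IDS:
        return None  # statement not understood; fall back to example-based checking
    count, colour = int(m.group(1)), COLOUR_IDS[m.group(2)]
    return lambda grid: int((np.array(grid) == colour).sum()) == count

check = validation_to_assertion("At the end you should have exactly 4 green cells")
print(check([[3, 3, 0], [0, 3, 3]]))  # True: the candidate output has four green cells
```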

LARC sets a new standard for language-vision benchmarking in abstract reasoning and exposes foundational issues in learning, generalization, and communicative competence for both symbolic and neural approaches. Its ongoing study is pivotal for progress toward broadly intelligent systems capable of human-like task learning and instruction-following.
