ARC-Bench: Benchmark for ARC-AGI Tasks
- ARC-Bench is a resampleable benchmarking framework for ARC-AGI tasks that generates diverse task families with latent rules and varying nuisance factors.
- It employs a generator-based formulation via ARC-TGI, where human refinement and episode-level constraints ensure consistent train/test splits and prevent overfitting.
- Evaluation metrics include exact-match accuracy, macro-averaging across families, and sample-efficiency curves, enabling robust controlled generalization studies.
ARC-Bench denotes a benchmark construction paradigm for ARC-AGI-style abstraction and reasoning tasks in which evaluation is performed on resampleable task families rather than on a static set of hand-authored puzzles. In the ARC-TGI framework, ARC-Bench is defined as a rigorous, resampleable ARC benchmarking suite built from human-validated task generators, each of which samples diverse train/test episodes that preserve a latent rule while varying nuisance factors such as grid size, palette, and distractors (Lehmann et al., 5 Mar 2026). The benchmark is motivated by longstanding limitations of static ARC collections—overfitting, dataset leakage, memorization of specific instances, and poor support for controlled generalization studies—and remains anchored in the original ARC-AGI problem setting, where a system must infer a transformation from a small set of demonstrations and produce exact-match outputs on novel test inputs (Lehmann et al., 5 Mar 2026, Chollet et al., 2024).
1. ARC-AGI context and the motivation for ARC-Bench
ARC-Bench is intelligible only in relation to ARC-AGI, the Abstraction and Reasoning Corpus benchmark for generalization on novel visual grid tasks. In ARC-AGI, each task provides a set of demonstration pairs $D=\{(x_i,y_i)\}_{i=1}^{n}$, with a median of three examples, together with one or more test inputs $X_{\text{test}}=\{x_j^{*}\}$. Each grid is a rectangular array $g\in\{0,\dots,9\}^{h\times w}$ with $1\le h\le 30$ and $1\le w\le 30$. The objective is to infer a transformation $f$ such that $f(x_i)=y_i$ for all demonstrations and then apply $f$ to each test input; a task is counted as solved only when every predicted test output exactly matches the ground truth (Chollet et al., 2024). In this literature, ARC-AGI is explicitly distinguished from AI2’s unrelated ARC science-exam dataset (Chollet et al., 2024).
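The all-or-nothing scoring rule can be made concrete with a minimal sketch. It assumes grids are stored as nested lists of integers and uses the standard ARC limits of at most 30 cells per side and ten colors; the helper names `is_valid_grid` and `task_solved` are illustrative and do not come from any ARC codebase.

```python
from typing import List

Grid = List[List[int]]  # rows of color indices in 0..9


def is_valid_grid(grid: Grid) -> bool:
    """Check the assumed ARC limits: 1..30 cells per side, colors 0..9."""
    if not (1 <= len(grid) <= 30):
        return False
    width = len(grid[0])
    if not (1 <= width <= 30):
        return False
    return all(
        len(row) == width and all(0 <= c <= 9 for c in row) for row in grid
    )


def task_solved(predictions: List[Grid], targets: List[Grid]) -> bool:
    """A task counts as solved only if every test output matches exactly."""
    return len(predictions) == len(targets) and all(
        p == t for p, t in zip(predictions, targets)
    )
```

Under this rule, a prediction that gets one of two test grids right still scores zero for the task, which is why per-cell or partial-credit scores are not comparable to ARC-AGI accuracy.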
The original benchmark design emphasizes novelty: each ARC-AGI task follows different logic, the tasks are human-authored for diversity, and the intended priors are restricted to human Core Knowledge such as objectness, basic topology, and elementary arithmetic (Chollet et al., 2024). Human performance is correspondingly high: two individuals originally scored $97\%$ and $98\%$ on the private set and together solved all 100 tasks, while a 2024 NYU study reported that $99\%$ of public evaluation tasks were solved by at least one Mechanical Turk worker when 10 workers were assigned per task (Chollet et al., 2024).
ARC-Bench arises from the observation that static ARC(-AGI) collections make progress hard to measure. The ARC-TGI paper identifies three recurring issues: overfitting and dataset leakage on small, fixed puzzles; memorization of specific instances instead of inductive rule learning; and the inability to perform controlled experiments that vary one nuisance factor while preserving the underlying rule (Lehmann et al., 5 Mar 2026). A central premise of ARC-Bench is therefore that an ARC task should be treated not as a single puzzle instance but as a task family: a distribution over solvable episodes sharing a latent rule.
2. Generator-based formulation in ARC-TGI
The technical substrate of ARC-Bench is ARC-TGI, the ARC Task Generators Inventory. ARC-TGI turns each ARC(-AGI) task into a compact Python generator that subclasses an abstract ARCTaskGenerator and implements three fixed-signature methods: create_input(self, taskvars, gridvars) → np.ndarray, which samples an input grid while randomizing nuisance factors; transform_input(self, grid, taskvars) → np.ndarray, which deterministically applies the latent rule to produce the output; and create_grids(self) → (taskvars: Dict, train_test_data), which assembles a complete episode and enforces task-level constraints across examples (Lehmann et al., 5 Mar 2026).
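The three-method contract can be illustrated with a toy family. The sketch below is an assumption-laden stand-in: the `ARCTaskGenerator` base class is re-declared here only so the example runs on its own, the `RecolorGenerator` family is invented for illustration and is not one of the released generators, and the `{"train": [...], "test": [...]}` layout of `train_test_data` is assumed.

```python
import random
from abc import ABC, abstractmethod
from typing import Dict, Tuple

import numpy as np


class ARCTaskGenerator(ABC):
    """Hypothetical stand-in for the abstract base class named in the text."""

    @abstractmethod
    def create_input(self, taskvars: Dict, gridvars: Dict) -> np.ndarray: ...

    @abstractmethod
    def transform_input(self, grid: np.ndarray, taskvars: Dict) -> np.ndarray: ...

    @abstractmethod
    def create_grids(self) -> Tuple[Dict, Dict]: ...


class RecolorGenerator(ARCTaskGenerator):
    """Toy family (not from the paper): recolor every `src` cell to `dst`."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def create_input(self, taskvars, gridvars):
        # Nuisance factors (size, number and position of cells) vary per grid.
        h, w = gridvars["height"], gridvars["width"]
        grid = np.zeros((h, w), dtype=int)
        cells = [(r, c) for r in range(h) for c in range(w)]
        for r, c in self.rng.sample(cells, gridvars["n_cells"]):
            grid[r, c] = taskvars["src"]
        return grid

    def transform_input(self, grid, taskvars):
        # The latent rule is deterministic given the task-level variables.
        out = grid.copy()
        out[grid == taskvars["src"]] = taskvars["dst"]
        return out

    def create_grids(self):
        # Task-level variables are sampled once and shared by all examples.
        src, dst = self.rng.sample(range(1, 10), 2)
        taskvars = {"src": src, "dst": dst}

        def pair():
            gridvars = {
                "height": self.rng.randint(3, 8),
                "width": self.rng.randint(3, 8),
                "n_cells": self.rng.randint(1, 5),
            }
            x = self.create_input(taskvars, gridvars)
            return {"input": x, "output": self.transform_input(x, taskvars)}

        return taskvars, {"train": [pair() for _ in range(3)], "test": [pair()]}
```

Calling `RecolorGenerator(seed=1).create_grids()` twice on fresh instances yields the same episode, which mirrors the split between episode-level `taskvars` and per-grid `gridvars` described above.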
Each sampled episode is exported as a solver-facing bundle. That bundle contains natural-language “input reasoning chain” and “transformation reasoning chain” templates instantiated from taskvars and gridvars; partially evaluated Python code for input sampling, transformation, and episode construction, with sampled variables inlined; and the train/test grids in ARC-JSON form (Lehmann et al., 5 Mar 2026). The framework also includes optional helper libraries for consistent input construction, such as connected components, coloring, and densities, and solver-facing transformation primitives such as GridObject(s), find_connected_objects, and geometric operations (Lehmann et al., 5 Mar 2026).
The formalization used for ARC-Bench makes the task-family perspective explicit. A generator is written as $G(\tau;\theta)$, where $\tau$ encodes task-level parameters controlled by hyperparameters $\theta$, such as palette-size ranges, grid-size ranges, and allowed symmetries. A latent rule $r$ determines a deterministic mapping $f_r:\mathcal{X}\to\mathcal{Y}$ between input and output grid spaces. An episode sampled from $G$ is
$$E=\{(x_i,y_i)\}_{i=1}^{n}\;\cup\;\{(x_j^{*},y_j^{*})\}_{j=1}^{m},$$
with task-level constraints $C$ ensuring that training examples collectively expose the variations needed to infer $r$ and that test examples do not introduce unseen features. Formally, $y=f_r(x)$ for all $(x,y)\in E$, and $C(E)$ couples examples across the episode to guarantee solvability (Lehmann et al., 5 Mar 2026).
The canonical constraint formulation prioritizes coverage of salient attributes and train–test consistency:
$$C(E)=\Big[\bigcup_{j=1}^{m}\phi(x_j^{*})\subseteq\bigcup_{i=1}^{n}\phi(x_i)\Big]\;\wedge\;\mathrm{Cov}\big(\{x_i\}_{i=1}^{n}\big),$$
where $\phi$ extracts salient attributes such as colors, shapes, positions, and sizes, and $\mathrm{Cov}$ enforces variation sufficient to disambiguate the rule (Lehmann et al., 5 Mar 2026). Reproducibility is handled by RNG seeding with a seed $s$ in the generator call, with $\tau\sim p(\tau\mid\theta,s)$ and grid-level randomness similarly conditioned on $s$; the framework’s create_task wrapper captures the sampled task-level and grid-level variables into the exported witness program, enabling deterministic regeneration and verification (Lehmann et al., 5 Mar 2026).
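A seeded rejection sampler makes the constraint concrete. This is a sketch under simplifying assumptions: the attribute extractor looks only at non-background colors, the coverage condition is reduced to a minimum number of distinct training colors, and the random 3x3 grids stand in for a real generator. None of the function names come from ARC-TGI.

```python
import random
from typing import Dict, List, Set

Grid = List[List[int]]
Episode = Dict[str, List[Grid]]  # {"train": [inputs], "test": [inputs]}


def colors(grid: Grid) -> Set[int]:
    """Attribute extractor: here only the set of non-background colors."""
    return {c for row in grid for c in row if c != 0}


def satisfies_constraints(ep: Episode, min_train_colors: int = 2) -> bool:
    train_attrs = set().union(*(colors(g) for g in ep["train"]))
    test_attrs = set().union(*(colors(g) for g in ep["test"]))
    consistent = test_attrs <= train_attrs          # no unseen colors at test
    covered = len(train_attrs) >= min_train_colors  # enough variation in train
    return consistent and covered


def sample_episode(seed: int, max_tries: int = 1000) -> Episode:
    """Resample whole episodes until the coupled constraint holds."""
    rng = random.Random(seed)  # one seed fixes the entire episode
    for _ in range(max_tries):
        ep = {
            split: [
                [[rng.randint(0, 4) for _ in range(3)] for _ in range(3)]
                for _ in range(n)
            ]
            for split, n in (("train", 3), ("test", 1))
        }
        if satisfies_constraints(ep):
            return ep
    raise RuntimeError("no valid episode found within max_tries")
```

The point of the design is that the accept/reject decision is taken on the episode as a whole, so a test grid is never judged in isolation from the demonstrations that must explain it.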
3. Episode-level constraints, human refinement, and local verification
A defining feature of ARC-Bench is its treatment of ARC episodes as designed sets of examples rather than as independent samples. The ARC-TGI paper argues that independent per-example sampling often fails because it can omit critical variation, include test-only cues, or degenerate to trivial shortcuts such as identity or constant outputs (Lehmann et al., 5 Mar 2026). ARC-Bench therefore elevates create_grids to a first-class stage for cross-example coupling.
The enforced constraint types are explicit. They include train–test consistency, such as “no unseen colors or shape classes at test”; input-construction constraints, such as requiring objects to be connected under 4-/8-connectivity and bounding boxes to fit target placements; disambiguating coverage, such as requiring at least two sizes or multiple positions or orientations across training pairs; and rejection sampling until the episode-level constraint $C(E)$ is satisfied, supplemented by framework-level invariants and shortcut screening (Lehmann et al., 5 Mar 2026). This design directly addresses a common misconception that a large number of randomly generated ARC-like instances is sufficient for faithful benchmarking; the ARC-TGI formulation instead treats cross-example disambiguation as a benchmark requirement.
Human refinement is another central element. Generator authoring is described as human-in-the-loop: contributors analyze each task, identify taskvars and gridvars, author natural-language reasoning templates and episode-level constraints, draft generator code either with LLM assistance or manually, and iteratively resample and visualize outputs until grids and reasoning traces remain correct and natural under variation (Lehmann et al., 5 Mar 2026). The stated purpose is to avoid subtle rule violations and misaligned explanations.
The resulting exports are self-verifying. The inlined transformation program must reproduce stored train/test outputs exactly; invariant checks verify well-formed grids and declared restrictions such as “no unseen colors at test”; and optional shortcut screening filters identity or constant-output degeneracies unless they are intended by the latent rule (Lehmann et al., 5 Mar 2026).
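The three checks can be sketched as a single verification pass. The code assumes an export is a dictionary with `train` and `test` lists of input/output pairs and that the witness program is available as a plain Python callable; `verify_export` and its message strings are illustrative names, not the framework's API.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]
Pair = Dict[str, Grid]           # {"input": grid, "output": grid}
Episode = Dict[str, List[Pair]]  # {"train": [...], "test": [...]}


def palette(grid: Grid) -> set:
    return {c for row in grid for c in row}


def verify_export(transform: Callable[[Grid], Grid], episode: Episode,
                  allow_shortcuts: bool = False) -> List[str]:
    """Return a list of problems; an empty list means the export passes."""
    problems: List[str] = []
    pairs = episode["train"] + episode["test"]

    # 1. The witness program must reproduce every stored output exactly.
    for k, pair in enumerate(pairs):
        if transform(pair["input"]) != pair["output"]:
            problems.append(f"pair {k}: stored output not reproduced")

    # 2. Declared invariant: no unseen colors at test.
    train_colors = set().union(*(palette(p["input"]) for p in episode["train"]))
    for k, pair in enumerate(episode["test"]):
        unseen = palette(pair["input"]) - train_colors
        if unseen:
            problems.append(f"test {k}: unseen colors {sorted(unseen)}")

    # 3. Optional shortcut screening for identity or constant outputs.
    if not allow_shortcuts:
        if all(p["input"] == p["output"] for p in pairs):
            problems.append("shortcut: identity mapping")
        if len(pairs) > 1 and all(p["output"] == pairs[0]["output"] for p in pairs):
            problems.append("shortcut: constant output")
    return problems
```

The `allow_shortcuts` flag corresponds to the case where an identity or constant output is what the latent rule actually prescribes, so the screen should be skipped rather than treated as a failure.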
The paper’s concrete examples illustrate why these mechanisms matter. In the within-family generalization example for stacked colored segments (“Taskbeb8660c”), taskvars are fixed per episode and include the vertical stacking direction, the horizontal alignment, a palette, and object size roles; gridvars vary per example and include the number of segments, segment lengths, colors, and positions before transformation. The transformation chain specifies stacking all segments along the chosen edge, aligning the stack to the specified side, and preserving segment colors, lengths, and orientation. The rationale for the constraint is explicit: if all training segments share the same length or color, the alignment/stacking rule is ambiguous, so the episode-level constraint enforces diversity over lengths and colors and forbids unseen colors at test (Lehmann et al., 5 Mar 2026). A second example (“Task3befdf3e”) enforces that training include at least one object of each size class when the transformation differs by size; otherwise the test episode could become unsolvable from the demonstrations (Lehmann et al., 5 Mar 2026).
4. Benchmark construction protocols, splits, and evaluation metrics
ARC-Bench is not a single immutable dataset but a benchmark construction protocol over a set of generators. The recommended procedure begins with family selection: define a suite $\mathcal{F}$ of generators aligned to target splits, for example ARC-AGI-1 train families for in-distribution evaluation and ARC-AGI-2 families for out-of-distribution generalization. To avoid leakage in generalization studies, train and test splits should use disjoint family sets (Lehmann et al., 5 Mar 2026).
The recommended sampling budget is $N$ episodes per family, with the paper’s practical default set to $N=50$ (the ARC-TGI-50N setting) after a small sweep. Three split regimes are then described. In the ID split, for each $F\in\mathcal{F}$, one samples $N$ episodes and partitions them into train/test per family at a fixed ratio. In the OOD split, one holds out $\mathcal{F}_{\text{held-out}}$ as disjoint generators and samples $N$ test episodes per held-out family. In the cross-benchmark split, one evaluates on public ARC-AGI-1 eval after training on ARC-TGI families (Lehmann et al., 5 Mar 2026). Practical guidance names two canonical suite constructions: ARC-Bench-ID, defined by selecting $\mathcal{F}$ from ARC-AGI-1 train families, sampling $N$ episodes per family, splitting them into train/test, and reporting MacroAcc, per-family $\mathrm{Acc}_F$, and sample-efficiency curves $\mathrm{Acc}(k)$; and ARC-Bench-OOD, defined by fine-tuning on ARC-TGI families $\mathcal{F}_{\text{train}}$ and evaluating on held-out families $\mathcal{F}_{\text{held-out}}$ from ARC-AGI-2 together with ARC-AGI-1 eval (Lehmann et al., 5 Mar 2026).
Exact reproducibility is treated as part of the benchmark definition. The protocol fixes a suite seed $s$, per-family seeds $s_F$, and per-episode seeds $s_{F,e}$ so that resampling is deterministic across implementations. The reporting recommendations further require publication of the family list $\mathcal{F}$, the budget $N$, the seeds, code versions, and any filtering criteria such as token-context limits (Lehmann et al., 5 Mar 2026).
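One way to obtain such a seed hierarchy is to derive every seed from the suite seed by hashing. The source does not say how per-family and per-episode seeds are derived, so the SHA-256 scheme below is purely an assumption, and `derive_seed` and `episode_seeds` are hypothetical helper names.

```python
import hashlib
import random
from typing import List


def derive_seed(*parts: object) -> int:
    """Map a tuple of identifiers to a stable 64-bit integer seed."""
    key = "/".join(str(p) for p in parts).encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def episode_seeds(suite_seed: int, family: str, n_episodes: int) -> List[int]:
    """Suite seed -> per-family seed -> one seed per episode."""
    family_seed = derive_seed(suite_seed, family)
    return [derive_seed(family_seed, i) for i in range(n_episodes)]


# Usage: each episode gets its own RNG, independent of sampling order.
seeds = episode_seeds(suite_seed=0, family="family_a", n_episodes=3)
rngs = [random.Random(s) for s in seeds]
```

A cryptographic hash is used instead of Python's built-in `hash` because string hashing is salted per process by default, which would break regeneration across runs and machines.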
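The primary evaluation metrics are family-wise exact-match accuracy and macro-averaging across families. For a family $F$ with evaluated episodes $\mathcal{E}_F$,
$$\mathrm{Acc}_F=\frac{1}{|\mathcal{E}_F|}\sum_{e\in\mathcal{E}_F}\mathbb{1}\big[\hat{Y}_e=Y_e\big],$$
where $\hat{Y}_e$ and $Y_e$ are the predicted and ground-truth test outputs of episode $e$, and the indicator equals 1 only when every test grid matches exactly. The macro-average is
$$\mathrm{MacroAcc}=\frac{1}{|\mathcal{F}|}\sum_{F\in\mathcal{F}}\mathrm{Acc}_F.$$
Generalization to held-out families is measured by
$$\mathrm{Acc}_{\text{held-out}}=\frac{1}{|\mathcal{F}_{\text{held-out}}|}\sum_{F\in\mathcal{F}_{\text{held-out}}}\mathrm{Acc}_F,$$
with $\mathcal{F}_{\text{held-out}}$ disjoint from $\mathcal{F}_{\text{train}}$. Sample efficiency is reported through curves $\mathrm{Acc}(k)$ versus the number $k$ of training demonstrations per episode, aggregated across families to study few-shot induction (Lehmann et al., 5 Mar 2026). The paper also describes an optional reasoning-chain consistency metric, defined as an alignment rate between generated chains and solver predictions, for example the fraction of steps in the transformation chain whose predicates or object references are validated by the solver’s predicted program or inferred intermediate states (Lehmann et al., 5 Mar 2026).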
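The metrics reduce to a few lines once each episode has been scored as solved or unsolved. The sketch below assumes results are stored as lists of booleans per family and, for the curves, as `(family, k, solved)` records; these data layouts and function names are illustrative choices, not the paper's.

```python
from statistics import mean
from typing import Dict, List, Tuple


def family_accuracy(solved: List[bool]) -> float:
    """Fraction of a family's episodes solved by exact match."""
    return sum(solved) / len(solved)


def macro_accuracy(per_family: Dict[str, List[bool]]) -> float:
    """Unweighted mean of family accuracies: each family counts once."""
    return mean(family_accuracy(s) for s in per_family.values())


def sample_efficiency(records: List[Tuple[str, int, bool]]) -> Dict[int, float]:
    """Macro accuracy as a function of k, the number of demonstrations."""
    by_k: Dict[int, Dict[str, List[bool]]] = {}
    for family, k, solved in records:
        by_k.setdefault(k, {}).setdefault(family, []).append(solved)
    return {k: macro_accuracy(fams) for k, fams in sorted(by_k.items())}


results = {"A": [True, True, False, False], "B": [True]}
print(macro_accuracy(results))  # 0.75, whereas pooling all episodes gives 3/5
```

The small example shows why macro-averaging matters: pooling episodes would let a heavily sampled family dominate the score, while the macro-average weights every family equally. Held-out accuracy is the same computation restricted to the held-out families.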
Leakage control is formalized rather than left implicit. When measuring across-family generalization, episodes from the same generator should not be mixed across train and test. When measuring within-family generalization, the study should be labeled separately and use disjoint episode seeds. In all cases, generators’ constraints and invariants should be verified to prevent test-only features (Lehmann et al., 5 Mar 2026).
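Both leakage rules can be enforced mechanically before any model is run. The sketch assumes each episode is identified by a `(family, episode_seed)` pair; that identifier scheme and the two function names are assumptions made for illustration.

```python
from typing import Iterable, Tuple

EpisodeId = Tuple[str, int]  # (family name, episode seed)


def check_across_family(train: Iterable[EpisodeId],
                        test: Iterable[EpisodeId]) -> None:
    """Across-family studies: no generator may appear on both sides."""
    shared = {f for f, _ in train} & {f for f, _ in test}
    if shared:
        raise ValueError(f"families in both splits: {sorted(shared)}")


def check_within_family(train: Iterable[EpisodeId],
                        test: Iterable[EpisodeId]) -> None:
    """Within-family studies: families may repeat, episode seeds may not."""
    shared = set(train) & set(test)
    if shared:
        raise ValueError(f"episodes in both splits: {sorted(shared)}")
```

Running the stricter across-family check on a within-family split will fail by construction, which is a useful guard against reporting a within-family number under an across-family label.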
5. Coverage, baseline behavior, and empirical properties
The ARC-TGI release underlying ARC-Bench contains 461 generators spanning three sources: 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks, and 66 ARC-AGI-2 tasks. The ARC-AGI-1 coverage is further partitioned into 200 train and 15 eval/test tasks, while the ARC-AGI-2 coverage is partitioned into 55 train and 11 eval/test tasks (Lehmann et al., 5 Mar 2026). Because each generator defines a distribution rather than a single puzzle instance, the suite supports scalable sampling; the ARC-TGI-50N setting samples 50 episodes per family (Lehmann et al., 5 Mar 2026).
Distributional analyses in the ARC-TGI paper indicate that the generated families preserve dominant size modes of original tasks while expanding coverage through within-family variation. The paper also reports generator-by-model heatmaps that show stable difficulty ordering across models, with long-tail hard families preserved under resampling (Lehmann et al., 5 Mar 2026). This is significant because it suggests that the benchmark is not merely generating interchangeable easy variants but is retaining family-specific structure that remains discriminative across systems.
Representative baseline results are reported directly on ARC-TGI-50N in terms of exact-match accuracy for Qwen3-30B and Claude Sonnet 4.5, and the difficulty matrices exhibit stable, family-specific behavior under resampling (Lehmann et al., 5 Mar 2026). The recommended reporting protocol therefore includes MacroAcc, Acc_held-out, generator-by-model heatmaps, and per-family bar plots, since aggregate accuracy alone can obscure the difficulty structure of the suite (Lehmann et al., 5 Mar 2026).
The paper also specifies a fine-tuning protocol for benchmarking transfer. The described LoRA configuration fixes the adapter hyperparameters and dropout rate, trains for 10 epochs with AdamW under a stated learning rate, warmup, and weight decay, and caps the context at 14k tokens. Under this protocol, the paper reports significant ID gains for Phi-4 and Llama-3.1-8B, limited transfer to ARC-AGI-1 eval, and generator-specific improvement and decline patterns (Lehmann et al., 5 Mar 2026). A plausible implication is that ARC-Bench exposes not only absolute performance but also heterogeneity in how models respond to synthetic resampling and family-level fine-tuning.
6. Positioning, alternate usages, and open issues
Relative to original ARC-AGI and ARC-Mini, ARC-Bench preserves the open-ended train/test format but adds resampling, solver-facing reasoning templates, partial-evaluation witnesses, and first-class episode-level constraints. This makes possible robustness sweeps and matched-distribution studies such as within-family versus across-family generalization that are infeasible on one-off puzzles (Lehmann et al., 5 Mar 2026). Relative to generator frameworks such as ARC-DSL, ReARC, and ARC-GEN, the ARC-TGI formulation specifically adds step-by-step natural-language reasoning aligned to each sampled instance and a create_grids stage that emphasizes episode-level constraints, together with human validation and self-verifying exports (Lehmann et al., 5 Mar 2026).
The benchmark is also positioned in relation to newer synthetic reasoning suites. CellARC is explicitly presented as a complement and extension, not a replacement: it isolates local-rule induction in multicolor one-dimensional cellular automata, offers unlimited sampling and explicit difficulty knobs such as alphabet size, radius, Langton’s $\lambda$, coverage, and cell entropy, and is intended for use alongside ARC-Bench to disentangle generalization due to human priors from formal local-rule induction (Lžičař, 11 Nov 2025). This suggests a broader research program in which ARC-Bench provides human-authored, object-centric abstraction tasks while CellARC provides tightly controlled studies of local-rule inference under reproducible complexity controls.
The term “ARC-Bench” is not fully standardized across the literature. In the GLM-4.5 paper, “ARC-Bench” does not refer to the ARC-TGI suite at all; it denotes the authors’ comprehensive evaluation suite for Agentic, Reasoning, and Coding capabilities, aggregating 12 benchmarks: TAU-Bench, BFCL v3, BrowseComp, MMLU-Pro, AIME 24, MATH-500, SciCode, GPQA, HLE, LiveCodeBench, SWE-bench Verified, and Terminal-Bench. In that usage, the aggregate ARC ranking is the model’s average across those 12 tasks, and representative scores reported for GLM-4.5 are $70.1\%$ on TAU-Bench, $91.0\%$ on AIME 24, and $64.2\%$ on SWE-bench Verified (Team et al., 8 Aug 2025). In encyclopedia usage, the surrounding citation context is therefore essential for disambiguation.
Several limitations remain open in the ARC-TGI-based conception of ARC-Bench. The paper notes coverage gaps despite the 461 released families, persistent long-tail difficulty, the need for stronger automatic checks for degeneracy and ambiguity, opportunities for reusable disambiguation templates and richer symmetry and compositional constraints, and unresolved questions about community-standardized budgets, reporting conventions, and human-solvability audits such as time-to-solve and inter-annotator agreement (Lehmann et al., 5 Mar 2026). This suggests that ARC-Bench is best understood not as a finished benchmark artifact but as a procedural, constraint-aware framework for evaluating abstraction and reasoning under matched distributions with controlled nuisance variation.