Don't throw the baby out with the bathwater: How and why deep learning for ARC (2506.14276v1)

Published 17 Jun 2025 in cs.AI and cs.LG

Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems. Despite the typically low performance on ARC, the deep learning paradigm remains the most effective known strategy for generating skillful (state-of-the-art) neural networks (NN) across varied modalities and tasks in vision, language etc. The deep learning paradigm has proven to be able to train these skillful neural networks and learn the abstractions needed in these diverse domains. Our work doubles down on that and continues to leverage this paradigm by incorporating on-the-fly NN training at test time. We demonstrate that fully committing to deep learning's capacity to acquire novel abstractions yields state-of-the-art performance on ARC. Specifically, we treat both the neural network and the optimizer (rather than just a pre-trained network) as integral components of the inference process, fostering generalization to unseen tasks. Concretely, we propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning. We also propose Test-Time Fine-Tuning (TTFT) and the Augment Inference Reverse-Augmentation and Vote (AIRV) as effective test-time techniques. We are the first to propose and show deep learning can be used effectively for ARC, showing boosts of up to 260% in accuracy with AIRV and a further 300% boost with TTFT. An early version of this approach secured first place in the 2023 ARCathon competition, while the final version achieved the current best score on the ARC private test-set (58%). Our findings highlight the key ingredients of a robust reasoning system in unfamiliar domains, underscoring the central mechanisms that improve broad perceptual reasoning.

Summary

  • The paper demonstrates a novel deep learning approach that integrates Test-Time Fine-Tuning (TTFT) and AIRV to achieve state-of-the-art ARC performance, reaching 58% accuracy on the private test set.
  • It leverages an LLM with in-context learning, direct output generation, multi-task training, and synthetic riddle generation to enhance both perceptual and qualitative reasoning.
  • The approach adapts dynamically during inference using robust augmentations and fine-tuning strategies, overcoming the limitations of frozen models in complex ARC tasks.

This paper presents a novel deep learning methodology for tackling the Abstraction and Reasoning Corpus (ARC-AGI) (2506.14276), a challenging benchmark for AI that requires inferring underlying patterns from a few examples. The authors argue that ARC tasks are primarily perceptual and qualitative, and thus, the deep learning paradigm—encompassing both the neural network architecture and the optimization process—is well-suited if applied dynamically during inference. Their approach achieves state-of-the-art results on the ARC private test set, demonstrating significant improvements over baseline methods.

The core contributions revolve around enhancing LLMs for ARC reasoning through specific pre-training strategies and innovative test-time techniques: Test-Time Fine-Tuning (TTFT) and Augment Inference Reverse-Augmentation and Vote (AIRV).

Solution Part 1: Emphasizing In-Context Learning (ICL)

The foundation of the solution is an LLM fine-tuned for ARC, leveraging its inherent in-context learning capabilities.

  1. Model Choice and Input Representation:
    • The LongT5 encoder-decoder model is used, chosen for its extended context length and the non-causal attention mechanism in its encoder. Non-causal attention allows every token in the input to attend to every other token, which is crucial for understanding the relationships between all provided examples in an ARC task simultaneously.
    • ARC riddles (tasks) are formatted into a single text sequence. Input and output grids are unrolled row-wise, pixel colors are represented numerically, and rows are separated by spaces. Keywords like "solve:", "train input1", "output1", "test tinput1" structure the prompt (a serialization sketch appears after this list). An example prompt format is:
      solve: train input1 <grid_data> output1 <grid_data> train input2 <grid_data> output2 <grid_data> test tinput1 <grid_data> toutput1
      The model is trained to predict the pixel sequence for toutput1.
  2. Direct Output Generation: The model directly generates the output grid's pixel sequence. This is contrasted with approaches that generate code as an intermediate step, which the authors argue is a harder task than direct prediction.
  3. Multi-task Training: To enhance contextual reasoning and reduce overfitting to ARC-specific patterns, the model is trained on a mixture of ARC tasks and various NLP datasets (e.g., SQuAD, instruction-following datasets). This forces the model to manage diverse contexts.
  4. Code Pre-training: Pre-training on coding tasks provides a significant performance boost on ARC. Code datasets demand meticulous attention to detail, contextual understanding (tracking variables, dependencies), and hierarchical reasoning, which are beneficial for ARC.
  5. Automatic Riddle Generators: To expand the ARC training data, synthetic riddles are generated using Domain Specific Languages (DSLs) inspired by existing ARC community tools. The model is trained not only to predict the output grid but also the DSL function names and parameters that could have generated the riddle. This dual-prediction strategy improves robustness. Generated riddles are often "over-specified" to ensure the model can unambiguously determine the solution, preventing it from learning to encode ambiguities. Appendix A details various synthetic data sources, including arithmetic tasks, multimodal grid translations, extended PCFG datasets, cellular automata, and mathematical patterns, often using frameworks like ARC_gym.

    Examples of synthetic data categories:

    • Mirror-removal, fill-in-the-shapes
    • Fractal generation (mono and multi-color)
    • Area-repair based on mathematical equations
    • Cellular automata evolution
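
The paper does not list its serialization code; the following is a minimal sketch of how a riddle could be flattened into the prompt format described in item 1, assuming grids are lists of rows of integer color codes. The helper names format_grid and format_riddle_for_LLM match those referenced in the pseudocode blocks below, but the details (e.g., whether pixels within a row are space-separated) are illustrative assumptions.

# Sketch: serialize an ARC riddle into the "solve: ..." prompt format (illustrative).
def format_grid(grid):
    # Unroll row-wise: pixels in a row are concatenated, rows are separated by spaces.
    return " ".join("".join(str(pixel) for pixel in row) for row in grid)

def format_riddle_for_LLM(train_examples, test_input_grid):
    # train_examples: list of (input_grid, output_grid) pairs.
    parts = ["solve:"]
    for i, (inp, out) in enumerate(train_examples, start=1):
        parts.append(f"train input{i} {format_grid(inp)} output{i} {format_grid(out)}")
    parts.append(f"test tinput1 {format_grid(test_input_grid)} toutput1")
    return " ".join(parts)

# Example: format_riddle_for_LLM([([[1, 0], [0, 1]], [[1, 0], [0, 1]])], [[2, 2], [0, 0]])
# -> "solve: train input1 10 01 output1 10 01 test tinput1 22 00 toutput1"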

# Pseudocode for DSL-based riddle generation and training
def generate_dsl_riddle():
    function_name = sample_dsl_function()
    params = sample_params_for_function(function_name)
    train_examples = []
    for _ in range(NUM_TRAIN_PAIRS):
        input_grid = generate_random_grid()
        output_grid = apply_dsl_function(function_name, params, input_grid)
        train_examples.append((input_grid, output_grid))
    
    test_input_grid = generate_random_grid()
    test_output_grid = apply_dsl_function(function_name, params, test_input_grid)
    
    return train_examples, test_input_grid, test_output_grid, function_name, params

# During training:
# model_input = format_riddle_for_LLM(train_examples, test_input_grid)
# target_output = format_grid(test_output_grid)
# target_dsl_info = f"function: {function_name} params: {params_to_string(params)}"
# loss = model.train(model_input, [target_output, target_dsl_info])

Solution Part 2: Optimizing in the Evaluation Loop (Test-Time Fine-Tuning - TTFT)

TTFT involves adapting the pre-trained model to each specific ARC task encountered during evaluation.

  1. TTFT Data Generation: For a given test riddle, one of its demonstration (train) grid pairs $(x_j, y_j)$ is selected to serve as a new "test example." The remaining demonstration pairs form the "training examples" for this new, smaller, synthetic riddle. Since $y_j$ is known, this synthetic riddle has a ground truth, allowing for supervised fine-tuning.
  2. Augmentations for TTFT: To create more training data from the few examples in a riddle, several augmentations are applied to these synthetic riddles (a sketch of an augmentation helper follows this list):
    • Color permutation: Randomly shuffling color labels.
    • Spatial transformations: Rotations and flips from the dihedral group $D_4$.
    • Shuffling: Reordering the demonstration examples.
  3. Fine-tuning Process: The model undergoes a brief period of fine-tuning (full parameter updates) on these augmented, synthetic riddles derived from the current test task's examples before predicting the actual test instances of that task.
    • Motivation: TTFT allows the model to "reframe" its understanding of the task. Initial hypotheses might be incorrect due to limited processing or misinterpretation. TTFT provides feedback, akin to a human iteratively refining their approach. It also helps the model specialize its execution capabilities for the specific transformations in the current riddle, improving pixel-level accuracy.
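
The augmentation code is not given in the paper; below is a minimal sketch of the color-permutation, $D_4$, and example-shuffling augmentations, in the role of the augment_mini_riddle helper used by the TTFT pseudocode that follows. The helper names and the choice to permute all ten colors are illustrative assumptions.

# Sketch: TTFT augmentations (color permutation, D4 transforms, shuffling), illustrative.
import random

def permute_colors(grid, mapping):
    return [[mapping[pixel] for pixel in row] for row in grid]

def d4_transform(grid, k_rot, flip):
    # One of the 8 dihedral-group transforms: k_rot 90-degree rotations, optional flip.
    for _ in range(k_rot):
        grid = [list(row) for row in zip(*grid[::-1])]  # rotate 90 degrees clockwise
    if flip:
        grid = [row[::-1] for row in grid]
    return grid

def augment_mini_riddle(train_examples, test_input, test_output):
    # Sample one random augmentation and apply it consistently to every grid.
    colors = list(range(10))
    mapping = dict(zip(colors, random.sample(colors, len(colors))))
    k_rot, flip = random.randrange(4), random.random() < 0.5

    def aug(grid):
        return d4_transform(permute_colors(grid, mapping), k_rot, flip)

    augmented_train = [(aug(x), aug(y)) for x, y in train_examples]
    random.shuffle(augmented_train)  # reorder the demonstration examples
    return augmented_train, aug(test_input), aug(test_output)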

# Pseudocode for TTFT
def perform_ttft(model, original_riddle_train_examples, original_riddle_test_inputs):
    original_weights = model.get_weights()
    synthetic_ttft_data = []
    for i in range(len(original_riddle_train_examples)):
        # Create a new mini-riddle for TTFT
        ttft_test_pair = original_riddle_train_examples[i]
        ttft_train_examples = original_riddle_train_examples[:i] + original_riddle_train_examples[i+1:]
        
        if not ttft_train_examples: continue # Need at least one train example for the mini-riddle

        # Augment this mini-riddle
        for _ in range(NUM_TTFT_AUGMENTATIONS):
            augmented_train, augmented_test_input, augmented_test_output = augment_mini_riddle(
                ttft_train_examples, ttft_test_pair[0], ttft_test_pair[1]
            )
            synthetic_ttft_data.append(
                (format_riddle_for_LLM(augmented_train, augmented_test_input), 
                 format_grid(augmented_test_output))
            )
    
    # Fine-tune the model on this synthetic data
    model.fine_tune(synthetic_ttft_data)
    
    # Perform inference on original test inputs
    predictions = []
    for test_input in original_riddle_test_inputs:
        # Note: The prompt now includes the original_riddle_train_examples
        prompt = format_riddle_for_LLM(original_riddle_train_examples, test_input)
        predictions.append(model.predict(prompt))
        
    model.set_weights(original_weights) # Restore original weights for the next riddle
    return predictions

  4. Attention and Masking: The non-causal attention in the LongT5 encoder is crucial here, allowing holistic processing of riddle examples. Causal decoders would struggle because earlier tokens could not attend to later parts of the riddle (e.g., output grids when processing input grids).
  5. Beam Search: Used during decoding to generate output grids autoregressively. It maintains multiple candidate solutions, improving robustness over greedy decoding (a minimal decoding sketch follows).
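
As an illustration (the paper's decoding code is not given), beam search decoding with an encoder-decoder model could be run as follows, assuming a Hugging Face transformers LongT5 checkpoint; the checkpoint name and generation parameters are placeholders rather than values reported in the paper.

# Sketch: beam-search decoding with a LongT5 encoder-decoder (illustrative values).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")

# Prompt in the format described earlier (here a toy riddle, serialized by hand).
prompt = "solve: train input1 10 01 output1 10 01 test tinput1 22 00 toutput1"
inputs = tokenizer(prompt, return_tensors="pt")

# Beam search keeps several candidate pixel sequences instead of committing greedily;
# returning the top 2 lines up with the top-2 exact-match evaluation protocol.
outputs = model.generate(**inputs, num_beams=4, num_return_sequences=2, max_new_tokens=64)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)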

Augment, Inference, Reverse-augmentation and Vote (AIRV)

AIRV is a test-time augmentation strategy to improve prediction consistency; a minimal sketch of the loop follows the steps below.

  1. Process:
    • Augment: Apply a spatial transformation (e.g., a rotation or flip from the dihedral group $D_4$) to the entire input riddle (all train and test grids).
    • Inference: Run the (potentially TTFT-ed) model on this transformed riddle to get a predicted output grid.
    • Reverse-augmentation: Apply the inverse of the initial spatial transformation to the predicted output grid.
    • Vote: Collect predictions from multiple different augmentations. The most frequent prediction (exact grid match after reversing) is chosen as the final answer.

    AIRV process diagram (adapted from Figure 1 in the paper): original riddle -> (1) augment (e.g., rotate) -> (2) inference on the augmented riddle -> (3) reverse the augmentation on the prediction -> (4) vote among the reversed predictions.

  2. Motivation: AIRV can generate duplicate predictions after reversal, allowing a voting mechanism to amplify consistent solutions and filter noise. This is effective for ARC where there's only one correct solution.
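
A minimal sketch of the AIRV loop is given below. The predict_grid argument stands in for a call to the (possibly TTFT-adapted) model, and only the rotation subgroup of $D_4$ is used for brevity; all names are illustrative rather than the paper's code.

# Sketch: Augment, Inference, Reverse-augmentation and Vote (AIRV), rotations only.
from collections import Counter

def rotate90(grid):
    return [list(row) for row in zip(*grid[::-1])]

def rotate_k(grid, k):
    for _ in range(k % 4):
        grid = rotate90(grid)
    return grid

def airv_predict(predict_grid, train_examples, test_input, num_rotations=4):
    votes = Counter()
    for k in range(num_rotations):
        # 1. Augment: rotate every grid in the riddle by k * 90 degrees.
        aug_train = [(rotate_k(x, k), rotate_k(y, k)) for x, y in train_examples]
        aug_test = rotate_k(test_input, k)

        # 2. Inference on the augmented riddle.
        prediction = predict_grid(aug_train, aug_test)

        # 3. Reverse-augmentation: undo the rotation on the predicted grid.
        prediction = rotate_k(prediction, 4 - k)

        # 4. Vote: identical reversed grids accumulate votes; the most frequent wins.
        votes[tuple(tuple(row) for row in prediction)] += 1

    best, _ = votes.most_common(1)[0]
    return [list(row) for row in best]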

Results and Analysis

  • Dataset Splits: ARC (400 train, 400 public eval, 100 private eval). Results are on the private test set.

  • Testing Setup: 2 hours runtime, single P100 GPU (16GB VRAM), top-2 exact match accuracy.

  • Performance:

    • The fully trained base LongT5 model achieved:
      • 5% accuracy (Zero Shot)
      • 13% accuracy (AIRV Only), a 2.6x improvement over Zero Shot
      • 39% accuracy (TTFT + AIRV), a further 3x improvement over AIRV Only
    • A later version achieved 58% on the ARC private test set.
  • Model Size vs. Pre-training:
    • Larger models generally perform better in Zero Shot and AIRV Only settings, even with less ARC-specific training data. This suggests model capacity (forward pass flexibility) is key for ICL.
    • However, extensive ARC pre-training significantly boosts performance in the TTFT + AIRV setting, more so than just increasing model size. The authors hypothesize that pre-training allows "core knowledge priors" and ARC-specific heuristics (e.g., preference for simpler transformations) to "sediment" into earlier layers of the network. This "sedimentation" creates "room" for new, task-specific features to emerge during TTFT in later layers, enabling more complex reasoning.

Implementation Considerations

  • Computational Requirements: The approach is designed to run within the Kaggle constraints (2 hours on a P100). TTFT adds to inference time, but its benefits are substantial.
  • Data Diversity: A rich and diverse training set, including synthetic data targeting various reasoning primitives, is crucial. The appendix lists numerous public and custom datasets for language, code, math, and visual reasoning.
  • Hyperparameters: The number of TTFT steps, the TTFT learning rate, the number of AIRV augmentations, and the beam search width are important hyperparameters (an illustrative configuration sketch follows this list).
  • Limitations of Alternatives: The paper argues that purely symbolic or program synthesis approaches often require significant human-engineered heuristics and struggle with the perceptual breadth of ARC tasks. Frozen LLMs, without fine-tuning or specific test-time adaptation, perform poorly.
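
For concreteness, an illustrative configuration object is sketched below; the field names follow the hyperparameters listed above, but the values are placeholders and are not reported in the paper.

# Sketch: test-time hyperparameters (placeholder values, not from the paper).
from dataclasses import dataclass

@dataclass
class TestTimeConfig:
    ttft_steps: int = 50              # fine-tuning steps per test riddle
    ttft_learning_rate: float = 1e-4  # learning rate used during TTFT
    ttft_augmentations: int = 16      # augmented mini-riddles generated per riddle
    airv_augmentations: int = 4       # spatial transforms tried per test input
    beam_width: int = 4               # beam search width during decoding
    top_k_predictions: int = 2        # matches the top-2 exact-match protocol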

Conclusion

The paper demonstrates that a deep learning approach, specifically using LLMs with tailored pre-training and sophisticated test-time adaptation (TTFT and AIRV), can achieve state-of-the-art performance on the ARC benchmark. The key is to leverage the learning capability of the model (NN + optimizer) during inference itself, allowing it to adapt to novel tasks on the fly. Extensive pre-training helps instill necessary priors, while TTFT enables dynamic reframing and specialization.
