OOD Generalization in Arithmetic

Updated 18 June 2026

OOD generalization in arithmetic concerns how models perform on tasks with value, structural, and distributional shifts beyond the training regime.
Empirical studies reveal that standard neural models exhibit catastrophic performance drops and specific failure modes such as carry-semantic bottlenecks when faced with OOD arithmetic challenges.
Architectural remedies including recursive latent reasoning, positional attention, and neuro-symbolic hybrids demonstrate significant improvements in handling OOD arithmetic tasks.

Out-of-distribution (OOD) generalization in arithmetic refers to the capacity of models—principally deep sequence models such as Transformers—to robustly solve arithmetic tasks whose form or instance complexity exceeds that encountered in training. In contrast to in-distribution (ID) generalization, where test samples are drawn from the same distribution as training data, OOD generalization quantifies extrapolative reasoning: handling inputs or compositional structure fundamentally outside the model’s prior experience. This capability is central to both the empirical and theoretical study of systematic generalization, algorithmic reasoning, and compositionality in machine learning.

1. Canonical OOD Regimes in Arithmetic

Arithmetic OOD generalization is typically probed along three principal axes:

Value OOD (range extrapolation): Test inputs contain numbers or digits outside the training range (e.g., training on 3-digit addition, testing on 5-digit).
Structural OOD (compositional complexity): Test expressions exhibit deeper/nested computation or greater width (e.g., longer proof trees or graphs, increased nesting).
Distributional OOD (semantic/categorical): Test problems depart from the syntactic or semantic patterns of training data (e.g., new logical forms, proof rules, or operator combinations).
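As a rough illustration of the first two axes, the following Python sketch builds a value-OOD split (3-digit training, 5-digit testing, as in the example above) and a structural-OOD split based on nesting depth. The function names, depth ranges, and sample counts are illustrative assumptions and are not taken from any of the cited benchmarks.

```python
import random

def sample_addition(num_digits, rng):
    """Return (prompt, answer) for adding two numbers with exactly num_digits digits."""
    lo, hi = 10 ** (num_digits - 1), 10 ** num_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"{a}+{b}", str(a + b)

def sample_nested(depth, rng):
    """Return a fully parenthesized expression whose nesting depth is exactly `depth`."""
    if depth == 0:
        return str(rng.randint(0, 9))
    op = rng.choice("+-*")
    left = sample_nested(depth - 1, rng)                   # keeps the depth exact
    right = sample_nested(rng.randint(0, depth - 1), rng)  # shallower or equal branch
    return f"({left}{op}{right})"

def nesting_depth(expr):
    """Maximum parenthesis nesting depth of an expression string."""
    depth = best = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            best = max(best, depth)
        elif ch == ")":
            depth -= 1
    return best

rng = random.Random(0)
# Value OOD: same operation, operands outside the training range.
train_value = [sample_addition(3, rng) for _ in range(1000)]
test_value_ood = [sample_addition(5, rng) for _ in range(1000)]
# Structural OOD: same leaf values, deeper nesting than seen in training.
train_struct = [sample_nested(rng.randint(1, 2), rng) for _ in range(1000)]
test_struct_ood = [sample_nested(rng.randint(4, 6), rng) for _ in range(1000)]
```

Keeping the leaf values identical across the structural split isolates the effect of depth from the effect of operand range, so each axis can be probed separately.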

Formally, given a training set $\mathcal{D}_{\mathrm{train}}$ sampled from $P_{\mathrm{train}}$ and a test set from $P_{\mathrm{test}}$, OOD generalization is measured when $\mathrm{supp}(P_{\mathrm{test}}) \not\subseteq \mathrm{supp}(P_{\mathrm{train}})$, with performance usually assessed via token- or sequence-level accuracy, recovery rates, or more structural success metrics such as complete symbolic equivalence or fully solved computational graphs (Altabaa et al., 15 Oct 2025, Voigt et al., 24 Sep 2025, Opedal et al., 2024).
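The difference between sequence-level and token-level accuracy can be made concrete with a minimal Python sketch. The function names and the toy predictions are assumptions chosen for illustration. Token accuracy here compares characters position by position from the left, which is only one of several possible conventions.

```python
def sequence_accuracy(predictions, targets):
    """Fraction of examples whose predicted string matches the target exactly."""
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)

def token_accuracy(predictions, targets):
    """Fraction of target characters reproduced at the same left-aligned position."""
    correct = total = 0
    for p, t in zip(predictions, targets):
        total += len(t)
        correct += sum(pc == tc for pc, tc in zip(p, t))
    return correct / total

# Toy data: one exact match, one wrong tens digit, one dropped leading digit.
targets = ["579", "1110", "24690"]
predictions = ["579", "1100", "4690"]

print(sequence_accuracy(predictions, targets))  # 1 of 3 strings match exactly
print(token_accuracy(predictions, targets))     # 6 of 12 target characters match
```

The third example shows why the two metrics can diverge: dropping a single leading digit shifts every later character, so a left-aligned token metric scores the whole string as wrong even though four of its five digits were produced.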

2. Empirical Failure Modes of Neural Models on Arithmetic OOD

Most standard neural architectures, including autoregressive Transformers, exhibit severe and systematic OOD generalization failures on arithmetic tasks:

Catastrophic performance drop: Generative models trained on $n$-digit arithmetic attain $0\%$ accuracy on $(n+2)$-digit or longer inputs, despite perfect ID performance (Xu et al., 2023, Wang et al., 2021, Shintani, 27 Mar 2026). This collapse is not distributed uniformly over the task space; it proceeds through distinct, empirically separable stages:

| Stage | Characterization | Canonical Repair |
| --- | --- | --- |
| Layout barrier | Absolute positional embeddings conflate format with semantics | Mixed-layout training |
| Carry-semantic bottleneck | Hundreds/thousands places treated as positional flags | Targeted carry probes |
| Conditional recomposition | Failure to combine upper and lower digits conditionally | Structured recomposition |
| Residual error (tens digit) | Systematic sign-dependent errors in final digits | Sign-aware tens repair |

Each stage requires a targeted intervention rather than generic data addition (Shintani, 27 Mar 2026).

Equivalence generalization: Transformers extend their learned function on $\mathbb{Z}_{p}$ to all of $\mathbb{N}$ by mapping OOD arithmetic queries to the equivalence class (i.e., residue modulo $p=10^n$ for $n$-digit training). This is a purely algebraic regularity and not true solution extrapolation (Xu et al., 2023).
Shallow compositionality: In-depth studies using symbolic regression and nested expressions show that end-to-end models memorize "point cloud snapshots" and do not reliably synthesize new symbolic structure when training and test supports diverge (Voigt et al., 24 Sep 2025, Petruzzellis et al., 2023).
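The equivalence-generalization behavior described above can be caricatured in a few lines of Python. The function `residue_predictor` is a hypothetical stand-in for a trained model and is not taken from Xu et al.; it assumes an idealized model that depends on its operands only through their residues modulo $p = 10^n$.

```python
def residue_predictor(a, b, n):
    """Hypothetical model stand-in: answers addition using only residues mod 10**n."""
    p = 10 ** n
    return ((a % p) + (b % p)) % p

n = 3
p = 10 ** n

# Small operands whose sum stays below p: the caricature is exactly right.
a, b = 123, 456
print(residue_predictor(a, b, n), a + b)

# OOD operands: the integer answer is wrong, yet it lies in the correct
# residue class modulo p, which is the algebraic regularity in question.
a, b = 12345, 67890
prediction = residue_predictor(a, b, n)
print(prediction, a + b, (a + b) % p)
```

Under this assumption the OOD output is systematically wrong as an integer but still agrees with the true sum modulo $p$. This is why the behavior counts as structured extension of the learned function rather than extrapolation of the solution.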

3. Architectures and Inductive Biases for Arithmetic OOD

Recent work demonstrates that augmenting model architectures with algorithmically aligned mechanisms yields dramatic OOD generalization gains:

Recursive latent space reasoning: Recurrent application of a Transformer block (input-adaptive recurrence) enables layer-wise computation aligned with the depth of the computational structure (e.g., a modular arithmetic DAG). When combined with intermediate-step latent supervision, discrete bottlenecks (anchoring representations to symbolic factors), and explicit error correction via random factor corruption, models can generalize from the graph sizes seen in training to larger OOD graphs while retaining high accuracy, a regime where all feedforward and standard chain-of-thought methods collapse (Altabaa et al., 15 Oct 2025).
Positional attention and PCOC alignment: Transformers constrained to use attention weights dependent solely on positional encodings can simulate parallel algorithms such as prefix-sum, min, or sort. Empirical results show value-OOD error orders of magnitude lower than standard self-attention when test-time value ranges are extrapolated beyond the train interval (Luca et al., 2024).
Grid-cell and DPP-based codes: Embedding inputs into periodic, translation- and scale-equivariant grid-cell representations, then selecting maximally diverse, high-variance supports via determinantal point process (DPP) attention, yields near-perfect OOD accuracy on both additive and moderate multiplicative arithmetic tasks—even as inputs are translated or scaled far outside the training region (Mondal et al., 2023).
Neuro-symbolic hybrids: Pipelined systems that learn simple substitution rules (e.g., innermost subexpression evaluation) and apply them iteratively via a symbolic combiner generalize robustly to compositional depths beyond those seen in training. This approach outperforms both standard seq2seq models and LLMs when evaluating nested arithmetic of unseen complexity (Petruzzellis et al., 2023).
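The iterative substitution idea behind the neuro-symbolic hybrid can be sketched as follows. This is a minimal illustration and not the pipeline of Petruzzellis et al.: the learned solver is replaced by a plain Python function (`solve_atomic`, a hypothetical name), and the combiner is a regular-expression rewrite loop over fully parenthesized integer expressions.

```python
import re

# Matches a parenthesized binary operation with no parentheses inside it,
# i.e. an innermost subexpression such as "(4-2)" or "(-4*-2)".
INNERMOST = re.compile(r"\((-?\d+)([+\-*])(-?\d+)\)")

def solve_atomic(left, op, right):
    """Stand-in for the learned component: evaluates one flat binary operation."""
    a, b = int(left), int(right)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    return a * b

def evaluate(expr, max_steps=1000):
    """Rewrite the leftmost innermost subexpression until none remain."""
    for _ in range(max_steps):
        match = INNERMOST.search(expr)
        if match is None:
            return expr
        value = solve_atomic(*match.groups())
        expr = expr[:match.start()] + str(value) + expr[match.end():]
    raise RuntimeError("expression did not reduce within max_steps")

print(evaluate("((2+3)*(4-(1+1)))"))  # 10
```

Because the solver only ever sees flat, single-operation inputs, the difficulty of each call does not grow with nesting depth. Only the number of rewrite steps grows, which is the intuition for why this kind of decomposition tolerates deeper expressions.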

4. Theoretical Analyses: Optimization Bias and OOD Extrapolation

The OOD generalization properties of gradient-trained models on Boolean and arithmetic tasks are increasingly understood through a Fourier-analytic lens:

Min-degree interpolator bias: For sufficiently expressive models (random features, NTK, diagonal deep linear nets, and empirically, Transformers), gradient descent converges to the unique interpolator that minimizes the total Fourier mass at the highest degrees, consistent with an "Occam's razor" principle in the degree-profile of representations. This explains, for instance, the length-generalization failure in parity: limiting the observed Hamming weight in training will guarantee OOD test error outside support, as the model simply extrapolates the lowest-degree fit (Abbe et al., 2023).
Degree-curriculum learning: Leveraging the min-degree bias, curriculum learning—incrementally increasing the support (e.g., Hamming radius)—enables efficient learning of all monomials up to target degree, matching the structure of the true task (Abbe et al., 2023).
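A toy example helps show why a low-degree bias produces errors outside the training support. The Python sketch below is an illustrative construction and is not code or an experiment from Abbe et al.: it takes the degree-2 target $f(x) = x_1 x_2$ on $\{-1,+1\}^3$, restricts training to inputs with $x_1 = +1$, and compares $f$ with the degree-1 function $g(x) = x_2$ that fits the training data equally well. No gradient descent is run; the code only checks the two candidate interpolators.

```python
from itertools import combinations, product
from math import prod

N = 3
CUBE = list(product([-1, 1], repeat=N))

def target(x):
    """Degree-2 target: the monomial x1 * x2 (0-indexed as x[0] * x[1])."""
    return x[0] * x[1]

def low_degree_fit(x):
    """Degree-1 candidate that agrees with the target whenever x1 = +1."""
    return x[1]

def degree(func):
    """Largest |S| whose Fourier coefficient E[func(x) * prod_{i in S} x_i] is nonzero."""
    best = 0
    for k in range(N + 1):
        for subset in combinations(range(N), k):
            coeff = sum(func(x) * prod(x[i] for i in subset) for x in CUBE) / len(CUBE)
            if abs(coeff) > 1e-12:
                best = k
    return best

train = [x for x in CUBE if x[0] == 1]    # observed region
unseen = [x for x in CUBE if x[0] == -1]  # held-out region

train_agreement = all(target(x) == low_degree_fit(x) for x in train)
unseen_error = sum(target(x) != low_degree_fit(x) for x in unseen) / len(unseen)

print(degree(target), degree(low_degree_fit))
print(train_agreement, unseen_error)
```

Both functions interpolate the training set, but the lower-degree one disagrees with the target on every unseen input. A learner biased toward low-degree solutions would therefore fit the data perfectly and still fail completely on the held-out region. A curriculum that gradually widens the support is meant to remove this ambiguity.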

5. Compositionality and OOD Benchmarks

Recent evaluation frameworks are specifically designed to probe compositional and proof-theoretic OOD generalization:

MathGAP: This dataset generator samples arithmetic word problems with programmable proof-tree depth, width, and nonlinearity. State-of-the-art LLMs achieve high accuracy on ID splits, but OOD performance decays steeply as proof depth, width, and structural complexity increase, and even the strongest models (e.g., GPT-4o) lose substantial accuracy on nonlinear deep proofs. Order sensitivity and nonmonotonic context effects highlight the fragility of even the most capable models (Opedal et al., 2024).
Symbolic regression OOD: Pre-trained transformer-based symbolic regression models cannot recover ground-truth formulas when required to extrapolate even modestly outside the pre-training domain. Recovery rates for leading methods fall sharply between in-distribution and out-of-domain evaluation, with only hybrid search-based systems maintaining strong OOD performance (Voigt et al., 24 Sep 2025).

6. Architectural Remedies and Empirical Recommendations

Across studies, the consistent finding is that OOD generalization in arithmetic, as in other algorithmic domains, is fundamentally limited by statistical pattern-matching and the absence of algorithm-aligned inductive biases. Effective remedies demonstrated in the literature include:

Explicit architectural symmetry to align with task structure (e.g., recurrence for DAGs, positional-only attention for index-based algorithms).
Latent or step-wise supervision corresponding to algorithmic stages (layerwise outputs, subexpression supervision).
Discretization or bottleneck mechanisms to anchor representation and prevent drift.
Hybrid neuro-symbolic modeling and structured curricula for gradual compositional complexity (Altabaa et al., 15 Oct 2025, Abbe et al., 2023, Petruzzellis et al., 2023).
Evaluative protocols that measure not only accuracy but also symbolic recovery, sequence length scaling, and error localization by failure mode.
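Error localization, the last item above, can be approximated for digit-string outputs by counting which digit places are wrong. The Python sketch below is one simple way to do this under stated assumptions: answers are plain digit strings, predictions are right-aligned against targets so that digit places line up, and missing digits count as errors. The function names are illustrative.

```python
from collections import Counter

def digit_errors(prediction, target):
    """Return the digit places (0 = units, 1 = tens, ...) where the strings differ."""
    width = max(len(prediction), len(target))
    # Right-align so the units digits line up; "_" marks a missing digit.
    p, t = prediction.rjust(width, "_"), target.rjust(width, "_")
    return [width - 1 - i for i in range(width) if p[i] != t[i]]

def error_profile(predictions, targets):
    """Count errors per digit place across a whole evaluation set."""
    counts = Counter()
    for p, t in zip(predictions, targets):
        counts.update(digit_errors(p, t))
    return dict(sorted(counts.items()))

# Toy data: one exact match, one wrong tens digit, one dropped leading digit.
targets = ["579", "1110", "24690"]
predictions = ["579", "1100", "4690"]
print(error_profile(predictions, targets))  # {1: 1, 4: 1}
```

A profile concentrated on one place (for example, the tens digit) points to a specific failure mode, whereas errors spread evenly across places suggest a more general breakdown. Aggregate accuracy alone cannot distinguish the two.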

7. Open Problems and Future Directions

Despite recent algorithmic and theoretical advances, true systematic OOD generalization in arithmetic remains unsolved. Outstanding challenges include:

Generalizing across both value and structural OOD simultaneously (e.g., high-arity, deeply nested, and variable-width proofs with large numerics).
Robustness to paraphrasing, permutation, and novel compositional rules in the context of word problems and symbolic regression (Opedal et al., 2024).
Bridging the gap between neural and symbolic representations to achieve human-level extrapolation in arithmetic reasoning (Voigt et al., 24 Sep 2025).
Rigorous sample complexity controls and guarantees for architectures that tightly align with algorithmic domains (Luca et al., 2024).
Extending neuro-symbolic and grid-cell/DPP strategies to broader classes of reasoning tasks.

Emerging research suggests curriculum approaches mixing ID and OOD complexity, explicit compositional mechanisms, and error-factor supervision as promising directions. Standardizing OOD evaluation with public benchmarks and reporting both accuracy and symbolic recovery metrics is emphasized as a necessary baseline for progress (Voigt et al., 24 Sep 2025, Opedal et al., 2024, Altabaa et al., 15 Oct 2025, Petruzzellis et al., 2023).