Compositional Depth Generalization

Updated 18 June 2026

Compositional Depth Generalization is the capacity of systems to perform deeper, recursive reasoning by modularly reusing learned primitives for algorithmic and hierarchical tasks.
Advanced architectures like Depth-Recurrent Transformers and Neural-Symbolic Stack Machines overcome fixed-depth constraints through shared-weight recurrence and dynamic memory mechanisms.
Empirical benchmarks in tasks such as graph reachability and nested logic reveal a sharp computational frontier where performance transitions rapidly from chance-level to near-perfect accuracy.

Compositional depth generalization is the capacity of learning systems to extrapolate to instances requiring strictly deeper—and more recursive—internal reasoning than observed at training time, by reusing the same compositional primitives modularly. It is a central desideratum for models aiming to perform algorithmic, symbolic, or hierarchical tasks, particularly those involving multi-hop reasoning, deeply nested logic, or variable-length process chains. Recent research has delineated the limitations of standard deep learning architectures for this regime and introduced new models and benchmarks to precisely characterize and advance compositional depth generalization.

1. Formal Definition and Motivation

Compositional depth generalization refers specifically to a model's ability to solve problems whose minimum required reasoning depth $D$ (the number of sequential, compositional inference steps) exceeds anything encountered during training, using repeated applications of a learned algorithmic primitive. The notion is distinct from mere width or breadth generalization and is critical for domains where solutions involve variable-depth computation (e.g., $D$-hop graph queries, expressions with $D$ levels of nesting, or recursive parsing).

Traditional fixed-depth architectures, such as Transformers with $L$ layers, are fundamentally limited in this respect: they apply exactly $L$ layers of computation regardless of input complexity, and thus cannot extrapolate to problems of depth $D > L$. Circuit-theoretically, such models remain within the $\mathsf{TC}^0$ class, which excludes inherently sequential or recursive computations with input-dependent complexity (e.g., multi-hop or loop-like inference) (Chen, 23 Mar 2026).
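The depth constraint can be illustrated with a toy pointer-chasing task. The sketch below is not taken from any cited model: it assumes, for simplicity, that each sequential step performs at most one hop, and the names `hop`, `chase`, and the chain-shaped `pointer` table are hypothetical.

```python
from typing import Dict


def hop(pointer: Dict[int, int], node: int) -> int:
    """One reusable primitive: follow a single pointer."""
    return pointer[node]


def chase(pointer: Dict[int, int], start: int, steps: int) -> int:
    """Apply the same primitive `steps` times in sequence.

    `steps` plays the role of the computation budget (assumed here to be
    one hop per layer or per recurrence).
    """
    node = start
    for _ in range(steps):
        node = hop(pointer, node)
    return node


# A chain 0 -> 1 -> ... -> 9; node 9 points to itself.
pointer = {i: min(i + 1, 9) for i in range(10)}

D = 6                              # required reasoning depth of the query
target = chase(pointer, 0, D)      # ground-truth answer for a 6-hop query

shallow = chase(pointer, 0, 4)     # budget of 4 steps: stops short of the target
deep = chase(pointer, 0, 6)        # budget of 6 steps: reaches the target

print(target, shallow, deep)
```

Under this one-hop-per-step assumption, a budget smaller than $D$ cannot produce the right answer, whereas reusing the same primitive for more steps can.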

2. Model Architectures for Depth Generalization

A range of architectures have been explored to overcome the limitations inherent to fixed-depth computation:

Depth-Recurrent Transformers (DRT): DRT decouples parameter count from computational depth by repeatedly applying a single shared-weight Transformer block $T$ times in latent space. At each recurrence, the same parameters $f_\theta$ transform the hidden state $H^{(t)}$, with gating mechanisms (such as identity-biased recurrence) providing gradient stability over many recurrences. The silent thinking objective supervises only the final output after $T$ steps, which prevents exploitation of shallow heuristics and enforces genuine multi-step reasoning (Chen, 23 Mar 2026).
Compositional Recursive Learners (CRL): CRL introduces an explicit pool of neural modules and a controller trained (typically with policy gradients) to compose these modules recursively. By learning to parse problems into subproblems on a compositional problem graph and assembling solution trajectories of arbitrary depth, CRL achieves strong extrapolation to problems with greater compositional depth than seen during training (Chang et al., 2018).
Neural-Symbolic Stack Machines (NeSS): NeSS leverages a neural controller that generates execution traces, which are then interpreted by a symbolic stack machine capable of unbounded recursion via push/pop operators. This structure allows NeSS to generalize with perfect accuracy to compositional depths (measured by sequence length or parse tree depth) orders of magnitude beyond training (Chen et al., 2020).
Tree Stack Memory Units (Tree-SMU): Tree-SMU embeds differentiable stack memory units within recursive neural networks. By retaining locality and enabling ordered, long-range dependencies through a LIFO stack at each node of a computation tree, Tree-SMU elevates both the depth and productivity of compositional generalization for symbolic mathematical reasoning (Arabshahi et al., 2019).
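To make the controller/stack-machine split concrete, here is a minimal sketch in Python. It is not the NeSS instruction set: the operations `PUSH`, `ADD`, and `MUL`, the tuple-based expression format, and the hand-written `controller` (standing in for a learned neural controller) are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple, Union

Expr = Union[int, tuple]                # e.g. ("ADD", 1, ("MUL", 2, 3))
Instruction = Tuple[str, Optional[int]]


def controller(expr: Expr) -> List[Instruction]:
    """Stand-in for the neural controller: emit a postfix execution trace."""
    if isinstance(expr, int):
        return [("PUSH", expr)]
    op, left, right = expr
    return controller(left) + controller(right) + [(op, None)]


def run_trace(trace: List[Instruction]) -> int:
    """Symbolic stack machine: executes the trace with push/pop only."""
    stack: List[int] = []
    for name, arg in trace:
        if name == "PUSH":
            stack.append(arg)
        elif name in ("ADD", "MUL"):
            right, left = stack.pop(), stack.pop()
            stack.append(left + right if name == "ADD" else left * right)
        else:
            raise ValueError(f"unknown instruction: {name}")
    if len(stack) != 1:
        raise ValueError("malformed trace")
    return stack[0]


small = run_trace(controller(("ADD", 1, ("MUL", 2, 3))))

# Build an expression nested 300 levels deep: 1 + (1 + (1 + ... + 1)).
deep_expr: Expr = 1
for _ in range(300):
    deep_expr = ("ADD", 1, deep_expr)
deep = run_trace(controller(deep_expr))

print(small, deep)
```

The stack machine itself has no notion of maximum depth, so once the trace is correct the result is exact at any nesting level. The recursive `controller` here is limited only by Python's recursion limit, which is why the example stops at 300 levels.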

Many alternative approaches fail to achieve comparable productivity. Standard MLP, CNN, ResNet, and vanilla Transformer models all exhibit rapidly deteriorating generalization as compositional depth moves out of distribution (Klinger et al., 2020, Chen, 23 Mar 2026).

3. Benchmarks and the Computational Frontier

Empirical study of compositional depth generalization requires benchmarks where intrinsic task depth can be precisely controlled and measured. Canonical testbeds include:

Graph reachability: Determining whether one node is reachable from another within $D$ hops, where $D$ controls task depth. DRT exhibits a step-function "computational frontier": accuracy jumps from chance to near-perfect exactly when the number of recurrence steps suffices for the task depth, $T \geq D$ (Chen, 23 Mar 2026).
Nested logic (Boolean expressions): Parsing and evaluating propositional logic with $D$ levels of nested clauses tests recursive reasoning. Here the frontier is more gradual, and DRTs degrade gracefully beyond the training depth, remaining robust up to a multiple of the training range (Chen, 23 Mar 2026).
CFG parsing and SCAN navigation: Parsing or sequencing tasks with recursion-dependent length—NeSS attains 100% generalization even for inputs up to 5000 steps, surpassing all neural-only baselines that collapse at much smaller depths (Chen et al., 2020).
ConceptWorld (relational vision): Classifying images rendered from logical concepts of controlled compositional depth exposes the rapid decline in performance for standard models as depth grows (Klinger et al., 2020).
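A depth-controlled benchmark of the nested-logic kind can be sketched in a few lines of Python. This is an illustrative generator under assumed conventions (tuple-encoded expressions, operators `AND`, `OR`, `NOT`), not the data pipeline of any cited benchmark; `make_expr`, `depth_of`, and `evaluate` are hypothetical names.

```python
import random
from typing import Union

Expr = Union[bool, tuple]


def make_expr(depth: int, rng: random.Random) -> Expr:
    """Sample a Boolean expression whose nesting depth is exactly `depth`."""
    if depth == 0:
        return rng.choice([True, False])
    op = rng.choice(["AND", "OR", "NOT"])
    deep_child = make_expr(depth - 1, rng)      # this child carries the full depth
    if op == "NOT":
        return ("NOT", deep_child)
    other = make_expr(rng.randint(0, depth - 1), rng)   # sibling may be shallower
    children = [deep_child, other]
    rng.shuffle(children)
    return (op, children[0], children[1])


def depth_of(expr: Expr) -> int:
    """Nesting depth: 0 for a literal, else 1 + the deepest child."""
    if isinstance(expr, bool):
        return 0
    return 1 + max(depth_of(child) for child in expr[1:])


def evaluate(expr: Expr) -> bool:
    """Ground-truth label obtained by recursive evaluation."""
    if isinstance(expr, bool):
        return expr
    if expr[0] == "NOT":
        return not evaluate(expr[1])
    left, right = evaluate(expr[1]), evaluate(expr[2])
    return (left and right) if expr[0] == "AND" else (left or right)


rng = random.Random(0)
train = [make_expr(d, rng) for d in (1, 2, 3) for _ in range(4)]   # in-distribution depths
test = [make_expr(d, rng) for d in (4, 5, 6) for _ in range(4)]    # strictly deeper
print(max(map(depth_of, train)), min(map(depth_of, test)))
```

Because depth is fixed by construction, train and test sets can be separated cleanly by $D$, which is the property the benchmarks above rely on.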

A common observation is the existence of a "computational frontier": a sharp accuracy transition from chance to high (or from generalization to failure) as a function of inference depth $T$ versus task depth $D$. The shape of this frontier (step-like, gradual, or strictly monotonic) depends on how strongly the model's architecture couples reasoning to explicit depth and what inductive biases are present (Chen, 23 Mar 2026).
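The step-like frontier can be reproduced with a hand-coded stand-in. The Python sketch below does not train a DRT; it assumes an idealized procedure in which each step extends a reachability frontier by exactly one hop, and evaluates it on a balanced pair of examples (one reachable target at distance $D$, one unreachable target). The function names are hypothetical.

```python
from typing import Dict, List


def reachable_within(adj: Dict[int, List[int]], source: int, target: int, steps: int) -> bool:
    """Expand the set of reached nodes for `steps` rounds, one hop per round."""
    reached = {source}
    for _ in range(steps):
        reached = reached | {v for u in reached for v in adj.get(u, ())}
    return target in reached


def accuracy(task_depth: int, steps: int) -> float:
    """Accuracy on a balanced toy set: one positive at distance `task_depth`, one negative."""
    # Chain 0 -> 1 -> ... -> task_depth; node -1 is isolated (unreachable).
    adj = {i: [i + 1] for i in range(task_depth)}
    examples = [(task_depth, True), (-1, False)]
    correct = sum(reachable_within(adj, 0, target, steps) == label
                  for target, label in examples)
    return correct / len(examples)


D = 5
curve = {T: accuracy(D, T) for T in range(1, 9)}
print(curve)
```

In this idealized setting the curve sits at chance (0.5) for every $T < D$ and at 1.0 for every $T \geq D$, which matches the step shape described for graph reachability. Learned models with weaker coupling between steps and hops would not necessarily show such a clean transition.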

Table: Out-of-Distribution Generalization Across Representative Architectures

Model	Task/Domain	OOD Accuracy (Sufficient Steps)	Collapse Depth/Regime
DRT	Graph reachability	100% at OOD depths beyond training	Chance (50%) with insufficient steps (Chen, 23 Mar 2026)
DRT	Nested logic	>90% at OOD depth	Graceful depth degradation
DRT	Unstructured text	81.7%	No sharp collapse
NeSS	CFG parsing/SCAN	100% (test length up to 5000)	Baselines around 1% or below (Chen et al., 2020)
CRL	Arithmetic/image	Around 80% or above beyond train depth	Baselines around 40% or below (Chang et al., 2018)
Tree-SMU	Math: verification	79.6% (depth 8–19 OOD)	Baselines 61–77%

4. Architectural Mechanisms for Stable Depth Extrapolation

Repeated application of neural primitives over many steps is susceptible to vanishing/exploding gradients, state collapse, or premature convergence to trivial solutions. Advanced models employ several stabilization techniques:

Silent thinking objective: Only final outputs are supervised, which precludes shortcut heuristics and enforces genuinely multi-step computation (Chen, 23 Mar 2026).
LayerScale initialization: Near-zero initialization of per-channel scaling parameters in residual branches maintains identity flow early in training, protecting untrained states against noise (Chen, 23 Mar 2026).
Identity-biased gated recurrence: GRU-style gating with a strong bias toward retaining the previous state preserves most of that state's information at each recurrence, enabling gradients to propagate through many recurrence steps (Chen, 23 Mar 2026).
Neural-symbolic interfaces: By separating differentiable control (controller network) and strict recursion (stack machine), models such as NeSS preserve the operational equivariance required for unbounded recursion (Chen et al., 2020).
Tree-structured stacks: For tasks with hierarchical structure, per-node stack memory enables robust skip connections to subtrees at arbitrary depth (Arabshahi et al., 2019).
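The effect of identity-biased gating on gradient flow can be checked numerically on a scalar toy recurrence. The sketch below is an assumption-laden simplification, not the DRT update rule: it uses the update $h_{t+1} = g\,h_t + (1-g)\tanh(w\,h_t)$ with a constant gate $g$, and the bias value 4.0, the weight `w`, and the initial state `h0` are hypothetical choices.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def end_to_end_gradient(steps: int, gate: float, w: float = 0.5, h0: float = 0.3) -> float:
    """Return d h_T / d h_0 for h <- gate * h + (1 - gate) * tanh(w * h).

    The gate is held constant (an assumption); in a GRU-style cell it would
    depend on the state and input.
    """
    h, grad = h0, 1.0
    for _ in range(steps):
        candidate = math.tanh(w * h)
        d_candidate = w * (1.0 - candidate * candidate)   # derivative of tanh(w * h)
        grad *= gate + (1.0 - gate) * d_candidate         # chain rule for one step
        h = gate * h + (1.0 - gate) * candidate
    return grad


identity_biased = end_to_end_gradient(steps=50, gate=sigmoid(4.0))  # gate close to 1
ungated = end_to_end_gradient(steps=50, gate=0.0)                   # plain recurrence

print(f"identity-biased: {identity_biased:.3f}, ungated: {ungated:.2e}")
```

With the gate near 1, each step's local derivative stays close to 1, so the product over 50 steps remains of order one. Without the gate, each factor is at most `w`, and the product shrinks geometrically.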

Each approach is complemented by explicit curriculum learning and, in neural-symbolic setups, a search budget and constraints to prevent degenerate or non-compositional traces (Chang et al., 2018, Chen et al., 2020).

5. Empirical Characterizations and Quantitative Results

Performance under depth generalization is evaluated by partitioning train and test sets by compositional depth and measuring accuracy as a function of test depth versus the maximum depth seen during training:

Productivity ("depth extrapolation"): Models are trained on depths up to a fixed maximum and tested on strictly greater depths. DRT and NeSS exhibit near-100% accuracy at test depths well above the training range (Chen, 23 Mar 2026, Chen et al., 2020).
Systematicity and substitutivity: Generalization to new compositions or substitutions. Standard architectures fail here, with F1 scores dropping with each increment in depth or under unseen substitutions (Klinger et al., 2020).
Tradeoff of depth vs width (parameter budget): Under a fixed total parameter budget, adding depth yields sharply increasing OOD generalization over the first few layers, but returns diminish rapidly thereafter. Marginal accuracy gains become negligible, and in some regimes deeper models can even degrade (Petty et al., 2023).
Practical recommendations: Under latency or computation constraints, the best compositional depth generalization is achieved at shallow-to-moderate depth, with width increased to maintain the parameter budget (Petty et al., 2023).
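The depth-partitioned evaluation protocol can be sketched as a small harness. The Python below is illustrative only: the example format (dicts with `input`, `label`, `depth`), the bracket-nesting task, and the `capped_predictor` (a stand-in for a model whose effective depth saturates at 3) are assumptions, not any cited setup.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def split_by_depth(examples: List[dict], max_train_depth: int) -> Tuple[List[dict], List[dict]]:
    """Train on depths <= max_train_depth, test on strictly greater depths."""
    train = [e for e in examples if e["depth"] <= max_train_depth]
    test = [e for e in examples if e["depth"] > max_train_depth]
    return train, test


def accuracy_by_depth(examples: List[dict], predict: Callable[[str], int]) -> Dict[int, float]:
    """Report accuracy separately for each compositional depth."""
    hits, totals = defaultdict(int), defaultdict(int)
    for e in examples:
        totals[e["depth"]] += 1
        hits[e["depth"]] += int(predict(e["input"]) == e["label"])
    return {d: hits[d] / totals[d] for d in sorted(totals)}


def capped_predictor(text: str, cap: int = 3) -> int:
    """Toy predictor: tracks bracket nesting but saturates at `cap` levels."""
    depth = best = 0
    for ch in text:
        depth += 1 if ch == "(" else -1
        best = max(best, depth)
    return min(best, cap)


# Task: predict the maximum bracket nesting depth of the input string.
examples = []
for d in range(1, 7):
    for prefix in ("", "()"):
        examples.append({"input": prefix + "(" * d + ")" * d, "label": d, "depth": d})

train, test = split_by_depth(examples, max_train_depth=3)
train_report = accuracy_by_depth(train, capped_predictor)
test_report = accuracy_by_depth(test, capped_predictor)
print(train_report, test_report)
```

Reporting accuracy per depth, rather than a single aggregate, is what makes a collapse beyond the training range visible: the toy predictor is perfect on depths 1 to 3 and fails on every deeper example.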

6. Limitations and Open Challenges

Although recent architectures achieve marked improvements in compositional depth generalization, several core challenges persist:

Degradation with extreme depth: Even specialized neural models can collapse or degrade when depth grows far outside the training range or inductive bias (especially in vision tasks with ambiguous relational structure) (Klinger et al., 2020).
Architectural superfluity: Memory-augmented structures (e.g., Tree-SMU) impose overhead when depth is unnecessary or task structure is shallow (Arabshahi et al., 2019).
Search/combinatorial budget: Neural-symbolic models reliant on trace search or curriculum require careful tuning and can encounter combinatorial blowup at very large depths (Chen et al., 2020).
Inductive bias and domain gaps: Strong structural priors (e.g., adjacency masking in graphs) yield brittle frontiers, while missing or overly weak priors force networks to discover latent pointer-chasing algorithms and may yield only gradual progress (Chen, 23 Mar 2026).

A plausible implication is that achieving reliable compositional depth generalization across diverse domains will require further integration of symbolic structure, dynamic memory, curriculum learning, and strong architectural priors.

7. Connections to Theoretical Foundations and Future Directions

The central role of circuit depth in enabling compositional generalization has deep theoretical roots. Recurrence or stack-based recursion effectively raises a model out of the $\mathsf{TC}^0$ regime, allowing simulation of inherently sequential or context-free computations. The paradigm of "vertical" chain-of-thought reasoning, which unrolls a fixed computation primitive for variable or arbitrary depth, complements rather than supplants horizontal token-level generation in Transformers (Chen, 23 Mar 2026).

Future directions include automated curriculum learning, richer graph topologies for program induction, tighter type and grammar constraints for symbolic neural models, and further exploration of neural-executor hybrids capable of scaling to deeper and noisier domains (Chang et al., 2018, Chen et al., 2020).

In sum, compositional depth generalization is a well-formalized and practically critical capability, with clear theoretical and empirical boundaries, and a vibrant research trajectory exploring modular, recurrent, and neural-symbolic architectures to unlock true extrapolative generalization across reasoning depths.