
Depth-Recurrent Language Models

Updated 11 November 2025
  • Depth-recurrent language models are neural architectures that loop a shared recurrent block to decouple effective depth from parameter count, enabling flexible compute scaling.
  • They use specialized training paradigms such as recurrence curriculum and truncated backpropagation through time to expose models to variable depths while managing memory costs.
  • Empirical results show significant reasoning improvements and parameter efficiency over static transformers, though challenges remain in achieving interpretable multi-step latent reasoning.

Depth-recurrent language models are a class of neural architectures that scale computational depth at inference time by looping a shared bank of transformer layers, thereby decoupling effective test-time depth from both parameter count and training cost. This scheme stands in contrast to conventional stack-based transformers, where depth and inference cost are jointly determined by the number of unique layer parameters, and to chain-of-thought prompting, where reasoning steps are externalized as additional tokens. Depth recurrence enables flexible compute scaling, supports latent (non-textual) reasoning, and introduces new axes for architectural and efficiency innovation in natural language processing.

1. Architectural Foundations of Depth Recurrence

Depth-recurrent transformers typically decompose a language model into three sequential components:

  • Prelude (P): A shallow, fixed stack of transformer layers (e.g., 2–4) that maps token embeddings into local, contextual representations.
  • Recurrent Block (R): A small sequence of layers (e.g., 4–8) whose parameters are shared and looped for $r$ steps to increase effective model depth. The input to each looped pass is typically a combination of the prelude output and the previous recurrent state.
  • Coda (C): Another short stack (e.g., 2–4) of distinct layers plus projection/unembedding heads that map the final state to the output distribution.

Let $x \in V^n$ denote an input token sequence, $h$ the hidden state dimension, $r$ the recurrence depth, and $l_P$, $l_R$, $l_C$ the number of layers in each block, respectively. The forward computation is:

\begin{aligned}
e &= P(x) \\
s_0 &\sim \mathcal{N}(0, \sigma^2 I) \\
s_i &= R(e,\, s_{i-1}), \quad i = 1, \ldots, r \\
p &= C(s_r)
\end{aligned}

Here, $R$ embodies weight sharing over the $r$ recurrences, directly increasing the model's effective depth to $l_P + r \cdot l_R + l_C$ at a parameter cost of only $l_P + l_R + l_C$ layers.

The recurrence mechanism sometimes utilizes an adapter $W_a$ with $u_i = W_a [e; s_{i-1}]$, after which $u_i$ is processed by the shared block $\mathrm{Block}_R$. In Huginn-3.5B, the four recurrent blocks $R_1, \ldots, R_4$ are applied in round-robin order, with a small Gaussian noise vector injected at the initial pass (for $R_1$).
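
The decomposition above can be condensed into a short PyTorch-style sketch. This is a minimal illustration rather than the Huginn-3.5B implementation: the class and argument names (DepthRecurrentLM, n_prelude, n_rec, n_coda) are hypothetical, and nn.TransformerEncoderLayer stands in for a full causal decoder block.

import torch
import torch.nn as nn

class DepthRecurrentLM(nn.Module):
    """Minimal prelude / recurrent-block / coda sketch (hypothetical names)."""
    def __init__(self, vocab, h=512, n_prelude=2, n_rec=4, n_coda=2, sigma=0.02):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(h, nhead=8, batch_first=True)
        self.embed = nn.Embedding(vocab, h)
        self.prelude = nn.ModuleList(make() for _ in range(n_prelude))  # P
        self.rec = nn.ModuleList(make() for _ in range(n_rec))          # R (shared, looped)
        self.coda = nn.ModuleList(make() for _ in range(n_coda))        # C
        self.adapter = nn.Linear(2 * h, h)                              # W_a acting on [e; s]
        self.unembed = nn.Linear(h, vocab)
        self.sigma = sigma

    def forward(self, x, r=8):
        e = self.embed(x)
        for layer in self.prelude:
            e = layer(e)                                   # e = P(x)
        s = self.sigma * torch.randn_like(e)               # s_0 ~ N(0, sigma^2 I)
        for _ in range(r):                                 # loop the shared block r times
            u = self.adapter(torch.cat([e, s], dim=-1))    # u_i = W_a [e; s_{i-1}]
            for layer in self.rec:
                u = layer(u)
            s = u                                          # s_i = R(e, s_{i-1})
        for layer in self.coda:
            s = layer(s)
        return self.unembed(s)                             # p = C(s_r)

The same weights can be run with a larger r at test time than was typical during training, which is the train-test compute decoupling discussed in Section 5.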

Positional encodings (absolute or rotary) remain fixed throughout the looping process. No additional memory or gating structures are introduced unless explicitly modified.

2. Depth-Recurrent Training Paradigms

Depth-recurrent models require specialized pretraining approaches to enable robust behavior across varying $r$:

  • Recurrence Curriculum: The number of recurrent passes per example is sampled from a heavy-tailed distribution (e.g., Poisson-lognormal) whose mean increases gradually over the course of training. This exposes the model to depths between 1 and $R_{\text{max}}$ (often ramped linearly or via $1-\sqrt{\cdot}$ schedules); a sampling sketch follows this list.
  • Truncated BPTT: Backpropagation is computed only through the last $K$ recurrence steps (e.g., $K=8$), with earlier states treated as frozen (detached, as in the pseudocode below). This bounds memory usage regardless of $r$.
  • Adapter Initialization and Weight Surgery: For retrofitting onto pretrained transformers, intermediate layers are partitioned into P/R/C roles. Simple linear adapters are added to interface the prelude and recurrent outputs, and all adapters use "scalable initialization." Model parameters outside the adapters or the most recent recurrences can be held fixed, reducing optimization cost (a retrofit sketch follows the pseudocode below).
  • Mixed Precision and Optimizers: Mixed-precision (bfloat16), FlashAttention, and optimizers such as Muon (second-order-inspired) have been used to stabilize and accelerate training.
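
As noted in the recurrence-curriculum bullet, the per-example depth can be drawn from a Poisson-lognormal distribution with a ramped mean. The helper below is a hedged sketch of the sample_poisson_lognormal call used in the pseudocode that follows; its parameterization (sigma, clipping range) is an assumption, not the published recipe.

import numpy as np

_rng = np.random.default_rng(0)

def sample_poisson_lognormal(mu, sigma=0.5, r_max=32):
    """Heavy-tailed depth sampler (illustrative parameterization): draw a lognormal
    rate with mean approximately mu, then a Poisson count, clipped to [1, r_max]."""
    rate = _rng.lognormal(mean=np.log(max(mu, 1e-6)) - 0.5 * sigma ** 2, sigma=sigma)
    return int(np.clip(_rng.poisson(rate), 1, r_max))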

Training Pseudocode:

for step in range(N_steps):
    x, target = next(data_iter)                      # training batch
    mu = min(1, step / (alpha * N_steps)) * R_max    # ramp the mean recurrence depth
    r = sample_poisson_lognormal(mu)                 # per-example depth r
    e = Prelude(x)                                   # e = P(x)
    s = sigma * normal_init(e.shape)                 # s_0 ~ N(0, sigma^2 I)
    for i in range(r):
        if i < r - K:                                # truncated BPTT: only the last
            s = s.detach()                           # K recurrences carry gradients
        u = adapter(concat(e, s))                    # u_i = W_a [e; s_{i-1}]
        s = RecBlock(u)                              # s_i = R(e, s_{i-1})
    p = Coda(s)                                      # logits p = C(s_r)
    loss = cross_entropy(p, target)
    loss.backward()                                  # gradients flow through last K loops only
    optimizer.step(); optimizer.zero_grad()
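
For the adapter-initialization and weight-surgery bullet above, the sketch below shows one way the retrofit step could look. It is heavily hedged: the contiguous partition heuristic, the function name retrofit_to_recurrent, and the near-identity adapter initialization are illustrative assumptions; "scalable initialization" in the source refers to a specific scheme that is not reproduced here.

import torch
import torch.nn as nn

def retrofit_to_recurrent(layers, hidden, n_prelude=2, n_rec=4, n_coda=2):
    """Hedged sketch of P/R/C weight surgery on a pretrained decoder stack."""
    prelude = nn.ModuleList(layers[:n_prelude])
    coda = nn.ModuleList(layers[-n_coda:])
    middle = layers[n_prelude:-n_coda]
    # pick n_rec evenly spaced middle layers to serve as the shared recurrent block
    idx = [round(i * (len(middle) - 1) / max(n_rec - 1, 1)) for i in range(n_rec)]
    rec = nn.ModuleList(middle[i] for i in idx)
    # linear adapter on [e; s]; initialized close to "pass the state through"
    adapter = nn.Linear(2 * hidden, hidden)
    with torch.no_grad():
        adapter.weight.zero_()
        adapter.weight[:, hidden:].copy_(torch.eye(hidden))
        adapter.bias.zero_()
    return prelude, rec, coda, adapter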

3. Probing Depth-Recurrent Models: Interpretability and Reasoning

Empirical characterization of depth-recurrent models employs several probing lenses:

  • Logit Lens: Projects normalized hidden states at each recurrence to the vocabulary via the unembedding matrix.
  • Coda Lens: Projects hidden states after an additional normalization and coda pass.

Researchers track the rank of key tokens (final answers or intermediates) across recurrence depths to search for latent chain-of-thought (CoT) dynamics, expecting a sequence of phase transitions (e.g., the rank of an "intermediate" token rising before that of the "final" token).
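
A minimal logit-lens-style probe can be written as below. The names ln_f (final layer norm) and unembed (unembedding matrix) are placeholders for the model's own modules; the function simply reports the rank of a target token at each recurrence step.

import torch

@torch.no_grad()
def token_rank_trajectory(states, ln_f, unembed, target_id, pos=-1):
    """Logit-lens probe sketch: normalize each recurrence step's hidden state,
    project it to the vocabulary, and record the rank of target_id at position pos.
    states is a list of [batch, seq, h] tensors collected across the r loops."""
    ranks = []
    for s in states:
        logits = unembed(ln_f(s[:, pos, :]))               # [batch, vocab]
        target_logit = logits[:, target_id].unsqueeze(-1)  # score of the probed token
        ranks.append((logits > target_logit).sum(dim=-1))  # 0 means the token is top-1
    return torch.stack(ranks, dim=0)                       # [r, batch] rank trajectory

The coda lens differs only in passing each state through the coda block before the final normalization and unembedding.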

Findings from Huginn-3.5B reveal:

  • Rank trajectories oscillate with the period of the recurrent block; discontinuities appear at specific layers (notably $R_4$).
  • Under the logit lens, certain blocks align with plausible numerical prefixes; others become uninterpretable. Under the coda lens, the interpretability pattern shifts, with no single lens yielding coherent latent CoT across blocks.
  • Composite operations (e.g., "(2×3)+1", with intermediate "6" and final answer "7"): the final token consistently dominates the intermediate in the rank trajectory, indicating no phase separation of the kind that would be consistent with classic CoT.

This suggests that, while (block-dependent) iterative refinement occurs, interpretable, compositional latent reasoning steps do not spontaneously emerge in current architectures (Lu et al., 2 Jul 2025).

4. Reasoning Performance and Empirical Scaling

Depth-recurrent models display competitive scaling characteristics on reasoning tasks as recurrence depth increases—sometimes dramatically outperforming fixed-depth baselines for the same training FLOPs (McLeish et al., 10 Nov 2025, Geiping et al., 7 Feb 2025). Key empirical results include:

Model                      Train FLOPs    GSM8K Acc. (%)    MATH Acc. (%)
TinyLlama-1.1B (static)    3×10^18        26.6              —
TinyLlama (4,8,4, r=32)    3×10^18        52.0              —
OLMo-2-1B (static)         3×10^18        25.1              —
OLMo-2-1B (4,6,4, r=32)    3×10^18        40.6              —

Performance generally increases with $r$; e.g., GSM8K accuracy for TinyLlama-1.1B (4,8,4) improves from 17.6% ($r=1$) to 45.0% ($r=32$). However, for Huginn-3.5B on GSM8K (8-shot, suppressed CoT), accuracy saturates at 4.9% even as $r$ grows to 256, while enabling explicit chain-of-thought prompting boosts strict accuracy to 24.9% and lenient accuracy to 38.1% (Lu et al., 2 Jul 2025).

This demonstrates that, although depth recurrence robustly increases effective compute and parameter efficiency, endogenous multi-step reasoning remains underdeveloped without further architectural or training intervention.

5. Computational and Practical Trade-offs

Depth recurrence introduces several practical and theoretical benefits:

  • Parameter Efficiency: Effective depth is decoupled from parameter count; for fixed memory, arbitrary depth is achievable by increasing recurrences.
  • Train-Test Compute Decoupling: Models can be trained with lower, variable depths and tested with higher depths, enhancing budget flexibility.
  • FLOPs Scalability: Inference FLOPs grow linearly in $r$, but memory remains bounded since all layer parameters are reused.
  • KV-Cache Optimization: A fixed-size key-value cache (overwritten via modulo addressing) suffices, promoting cache and memory efficiency (Geiping et al., 7 Feb 2025).
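
The fixed-size KV cache in the last bullet can be pictured with a small toy. The class below is an assumption-laden sketch, not the exact scheme of Geiping et al.: it only illustrates how indexing cache slots by recurrence step modulo a fixed budget keeps memory constant as $r$ grows.

import torch

class ModuloKVCache:
    """Toy fixed-size cache for a looped block: keys/values produced at recurrence
    step i are written to slot i % budget, so memory stays bounded for any r."""
    def __init__(self, budget, batch, heads, seq, head_dim):
        self.budget = budget
        self.k = torch.zeros(budget, batch, heads, seq, head_dim)
        self.v = torch.zeros(budget, batch, heads, seq, head_dim)

    def write(self, step, k, v):
        slot = step % self.budget   # overwrite the oldest recurrence's entries
        self.k[slot], self.v[slot] = k, v

    def read(self):
        # attention sees at most `budget` recurrence steps' worth of keys/values
        return self.k, self.v

Memory stays proportional to the budget regardless of how many recurrences are executed, which is what makes arbitrarily deep test-time unrolling feasible.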

Recent work draws a connection between recurrent-depth models and continuous diffusion processes (Geiping et al., 16 Oct 2025). By leveraging parallel "diffusion-forcing" samplers, these models achieve up to $5\times$ faster tokenwise decoding at comparable accuracy, via simultaneous refinement of latent states across positions and recurrences. The theoretical claim is that, for a fixed time budget, the sampler visits a strictly larger set of hidden states in parallel than classic autoregressive decoding.
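
The intuition behind the parallel sampler can be shown with a toy schedule, under the simplifying assumption that decoding advances a diagonal wavefront in which position p runs its recurrence step t - p at wall-clock pass t. This is only an illustration of the pipelining idea, not the diffusion-forcing sampler itself.

# Toy wavefront schedule: which (position, recurrence step) pairs run at each
# wall-clock pass, assuming r = 4 recurrences and a window of 4 in-flight positions.
r, window, n_pos = 4, 4, 8
for t in range(n_pos + r - 1):
    active = [(p, t - p) for p in range(n_pos) if 0 <= t - p < r][-window:]
    print(f"pass {t}: {active}")   # e.g. pass 3: [(0, 3), (1, 2), (2, 1), (3, 0)]

Once the pipeline is full, a position finishes its final recurrence on almost every pass, versus one finished position per $r$ passes under strict autoregressive decoding; this is the source of the reported tokenwise speedups.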

Empirical benchmarks (Huginn-0125, 3.5B parameters, $r=32$) illustrate the speedups at comparable accuracy:

Task                  AR (tok/s, acc.)    Diffusion-forcing (tok/s, acc.)
GSM8K (8-shot CoT)    36.1 (41.8%)        182 (42.1%)
MATH500               6.4 (17.6%)         35.9 (18.0%)
HumanEval             13.5 (22.6%)        67.4 (20.1%)
MBPP                  15.3 (31.6%)        92.3 (27.8%)

A plausible implication is that depth-recurrent models, equipped with diffusion-inspired parallel samplers, combine the parallel decoding strength of diffusion models with the causal structure of autoregressive approaches.

6. Theoretical and Empirical Limits

Despite their potential, depth-recurrent LLMs face substantial interpretability and reasoning limitations:

  • Latent Reasoning Structure: No compelling evidence of discrete, structured, phase-separated latent CoT trajectories was found in direct probes (Lu et al., 2 Jul 2025).
  • Oscillatory Hidden Dynamics: Representation trajectories exhibit periodic discontinuities determined by block position, and lens-dependent interpretability complicates mechanistic understanding.
  • Marginal Reasoning Gains (Unstructured Recurrence): On multi-step reasoning (GSM8K), additional recurrences beyond a moderate threshold yield plateauing gains (<2 percentage points), well below CoT-prompting baselines.

Reported strengths include remarkable parameter efficiency and minor iterative refinement on simple tasks; shortcomings center on the lack of emergent interpretable intermediates and phase separation in latent space, even with large increases in effective depth.

Potential remedies discussed in recent literature include:

  • Incorporating learned gating or recurrence-aware positional encodings to break symmetry and mark progression through recurrences.
  • Hybridizing with explicit chain-of-thought prompting or post-hoc integration of external reasoning steps.
  • Advanced mechanistic probing (e.g., activation patching, mediation analysis) to clarify internal transition dynamics across depth (Lu et al., 2 Jul 2025).

7. Extensions, Relations, and Outlook

Depth-recurrent models are situated at the intersection of universal transformers, recurrent neural architectures (including LSTMs and RNNs), and future scaling approaches:

  • Universal Transformers: Early universal transformer works looped all transformer blocks with parameter sharing, motivated by Turing completeness and algorithmic capacity (Geiping et al., 16 Oct 2025).
  • Relation to RNNs: RNNs/LSTMs encode hierarchical languages with bounded stack depth and generalize well on long strings if nesting is shallow, providing theoretical underpinnings for recurrent computation in language (Bhattamishra et al., 2020).
  • Diffusion LLMs: The parallel between recurrent “denoising” in depth and diffusion processes opens new algorithmic and efficiency frontiers for both sampling and interpretation (Geiping et al., 16 Oct 2025).
  • Model Surgery and Retrofitting: Techniques for converting pretrained stack-based transformers to depth-recurrent form enable flexible deployment and parameter reuse (McLeish et al., 10 Nov 2025).

Future directions include the exploration of adaptive compute depth per token, the inclusion of heterogeneous or expert layers in the recurrence loop, and systematic mechanistic analysis of latent depth trajectories. Robust internalization of multi-step reasoning, or emergence of interpretable latent chains-of-thought, remains an open challenge.


In sum, depth-recurrent LLMs offer novel and practical mechanisms for compute scaling, memory efficiency, and adaptation to resource budgets. Empirically, they can deliver substantial improvements over static-depth baselines in mathematical reasoning, but fundamental questions about their ability to internalize and structure complex reasoning steps within latent space remain unresolved. Continued methodological innovation in training regimens, architectural asymmetries, probing techniques, and hybrid reasoning strategies will be central to fully realizing their theoretical promise.
