Depth-Recurrent Attention Mixtures (Dreamer)

Updated 4 July 2026

The paper reports that combining depth recurrence with latent reasoning reduces the training tokens needed to reach a given accuracy by 2–8× under matched FLOPs, parameter count, and memory usage.
The architecture uses a single recurrent layer augmented with depth attention, which mitigates the hidden-size bottleneck on intermediate computation.
Sparse expert attention scales parameter capacity without dense compute, and the resulting models improve on matched baselines across math reasoning benchmarks.

Depth-Recurrent Attention Mixtures, or Dreamer, is a language-model architecture organized around latent reasoning in which difficult reasoning is carried out through many internal computation steps in latent space along a separate depth dimension rather than by emitting long chains of natural-language tokens (Knupp et al., 29 Jan 2026). Its central design combines depth recurrence, depth attention, and sparse expert attention in a modular framework intended to address two bottlenecks of many-step latent computation: the repeated execution cost of large layers and the restricted capacity of a fixed-width hidden state to store intermediate computation (Knupp et al., 29 Jan 2026). Under jointly matched FLOPs, parameter count, and memory usage, the paper reports that Dreamer models require 2–8× fewer training tokens to reach the same accuracy as matched state of the art and can outperform approximately 2× larger matched baselines trained on the same number of tokens (Knupp et al., 29 Jan 2026).

1. Conceptual basis and problem setting

The paper motivates Dreamer by contrasting chain-of-thought reasoning with latent reasoning. In standard LLMs, chain-of-thought pushes reasoning into the sequence dimension, requiring the model to emit and process many extra tokens. The paper characterizes this as computationally costly at both training and inference time and as constraining reasoning to discrete natural language. Dreamer instead pursues latent reasoning through depth recurrence, where the same computation is repeatedly applied across internal depth steps rather than allocating additional reasoning to longer token sequences (Knupp et al., 29 Jan 2026).

Depth recurrence is defined by replacing a conventional layered update

$x_{l+1} = f_l(x_l), \qquad l = 1,\dots,L$

with a shared update

$x_{l+1} = f(x_l), \qquad l = 1,\dots,L.$

The intended consequence is that the model can “think longer” by iterating a shared module over depth without increasing the number of distinct parameters. The paper presents this as analogous to recurrence in RNNs, except that the recurrence is over depth steps rather than sequence time steps (Knupp et al., 29 Jan 2026).
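The difference between the two updates can be made concrete with a minimal Python sketch. The function names and the toy scalar block are illustrative assumptions, not code from the paper; the only point is that the recurrent form reuses one function $f$ at every depth step, so depth can grow without adding parameters.

```python
from typing import Callable, Sequence

State = float  # toy stand-in for a hidden-state vector (assumption)


def layered_forward(x: State, layers: Sequence[Callable[[State], State]]) -> State:
    """Conventional stack: a distinct function f_l at every depth l."""
    for f_l in layers:
        x = f_l(x)
    return x


def recurrent_forward(x: State, f: Callable[[State], State], depth: int) -> State:
    """Depth recurrence: the same function f applied `depth` times."""
    for _ in range(depth):
        x = f(x)
    return x


def toy_block(x: State) -> State:
    # Hypothetical shared block; any state-to-state map would do here.
    return 0.5 * x + 1.0


# Raising `depth` adds computation, but toy_block's parameters are unchanged.
shallow = recurrent_forward(1.0, toy_block, depth=3)
deep = recurrent_forward(1.0, toy_block, depth=16)
```

Passing the same block three times to `layered_forward` gives the same result as `recurrent_forward` with `depth=3`, which is the sense in which depth recurrence is a layered model with tied parameters.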

Three limitations of prior depth-recurrent work organize the paper’s problem statement. First, earlier comparisons often did not jointly match FLOPs per token, parameter count, and memory usage, which made it difficult to distinguish architectural gains from resource increases. Second, partially shared architectures with fixed outer stacks are said to underuse depth recurrence by restricting depth-generalized latent reasoning. Third, earlier depth-recurrent designs are described as suffering from a hidden-size bottleneck: as depth increases, more intermediate reasoning must be compressed into a constant-width hidden state. Dreamer’s response is a fully depth-recurrent single-layer core augmented with depth attention and sparse expert attention (Knupp et al., 29 Jan 2026).

A key terminological distinction in the paper is between latent reasoning depth and the number of Transformer layers. In Dreamer’s main depth-recurrent variants, the number of layers is 1, while latent reasoning depth is 16 or 32 recurrent iterations. This separates internal reasoning steps from the number of distinct parameterized blocks (Knupp et al., 29 Jan 2026).

2. Architectural organization

Dreamer is presented as a framework combining three orthogonal attention dimensions: sequence attention (SA), depth attention (DA), and expert attention (EA). Sequence attention is standard causal self-attention over tokens; depth attention is attention over previous depth states of the same token; expert attention is a sparse MoE-style attention over experts. The paper’s guiding claim is that information should be directly accessible along all three axes: across tokens, across latent reasoning steps, and across parameterized experts (Knupp et al., 29 Jan 2026).

The canonical sequential composition is

$\begin{aligned} y_{\mathrm{DA},l} &= x_l + \mathrm{DA}(x_{:l}), \\ y_{\mathrm{SA},l} &= y_{\mathrm{DA},l} + \mathrm{SA}(y_{\mathrm{DA},l}), \\ y_l &= y_{\mathrm{SA},l} + \mathrm{EA}(y_{\mathrm{SA},l}). \end{aligned}$

For throughput, the experiments use a partially parallel form,

$\begin{aligned} y_{\mathrm{DA+SA},l} &= x_l + \mathrm{DA}(x_{:l}) + \mathrm{SA}(x_l), \\ y_l &= y_{\mathrm{DA+SA},l} + \mathrm{EA}_l(y_{\mathrm{DA+SA},l}), \end{aligned}$

and a fully parallel variant is also described. The authors report that each added parallelization step yielded a speedup of about 15% but raised benchmark error rates by up to 10–20% in small $\sim$1B-scale tests, which is why the partially parallel version was retained (Knupp et al., 29 Jan 2026).
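The two compositions differ only in what each sub-module reads. The sketch below is an illustration under stated assumptions rather than the paper's implementation: `da`, `sa`, and `ea` are hypothetical stand-in callables over scalar states, and the list passed to `da` is assumed to end with the current state $x_l$. In the sequential form SA reads the output of DA, while in the partially parallel form DA and SA both read $x_l$ and only EA reads their sum.

```python
from typing import Callable, List

State = float  # toy scalar state standing in for a hidden vector (assumption)
DepthFn = Callable[[List[State]], State]
PointFn = Callable[[State], State]


def sequential_step(x_states: List[State], da: DepthFn, sa: PointFn, ea: PointFn) -> State:
    """y_DA = x_l + DA(x_{:l}); y_SA = y_DA + SA(y_DA); y_l = y_SA + EA(y_SA)."""
    x_l = x_states[-1]  # assumption: the last stored state is x_l itself
    y_da = x_l + da(x_states)
    y_sa = y_da + sa(y_da)
    return y_sa + ea(y_sa)


def partially_parallel_step(x_states: List[State], da: DepthFn, sa: PointFn, ea: PointFn) -> State:
    """y_{DA+SA} = x_l + DA(x_{:l}) + SA(x_l); y_l = y_{DA+SA} + EA(y_{DA+SA})."""
    x_l = x_states[-1]
    y = x_l + da(x_states) + sa(x_l)  # DA and SA are independent of each other
    return y + ea(y)


# Hypothetical stand-ins chosen so the arithmetic is easy to follow.
toy_da = lambda states: sum(states) / len(states)
toy_sa = lambda x: 0.5 * x
toy_ea = lambda y: 0.25 * y

seq_out = sequential_step([1.0, 3.0], toy_da, toy_sa, toy_ea)
par_out = partially_parallel_step([1.0, 3.0], toy_da, toy_sa, toy_ea)
```

Because `da` and `sa` have no data dependency in the second form, they could be evaluated concurrently, which is the source of the throughput gain described above.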

The paper distinguishes three named variants:

Variant	Architecture	Distinctive property
LA	classically layered baseline, non-depth-recurrent, sparse MoE-based	no depth recurrence
DR	Dreamer without depth attention	single depth-recurrent layer
DR+DA	Dreamer with depth attention	adds depth attention

In the main Dreamer configurations, there is 1 layer iterated for 16 or 32 depth steps. A notable nuance is that naïve full parameter sharing of all attention projections across depth performed poorly. To preserve full depth recurrence while avoiding this failure mode, the paper turns the fused QKV projection and multi-head attention aggregation into lightweight sparse MoEs of linear experts inside the attention modules themselves. Thus, the recurrent object remains effectively a single layer, but the attention projections and aggregations receive sparse conditional variation across depth (Knupp et al., 29 Jan 2026).

Implementation-specific details further clarify the design. Sequence attention uses RoPE and RMSNorm. Depth attention is implemented with one head and a depth-position RoPE scheme in which half of RoPE is reversed from maximum depth. The paper presents this as an efficiency-oriented choice intended to keep memory movement overhead small while preserving a positional signal along the depth axis (Knupp et al., 29 Jan 2026).

3. Mathematical formulation and routing mechanisms

The paper defines single-head attention as

$\operatorname{Attn}(Q, K, V) = \sigma\left(\frac{QK^T}{\sqrt{k}}\right) V,$

where $Q \in \mathbb{R}^{m \times k}$, $K \in \mathbb{R}^{m \times k}$, and $V \in \mathbb{R}^{m \times v}$. For sequence attention, $m$ is the sequence length; for depth attention, $m$ is the depth. This formalism makes the paper’s central analogy explicit: depth is treated as another attention axis rather than merely as repeated residual processing (Knupp et al., 29 Jan 2026).
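As a minimal sketch of the single-head formula, the following pure-Python function computes $\sigma(QK^T/\sqrt{k})V$ for small matrices. It assumes that $\sigma$ denotes a row-wise softmax and omits any causal mask; the helper names are illustrative. The same function serves both axes: the $m$ rows can be tokens (sequence attention) or depth steps (depth attention).

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attn(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head attention sigma(Q K^T / sqrt(k)) V, with sigma taken as softmax."""
    key_dim = len(k[0])
    out = []
    for q_row in q:
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(key_dim)
            for k_row in k
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))]
        )
    return out


# A zero query scores every key equally, so the output is the mean of the value rows.
uniform = attn([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```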

Because residual scales can drift across recurrent depth, Dreamer normalizes the recurrent residual stream via

$x_{l+1} = \mathrm{Norm}(y_l),$

with RMSNorm. The paper presents this normalization as important for learning depth-generalized experts under repeated application of a shared layer (Knupp et al., 29 Jan 2026).
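The effect of normalizing the recurrent stream can be illustrated with a small sketch. The RMSNorm below omits the learned gain, and `drifting_block` is a hypothetical block whose output scale doubles at every step; both are assumptions made for illustration only.

```python
import math
from typing import List


def rms_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x so its root-mean-square is approximately 1 (no learned gain)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def drifting_block(x: List[float]) -> List[float]:
    # Hypothetical block whose output scale doubles on every application.
    return [2.0 * v for v in x]


state = [3.0, 4.0]
unnormalized = state
normalized = rms_norm(state)
for _ in range(8):
    unnormalized = drifting_block(unnormalized)        # scale grows as 2**depth
    normalized = rms_norm(drifting_block(normalized))  # scale is reset each step
```

After eight steps the unnormalized state has grown by a factor of 256, while the normalized state still has a root-mean-square close to 1, so the shared block sees inputs of comparable scale at every depth.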

Expert attention is formulated as sparse routing over experts. The sparse activation has three ingredients: one term provides the score magnitudes, a second is used for top-$k$ selection, and a third renormalizes the selected scores. A balancing bias is updated from a count of how often each expert was used since the last update. The paper describes this as similar in spirit to DeepSeek-V3 routing, but without introducing an auxiliary routing loss (Knupp et al., 29 Jan 2026).
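A minimal sketch of bias-balanced top-$k$ routing in this spirit is given below. It is not the paper's exact rule: the function names, the use of the bias only during selection, and the fixed-step update toward the mean usage count are all assumptions, modeled loosely on auxiliary-loss-free balancing.

```python
from typing import List, Tuple


def route_top_k(scores: List[float], bias: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Pick k experts by biased score, then gate with renormalized raw scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = sorted(order[:k])
    total = sum(scores[i] for i in chosen)
    gates = [scores[i] / total for i in chosen]  # the bias does not enter the gate values
    return chosen, gates


def update_bias(bias: List[float], counts: List[int], step: float = 0.1) -> List[float]:
    """Assumed rule: nudge the bias up for under-used experts, down for over-used ones."""
    target = sum(counts) / len(counts)
    new_bias = []
    for b, c in zip(bias, counts):
        if c > target:
            new_bias.append(b - step)
        elif c < target:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


# Without bias, experts 1 and 2 win; a positive bias on expert 0 pulls it into the top 2.
plain_idx, plain_gates = route_top_k([0.1, 0.6, 0.3], [0.0, 0.0, 0.0], k=2)
biased_idx, biased_gates = route_top_k([0.1, 0.6, 0.3], [0.6, 0.0, 0.0], k=2)
```

Keeping the bias out of the gate values means balancing changes which experts are chosen without distorting how their outputs are weighted, and no extra loss term is needed.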

To improve shared attention projections under depth recurrence, the attention projections and aggregation are themselves implemented as lightweight sparse MoEs with linear experts and top-1 routing. A shared always-active expert is introduced alongside the routed expert, and the formulation applies a stop-gradient through the routing score. Because the experts are linear, the combined expression can be rewritten so that the shared expert is folded into the routable experts at inference time, removing extra compute and memory overhead (Knupp et al., 29 Jan 2026).
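The folding step relies only on linearity: applying two matrices to the same input and adding the results equals applying their sum once. The sketch below assumes that the shared and routed outputs are simply summed in the forward pass; the paper's exact expression, including its stop-gradient term, is not reproduced here, and all names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Apply a linear expert (a weight matrix) to an input vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def mat_add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fold_shared(shared: Matrix, routed: List[Matrix]) -> List[Matrix]:
    """Precompute (shared + expert_i) once, so inference needs one matvec per token."""
    return [mat_add(shared, w) for w in routed]


shared = [[1.0, 0.0], [0.0, 1.0]]
routed = [[[2.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]]]
x = [3.0, 4.0]

folded = fold_shared(shared, routed)
two_pass = [a + b for a, b in zip(matvec(shared, x), matvec(routed[1], x))]
one_pass = matvec(folded[1], x)
```

Here `two_pass` and `one_pass` agree, which is why the shared expert adds no inference-time compute or memory once it has been folded in.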

The training objective is not presented as a new architecture-specific loss. The paper states that Dreamer uses standard language modeling, and the objective appears to be autoregressive next-token prediction. No additional auxiliary routing loss is introduced; balancing is handled by the bias-update mechanism. This indicates that the paper treats routing stability as an architectural and optimization problem rather than as a separate supervision problem (Knupp et al., 29 Jan 2026).

4. Hidden-size bottleneck, depth memory, and decoupled scaling

One of the paper’s most important conceptual claims is that plain depth recurrence does not by itself solve latent reasoning. In a depth-recurrent model without DA, all intermediate reasoning across many recurrent steps must be compressed into the current hidden state $x_l$, whose width is fixed. The paper describes this as a hidden-size bottleneck analogous to the bottleneck of classic RNN hidden states. Just as sequence self-attention gave each token direct access to previous token states, Dreamer’s depth attention gives each depth step direct access to previous depth states of the same token (Knupp et al., 29 Jan 2026).

Depth attention is therefore presented as a memory over intermediate latent reasoning steps:

$y_{\mathrm{DA},l} = x_l + \mathrm{DA}(x_{:l}),$

where $x_{:l}$ denotes hidden states from earlier depths. The paper argues that this creates an external memory over depth and alleviates the need to repeatedly overwrite a single fixed-width vector. The reported DA heatmaps are interpreted as evidence that DA is not merely acting like uniform skip connections: early depths look back strongly to the first few depths, later depths rely more on middle or higher depths, and final outputs depend on middle-to-high depths (Knupp et al., 29 Jan 2026).
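Depth attention as a memory over earlier depth states can be sketched for a single token as follows. This is a simplified illustration under several assumptions: there are no learned projections (each stored state acts as both key and value), there is no depth-position encoding, the memory includes the current state, and `f` is a hypothetical shared block.

```python
import math
from typing import Callable, List

Vec = List[float]


def depth_attention(x_l: Vec, memory: List[Vec]) -> Vec:
    """Attend from the current state over stored depth states (keys = values = states)."""
    dim = len(x_l)
    scores = [sum(a * b for a, b in zip(x_l, m)) / math.sqrt(dim) for m in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * m[j] for e, m in zip(exps, memory)) for j in range(dim)]


def run_depth(x0: Vec, f: Callable[[Vec], Vec], depth: int) -> List[Vec]:
    """Iterate a shared block f, letting each step read all earlier depth states."""
    memory = [x0]
    x = x0
    for _ in range(depth):
        y_da = [a + b for a, b in zip(x, depth_attention(x, memory))]  # x_l + DA(x_{:l})
        x = f(y_da)
        memory.append(x)  # earlier states stay readable instead of being overwritten
    return memory


states = run_depth([1.0, 0.0], lambda v: [0.5 * c for c in v], depth=4)
```

The list `memory` grows with depth rather than with hidden width, which is the sense in which depth attention relieves the fixed-width bottleneck.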

The scaling argument is equally central. Dreamer is said to decouple scaling dimensions that are entangled in conventional Transformers. In the paper’s formulation, parameter count scales via the number and size of experts, FLOPs scale via sparsity and active experts, reasoning depth scales via the number of recurrent iterations, and latent memory size scales via depth attention context over intermediate depth states. This suggests an architecture in which more reasoning steps do not require more distinct parameters, more parameter capacity does not require dense compute, and more latent memory does not require larger hidden width (Knupp et al., 29 Jan 2026).

The paper’s resource-matching protocol is unusually strict. It matches FLOPs per token, parameter count, and memory usage using a bivariate coordinate descent procedure: first adjusting EA intermediate MLP size until FLOPs match a baseline, then adjusting the number of experts until parameter count matches, and then repeating FLOP matching because routing cost depends slightly on the number of experts. The resulting matches are very tight. For depth 16, LA uses 1.1708B params, 0.9389B FLOPs/token, and 4.7827 GB memory; DR uses 1.1704B params, 0.9389B FLOPs/token, and 4.7822 GB memory; DR+DA uses 1.1708B params, 0.9413B FLOPs/token, and 4.7839 GB memory. For depth 32, LA uses 2.0790B params, 1.6150B FLOPs/token, and 6.6337 GB memory; DR uses 2.0794B params, 1.6199B FLOPs/token, and 6.6343 GB memory; DR+DA uses 2.0788B params, 1.6130B FLOPs/token, and 6.6349 GB memory (Knupp et al., 29 Jan 2026).
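The alternating matching procedure can be sketched with toy cost models. Everything numeric below is an assumption: `flops_per_token` and `param_count` are simplified stand-ins rather than the paper's accounting, memory matching is omitted, and the number of rounds is arbitrary. The structure follows the description above: match FLOPs via MLP size, match parameters via expert count, then re-match FLOPs because the router cost moved.

```python
from typing import Tuple

HIDDEN = 64  # toy hidden size (assumption)
ACTIVE = 2   # toy number of active experts per token (assumption)
SEARCH = range(1, 1024)  # candidate values for both knobs (assumption)


def flops_per_token(mlp_size: int, n_experts: int) -> int:
    """Toy cost: active expert MLPs plus a router term that grows with n_experts."""
    return 2 * ACTIVE * HIDDEN * mlp_size + HIDDEN * n_experts


def param_count(mlp_size: int, n_experts: int) -> int:
    """Toy cost: all expert MLPs plus the router."""
    return 2 * n_experts * HIDDEN * mlp_size + HIDDEN * n_experts


def match_resources(target_flops: int, target_params: int,
                    mlp_size: int = 10, n_experts: int = 8,
                    rounds: int = 3) -> Tuple[int, int]:
    """Alternate between the two knobs until both budgets are closely matched."""
    for _ in range(rounds):
        # Step 1: adjust the expert MLP size until FLOPs match.
        mlp_size = min(SEARCH, key=lambda m: abs(flops_per_token(m, n_experts) - target_flops))
        # Step 2: adjust the number of experts until the parameter count matches.
        n_experts = min(SEARCH, key=lambda e: abs(param_count(mlp_size, e) - target_params))
    # Step 3: re-match FLOPs, since the router cost shifted with n_experts.
    mlp_size = min(SEARCH, key=lambda m: abs(flops_per_token(m, n_experts) - target_flops))
    return mlp_size, n_experts


# Targets generated from a known configuration, so the search has an exact answer.
matched = match_resources(flops_per_token(100, 50), param_count(100, 50))
```

In this toy setting the search recovers the configuration that generated the targets; with real cost models the match would only be approximate, as in the tight but not identical figures reported above.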

The paper also emphasizes the memory profile of DA. Because DA attends over depth rather than over the whole token sequence, and because its KV cache can be overwritten after each token, the memory overhead is described as constant with respect to sequence length. This is presented as one of the main practical arguments for adding DA to a recurrent-depth LLM (Knupp et al., 29 Jan 2026).
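The constant-memory argument can be checked with a small simulation. The sketch assumes one cache entry per depth step and a cache that is reset whenever a new token starts; the integer placeholders stand in for real key/value tensors.

```python
def peak_da_cache(seq_len: int, depth: int) -> int:
    """Largest number of DA cache entries held at any time while processing a sequence."""
    peak = 0
    for _ in range(seq_len):
        cache = []  # overwritten for every new token
        for step in range(depth):
            cache.append(step)  # placeholder for this depth step's key/value
            peak = max(peak, len(cache))
    return peak


short_peak = peak_da_cache(seq_len=10, depth=16)
long_peak = peak_da_cache(seq_len=1000, depth=16)
```

Both calls report the same peak, equal to the depth, whereas a sequence-attention cache would grow with the number of tokens.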

5. Experimental evaluation

The empirical evaluation centers on mathematical reasoning benchmarks and math-related language modeling. Reasoning tasks include GSM8K (0-shot), MATH (0-shot), MathQA (0-shot), and the math subset of MMLU, comprising abstract algebra, elementary mathematics, high school mathematics, college mathematics, and high school statistics. Language modeling is evaluated with Maths-College perplexity (Knupp et al., 29 Jan 2026).

The experiments compare LA, DR, and DR+DA at two scales: a depth-16 regime of about 1B parameters and a depth-32 regime of about 2B parameters. The reported hyperparameters cover hidden size, SA query heads, SA KV heads, head dimension, DA heads, DA KV heads, DA dimension, EA active experts, SA context length, a DA context length tied to the depth, batch size in tokens, and bfloat16 precision. After matching, expert counts are 32 for both LA variants, 517 and 1039 for DR at depths 16 and 32, and 537 and 1097 for DR+DA at depths 16 and 32. Training uses about 100B training tokens from openly available instruction datasets that are heavily math-focused, with cleaning, deduplication, decontamination, and removal of CoT traces to emphasize latent reasoning over verbalized long-form reasoning (Knupp et al., 29 Jan 2026).

At depth 16, the baseline LA(16) reports PPL 6.74, GSM8K 43.8, MATH 40.9, MMLU-math 26.6, MathQA 22.9, and Avg 32.3. DR(16) reports PPL 6.64, GSM8K 47.6, MATH 47.8, MMLU-math 44.5, MathQA 45.6, and Avg 45.7. DR+DA(16) reports PPL 6.41, GSM8K 51.7, MATH 50.7, MMLU-math 38.5, MathQA 37.1, and Avg 43.4. The paper notes that at this scale both DR variants strongly beat LA, but pure DR outperforms DR+DA on half of the reasoning benchmarks, which the authors attribute to the reduced MLP size needed to FLOP-match DR+DA (Knupp et al., 29 Jan 2026).

At depth 32, LA(32) reports PPL 6.31, GSM8K 49.7, MATH 48.2, MMLU-math 37.2, MathQA 26.0, and Avg 39.5. DR(32) reports PPL 6.12, GSM8K 51.5, MATH 51.3, MMLU-math 45.3, MathQA 43.2, and Avg 47.6. DR+DA(32) reports PPL 5.90, GSM8K 56.3, MATH 54.5, MMLU-math 50.0, MathQA 50.8, and Avg 51.4. At this larger depth, DR+DA is clearly the best model, which the paper interprets as support for the hidden-size bottleneck hypothesis: DA becomes more beneficial as latent reasoning depth increases (Knupp et al., 29 Jan 2026).
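The per-benchmark comparisons stated above can be checked directly from the reported accuracies. The snippet only restates numbers already given in the text; the helper name `wins` is illustrative.

```python
BENCHMARKS = ("GSM8K", "MATH", "MMLU-math", "MathQA")

# Accuracies as reported in the text above, in BENCHMARKS order.
RESULTS = {
    ("LA", 16): (43.8, 40.9, 26.6, 22.9),
    ("DR", 16): (47.6, 47.8, 44.5, 45.6),
    ("DR+DA", 16): (51.7, 50.7, 38.5, 37.1),
    ("LA", 32): (49.7, 48.2, 37.2, 26.0),
    ("DR", 32): (51.5, 51.3, 45.3, 43.2),
    ("DR+DA", 32): (56.3, 54.5, 50.0, 50.8),
}


def wins(a, b):
    """Benchmarks on which model a scores strictly higher than model b."""
    return [name for name, x, y in zip(BENCHMARKS, RESULTS[a], RESULTS[b]) if x > y]


dr_over_drda_16 = wins(("DR", 16), ("DR+DA", 16))
drda_over_dr_32 = wins(("DR+DA", 32), ("DR", 32))
```

At depth 16, pure DR leads DR+DA on MMLU-math and MathQA, which is two of the four benchmarks; at depth 32, DR+DA leads on all four.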

The paper also reports benchmark-level data efficiency through a “DE” metric measuring how many fewer training tokens are needed to reach LA’s best accuracy. Reported averages are 4.0× for DR(16), 2.9× for DR+DA(16), 2.6× for DR(32), and 3.5× for DR+DA(32). Specific examples include 8.3× for DR+DA(32) on MathQA, 5.4× for DR(16) on MathQA, 4.8× for DR(16) on MMLU-math, and 4.0× for DR+DA(32) on MMLU-math. One of the paper’s strongest comparative claims is that DR+DA with depth 16 outperforms LA with depth 32 on all reasoning benchmarks, supporting the claim that a roughly 1B Dreamer model can outperform a roughly 2B conventional sparse MoE baseline at equal training tokens (Knupp et al., 29 Jan 2026).
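One way to compute a data-efficiency ratio of this kind is sketched below. The exact definition used in the paper is not spelled out here, so the formula (baseline tokens at its best accuracy divided by the tokens the model needs to reach that accuracy) is an assumption, and the two learning curves are hypothetical numbers invented for illustration.

```python
from typing import List, Optional, Tuple

Curve = List[Tuple[float, float]]  # (training tokens in billions, accuracy)


def tokens_to_reach(curve: Curve, accuracy: float) -> Optional[float]:
    """First checkpoint token count at which the curve reaches the given accuracy."""
    for tokens, acc in curve:
        if acc >= accuracy:
            return tokens
    return None


def data_efficiency(baseline: Curve, model: Curve) -> Optional[float]:
    """Assumed DE: baseline tokens at its best accuracy / model tokens to match it."""
    best = max(acc for _, acc in baseline)
    base_tokens = tokens_to_reach(baseline, best)
    model_tokens = tokens_to_reach(model, best)
    if model_tokens is None:
        return None  # the model never reaches the baseline's best accuracy
    return base_tokens / model_tokens


# Hypothetical checkpoints, not results from the paper.
toy_baseline = [(25.0, 30.0), (50.0, 35.0), (100.0, 40.0)]
toy_model = [(25.0, 41.0), (50.0, 45.0), (100.0, 48.0)]
toy_de = data_efficiency(toy_baseline, toy_model)
```

With these toy curves the model matches the baseline's best accuracy at a quarter of the tokens, giving a ratio of 4.0. The resolution of such a ratio is limited by how often checkpoints are evaluated.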

6. Knowledge usage, nomenclature, and open questions

Beyond benchmark scores, the paper analyzes how Dreamer uses experts and intermediate depth states. For a depth-32 DR+DA model, the average DA scores over 1000 validation sequences show structured retrieval patterns rather than uniform depth mixing. The authors interpret this as targeted retrieval of intermediate computations. Expert usage is similarly structured. In the distribution of expert usage over depth, about 22% of experts prefer one or two depths in the upper half of the depth range, and the last depth has the most single-depth-dedicated experts. At the same time, about 50% of experts are used in at least 7 depths, indicating substantial knowledge reuse across depth. Using Lorenz curves over expert routing, the paper reports that DR uses 2–11× more experts per depth than LA can use, with a global Gini coefficient of about 0.075 (Knupp et al., 29 Jan 2026).
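For readers unfamiliar with the statistics involved, the sketch below computes a Lorenz curve and a Gini coefficient from per-expert usage counts using the standard rank-weighted formula. How the paper aggregates usage before computing its figures is not specified here, so the input format is an assumption.

```python
from typing import List, Sequence


def lorenz_curve(usage: Sequence[float]) -> List[float]:
    """Cumulative share of routing mass, experts sorted from least to most used.

    Assumes the total usage is positive.
    """
    values = sorted(usage)
    total = sum(values)
    running, points = 0.0, [0.0]
    for v in values:
        running += v
        points.append(running / total)
    return points


def gini(usage: Sequence[float]) -> float:
    """Gini coefficient of expert usage: 0 is perfectly even, near 1 is concentrated."""
    values = sorted(usage)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


even = gini([1, 1, 1, 1])          # every expert used equally
concentrated = gini([0, 0, 0, 4])  # all routing mass on one expert
```

A coefficient near zero, such as the reported value of about 0.075, corresponds to routing mass that is spread almost evenly over experts.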

These analyses support the paper’s interpretation of Dreamer as a system in which early latent steps use broadly reusable knowledge while later latent steps become more specialized. This suggests a division between depth-generalized reasoning subroutines and more depth-specific late-stage computations. The paper frames this as insight into “knowledge capacity allocation and reuse patterns across depths” rather than as a fully settled mechanistic explanation (Knupp et al., 29 Jan 2026).

A recurring source of confusion is the name Dreamer. Depth-Recurrent Attention Mixtures is not a Dreamer-family model-based RL world model. By contrast, NE-Dreamer keeps “Dreamer’s RSSM dynamics and imagination-based actor-critic” while replacing “same-step pixel reconstruction with next-embedding prediction using a causal temporal transformer” (Bredis et al., 3 Mar 2026). The language-model Dreamer of Knupp et al. (29 Jan 2026) instead studies depth-recurrent latent reasoning in decoder-only language modeling. A plausible implication is that the shared name reflects a loose emphasis on latent internal computation rather than a shared architectural lineage.

In relation to earlier architectures, the paper’s design also resembles a more language-model-centric continuation of recurrent and modular-attention ideas. BRIMs combines modularity, sparse activation, attention routing, and top-down/bottom-up integration in a bidirectional recurrent hierarchy (Mittal et al., 2020). Dreamer differs in being explicitly organized around fully depth-recurrent single-layer latent reasoning, depth attention as memory over intermediate reasoning steps, and sparse expert attention as the principal scaling mechanism (Knupp et al., 29 Jan 2026).

The paper identifies several open questions. Dynamic depth generalization is not fully studied, despite the architecture allowing it in principle. Reliability beyond the trained depth regime remains underexplored. Better position encodings than RoPE for depth may help. DA still has some memory movement overhead. Alternatives such as sliding-window, dilated, linear-attention, or SSM-style depth mechanisms are proposed as future directions. More rigorous scaling laws are still needed. The paper also notes a broader interpretability issue: latent reasoning is less interpretable than explicit CoT and may hide flawed internal reasoning (Knupp et al., 29 Jan 2026).

Taken together, Dreamer’s contribution is to treat latent reasoning as an architectural problem with separate solutions for reasoning depth, latent memory, and parameter capacity. Its central claim is that depth recurrence alone is insufficient: many-step latent reasoning also requires direct access to prior intermediate states through depth attention and economical access to large capacity through sparse expert attention (Knupp et al., 29 Jan 2026).