Encode–Think–Decode (ETD) for LLMs

Updated 30 January 2026
  • Encode–Think–Decode (ETD) is a method that divides transformer layers into encoder, thinker, and decoder blocks to enable multi-step latent reasoning.
  • ETD employs mid-training parameter tying and recursive processing to boost performance on reasoning benchmarks while preserving the original model architecture.
  • ETD integrates an adaptive ACT routing mechanism to dynamically control recursion depth, balancing computational cost and reasoning accuracy.

Encode–Think–Decode (ETD) is a method for amplifying the reasoning capabilities of LLMs through recursive processing of a select subset of reasoning-relevant layers. ETD is designed for causal-decoder transformer architectures, using an explicit division of model layers into encoder, thinker, and decoder blocks. Unlike standard chain-of-thought scaling or architectural modifications, ETD leverages interpretability findings that reasoning-relevant computation is concentrated in a narrow layer range. The approach involves mid-training parameter tying within the thinker block and test-time recursion to enable multi-step latent reasoning, yielding substantial performance improvements on established reasoning benchmarks while preserving the original architecture, parameter count, training data, and hyperparameters (Koishekenov et al., 8 Oct 2025).

1. Pipeline Structure and Layer Assignments

ETD decomposes a transformer of total depth L into three contiguous blocks:

  • Encoder (E): The first N_E layers map token embeddings into a latent space enriched with retrieved knowledge.
  • Thinker (T): The subsequent N_T layers form a core block whose parameters are tied and recursively applied k times during inference and mid-training.
  • Decoder (D): The final N_D layers map refined latent states to vocabulary logits for output prediction.

The full pipeline for a token sequence x ∈ ℝ^{T×d} proceeds as follows:

  • Encoding: h^0 = E(x), the output of the encoder layers.
  • Recursion: h^i = T(h^{i−1}) for i = 1, …, k.
  • Final decoding: logits = D(h^k), from which the next-token distribution is obtained.

A prototypical configuration for the OLMo-2 1B base model uses N_E = 7, N_T = 4, N_D = 5, denoted “7–4(*k)–5”. FLOPs per token scale with the effective depth N_E + k·N_T + N_D, so recursion increases inference cost linearly with k.
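The three-block pipeline can be sketched in plain Python. This is a minimal sketch, not the paper’s implementation: each “layer” is a hypothetical stand-in callable (here a scalar residual update) rather than a real transformer layer.

```python
def etd_forward(x, enc_layers, thinker_layers, dec_layers, k):
    """ETD forward pass: encode once, apply the tied thinker block
    k times, then decode. Each 'layer' is any callable h -> h."""
    h = x
    for layer in enc_layers:          # encoder: N_E layers, applied once
        h = layer(h)
    for _ in range(k):                # thinker: N_T tied layers, recursed k times
        for layer in thinker_layers:
            h = layer(h)
    for layer in dec_layers:          # decoder: N_D layers, applied once
        h = layer(h)
    return h

# Toy usage mirroring the "7-4(*k)-5" split; each layer adds a residual update.
enc = [lambda h: h + 1] * 7           # N_E = 7
thk = [lambda h: h + 10] * 4          # N_T = 4 (same block reused every recursion)
dec = [lambda h: h + 100] * 5         # N_D = 5
# k = 2 applies the thinker block twice: 7*1 + 2*4*10 + 5*100 = 587
print(etd_forward(0, enc, thk, dec, k=2))
```

Note that only the loop count k changes between configurations; the layer lists, and hence the parameter count, stay fixed.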

2. Formal Recursive Reasoning

The recursive reasoning mechanism relies on parameter sharing and iterative latent state refinement:

  • At each step, the thinker block processes the hidden state, accumulating residual updates via its fixed parameter set.
  • The recursion produces a sequence h^1, …, h^k with h^i = T(h^{i−1}), deepening the latent representation relevant for reasoning.
  • After k iterations, the decoder block converts h^k into output logits.

This recursion is defined exactly as

    h^0 = E(x),    h^i = T(h^{i−1}) for i = 1, …, k,    logits = D(h^k).

Each residual block applies multi-head self-attention and MLP sublayers with the standard transformer update rules. The mechanism preserves the model’s original parameterization; the thinker block simply reuses its weights at every recursion step.
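A minimal numpy sketch of the tied recursion, under the simplifying assumption that a single residual sublayer stands in for the whole thinker block: every recursion step reuses the identical parameter matrix W, yet keeps refining the latent state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, d))   # one tied parameter set for the thinker

def thinker(h, W):
    # One residual update standing in for the thinker block; the *same*
    # W is applied at every recursion step (parameter tying).
    return h + np.tanh(h @ W)

h = rng.normal(size=d)                   # h^0, as produced by the encoder
states = [h]
for i in range(1, 4):                    # k = 3 recursive applications
    states.append(thinker(states[-1], W))

# Parameters are shared across steps, yet the latent state keeps moving:
# each h^i differs from h^{i-1}, i.e. the representation is being refined.
print([float(np.linalg.norm(states[i] - states[i - 1])) for i in range(1, 4)])
```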

3. Mid-Training Integration and Layer Selection

ETD is integrated during a supplementary "mid-training" phase, followed by standard finetuning or regular training. The method:

  • Starts from a pretrained model.
  • Allocates 5–10% of total training FLOPs and 1.25% of tokens for mid-training.
  • Replays original data and hyperparameters, replacing the single forward pass through the thinker layers with k tied recursive passes.
  • Maintains the original cross-entropy loss, with no additional regularization.

Layer selection is driven by per-layer angular distance analysis:

  • For activations h_ℓ at layer ℓ, compute the angular distance between consecutive layers,

        d(ℓ) = (1/π) · arccos( ⟨h_ℓ, h_{ℓ+1}⟩ / (‖h_ℓ‖ · ‖h_{ℓ+1}‖) ),

    averaged across many validation sequences.

  • Use the Kneedle algorithm to locate the “knee” in the curve d(ℓ) (where computation transitions from knowledge retrieval to reasoning) to set the thinker’s start boundary, with a mirrored analysis for its end boundary.
  • The intermediate block forms the thinker, T.
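The selection procedure above can be sketched as follows. `knee_index` implements only the geometric core of Kneedle (maximum distance from the chord joining the curve’s endpoints), not the full algorithm, and the activation curve used in the example is synthetic.

```python
import numpy as np

def angular_distance(h_a, h_b):
    """Angular distance (in [0, 1]) between activation vectors of
    consecutive layers; 0 = aligned, 0.5 = orthogonal."""
    cos = np.dot(h_a, h_b) / (np.linalg.norm(h_a) * np.linalg.norm(h_b))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def knee_index(curve):
    """Simplified knee finder: the point with maximum distance from the
    chord joining the curve's endpoints (the geometric idea behind
    Kneedle; not the full algorithm)."""
    y = np.asarray(curve, dtype=float)
    x = np.linspace(0.0, 1.0, len(y))
    y_norm = (y - y[0]) / (y[-1] - y[0]) if y[-1] != y[0] else y - y[0]
    return int(np.argmax(np.abs(y_norm - x)))

# Synthetic per-layer distance curve: flat "retrieval" region, then a jump
# where "reasoning" computation begins. The knee marks the boundary.
curve = [0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.40, 0.42, 0.44, 0.46]
print(knee_index(curve))
```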

4. Adaptive Depth via ACT Routing

Rather than using a fixed recursion depth k, ETD supports adaptive per-token recursion using Adaptive Computation Time (ACT):

  • At each iteration i, a router head computes a halting probability p^i from the current hidden state.
  • The cumulative halting score Σ_{j≤i} p^j determines stopping: recursion halts for a token once the score reaches a threshold of the form 1 − ε (for a small ε), or upon reaching a cap k_max.
  • The final state at the halting step is provided to the decoder.
  • Router parameters are learned end-to-end via standard token prediction loss.

This mechanism allows dynamic allocation of computation: demanding reasoning tasks elicit deeper recursion, while simple tokens incur minimal compute.
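A toy sketch of the halting rule described above, assuming a scalar hidden state and a router that returns a constant probability; real ACT routers are learned functions of the hidden state, and full ACT may also combine intermediate states rather than taking the last one.

```python
def act_recursion(h0, thinker, router, eps=0.01, k_max=8):
    """ACT-style adaptive recursion depth for one token.
    `thinker` maps h -> h; `router` maps h -> halting probability.
    Recursion stops once the cumulative halting score reaches 1 - eps,
    or after k_max steps. Returns the final state and steps used."""
    h, cum = h0, 0.0
    for step in range(1, k_max + 1):
        h = thinker(h)
        cum += router(h)
        if cum >= 1.0 - eps:
            break
    return h, step

# Toy token: the thinker nudges the state; the router emits a constant 0.4,
# so the cumulative score crosses 0.99 after 3 steps.
h, steps = act_recursion(
    h0=0.0,
    thinker=lambda h: h + 1.0,
    router=lambda h: 0.4,
)
print(steps)
```

A router that never gains confidence (always returning 0) would instead run to the cap k_max, which is how the mechanism bounds worst-case compute.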

5. Empirical Results and Benchmark Analysis

Extensive experiments on the OLMo-2 1B base using the ETD architecture show:

| Task Type | Baseline Accuracy | ETD Accuracy | Relative Gain |
|---|---|---|---|
| GSM8K (math) | 44.05 | 56.56 | +28.4% |
| MATH | 4.57 | 6.22 | +36.0% |
| Reading Comprehension | – | – | +12.1% |
| Multi-disciplinary | – | – | +12.4% |
| Commonsense | – | – | +4.8% |
| BBH | – | – | +5.3% |
| Factual Knowledge | – | – | ±1% (negligible) |

Ablation studies reveal that recursive processing confined to the reasoning-relevant thinker block provides a superior compute–accuracy trade-off versus looping all layers or middle-only recursion. Alternative start points for the recursive block were evaluated (e.g., 1–4(*2)–11 up to 11–4(*2)–1); optimal gains consistently occurred for the selection identified by angular distance (“7–4(*2)–5”).

The adaptive ACT strategy surpasses even maximal fixed-depth recursion on demanding reasoning tasks (e.g., DROP, OpenBookQA) while using fewer iterations on average. On tasks with diminishing marginal returns, ACT halts early and conserves computation.

6. Architectural Constraints and Compute Analysis

ETD preserves architectural and parameter fidelity:

  • No additional parameters are introduced, except the router head for ACT.
  • The main blocks E, T, and D retain the base model’s parameters; the thinker’s weights are simply shared across recursion steps.
  • Inference cost per token scales linearly with k; for OLMo-2 1B, per-token compute rises from 16 to 32 effective layer passes as k increases from 1 to 5.
  • The ACT strategy typically averages 3–6 recursive steps, reducing compute overhead for easy tokens.
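The cost arithmetic above can be checked directly; `layer_passes_per_token` is an illustrative helper introduced here, not from the paper.

```python
def layer_passes_per_token(n_e, n_t, n_d, k):
    """Effective sequential layer applications per token: the encoder
    and decoder run once each, the thinker block runs k times."""
    return n_e + k * n_t + n_d

# OLMo-2 1B "7-4(*k)-5" split: 16 layer passes at k=1, 32 at k=5,
# matching the reported linear growth in per-token compute.
print(layer_passes_per_token(7, 4, 5, k=1), layer_passes_per_token(7, 4, 5, k=5))
```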

To reproduce ETD, one requires access to the base model’s mid-training protocol and checkpoints, as well as the capability to tie parameters across the recursive thinker subset and insert the ACT router. This suggests potential for broad backward compatibility with existing transformer-based LLMs, contingent on training infrastructure.

7. Interpretation and Implications

The empirical and methodological findings characterize ETD as a mechanism for augmenting LLM reasoning without scaling parameters or training data. The recursive latent thinking focuses compute where it most contributes to performance, as determined by layer-wise interpretability and dynamic routing. A plausible implication is that future advances in LLM reasoning may further benefit from selective recursion and computation routing, rather than uniform scaling of architectures. These results demonstrate that in-the-loop amplification of reasoning, as realized by ETD, shifts the paradigm for test-time reasoning in large-scale neural LLMs (Koishekenov et al., 8 Oct 2025).
