Encode–Think–Decode (ETD) for LLMs
- Encode–Think–Decode (ETD) is a method that divides transformer layers into encoder, thinker, and decoder blocks to enable multi-step latent reasoning.
- ETD employs mid-training parameter tying and recursive processing to boost performance on reasoning benchmarks while preserving the original model architecture.
- ETD integrates an adaptive ACT routing mechanism to dynamically control recursion depth, balancing computational cost and reasoning accuracy.
Encode–Think–Decode (ETD) is a method for amplifying the reasoning capabilities of LLMs through recursive processing of a select subset of reasoning-relevant layers. ETD is designed for causal-decoder transformer architectures, using an explicit division of model layers into encoder, thinker, and decoder blocks. Unlike standard chain-of-thought scaling or architectural modifications, ETD leverages interpretability findings that reasoning-relevant computation is concentrated in a narrow layer range. The approach involves mid-training parameter tying within the thinker block and test-time recursion to enable multi-step latent reasoning, yielding substantial performance improvements on established reasoning benchmarks while preserving the original architecture, parameter count, training data, and hyperparameters (Koishekenov et al., 8 Oct 2025).
1. Pipeline Structure and Layer Assignments
ETD decomposes a transformer of total depth $L$ into three contiguous blocks:
- Encoder ($E$): The first $l_E$ layers map token embeddings into a latent space enriched with retrieved knowledge.
- Thinker ($T$): The subsequent $l_T$ layers form a core block whose parameters are tied and recursively applied $k$ times during inference and mid-training.
- Decoder ($D$): The final $l_D$ layers map refined latent states to vocabulary logits for output prediction.
The full pipeline for a token sequence $x$ proceeds as follows:
- $h^{(0)} = E(x)$ (after the encoder layers),
- $h^{(i)} = T(h^{(i-1)})$ for $i = 1, \dots, k$,
- Final decoding: $\text{logits} = D(h^{(k)})$, from which the next-token distribution is obtained.
A prototypical configuration for the OLMo-2 1B base model uses $l_E = 7$, $l_T = 4$, $l_D = 5$, denoted “7–4(*k)–5”. FLOPs per token scale with the effective layer count $l_E + k\,l_T + l_D$, so recursion increases inference cost linearly with $k$.
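The three-block pipeline can be sketched with toy stand-ins for the layer stacks. This is a minimal illustration of the control flow only: `make_block` and its scale factors are hypothetical placeholders, not real transformer blocks.

```python
from typing import Callable, List

Vector = List[float]

def make_block(scale: float) -> Callable[[Vector], Vector]:
    """Toy stand-in for a stack of transformer layers: a residual-style update."""
    return lambda h: [x + scale * x for x in h]

# Hypothetical 7-4(*k)-5 split: encoder, thinker, decoder stand-ins.
encoder = make_block(0.10)   # first 7 layers (abstracted)
thinker = make_block(0.05)   # middle 4 layers, reused on every recursive pass
decoder = make_block(0.02)   # last 5 layers (abstracted)

def etd_forward(x: Vector, k: int) -> Vector:
    """Encode once, apply the tied thinker k times, then decode."""
    h = encoder(x)            # h^(0) = E(x)
    for _ in range(k):        # h^(i) = T(h^(i-1)), i = 1..k
        h = thinker(h)
    return decoder(h)         # logits = D(h^(k))
```

Because the same `thinker` callable is reused on every pass, increasing `k` deepens the computation without adding parameters, mirroring the parameter-tying in ETD.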
2. Formal Recursive Reasoning
The recursive reasoning mechanism relies on parameter sharing and iterative latent state refinement:
- At each step, the thinker block processes the hidden state, accumulating residual updates via its fixed parameter set.
- The recursion produces a sequence $h^{(1)}, \dots, h^{(k)}$ with $h^{(i)} = T(h^{(i-1)})$, deepening the latent representation relevant for reasoning.
- After $k$ iterations, the decoder block converts $h^{(k)}$ into output logits.
This recursion is defined exactly as
$$h^{(0)} = E(x), \qquad h^{(i)} = T\big(h^{(i-1)}\big) \quad (i = 1, \dots, k), \qquad \text{logits} = D\big(h^{(k)}\big).$$
Each residual block applies multi-head self-attention and MLP sublayers, with standardized transformer update rules. The mechanism preserves the model’s original parameterization except for the recursive thinker block.
3. Mid-Training Integration and Layer Selection
ETD is integrated during a supplementary "mid-training" phase, followed by standard finetuning/regular training. The method:
- Starts from a pretrained model.
- Allocates 5–10% of total training FLOPs and 1.25% of tokens for mid-training.
- Replays original data and hyperparameters, replacing the single forward pass through the thinker layers with $k$ tied recursive passes.
- Maintains the original cross-entropy loss, with no additional regularization.
Layer selection is driven by per-layer angular distance analysis:
- For activations $h_\ell$ and $h_{\ell+1}$ at consecutive layers, compute the angular distance
$$d(\ell) = \frac{1}{\pi}\arccos\!\left(\frac{\langle h_\ell,\, h_{\ell+1}\rangle}{\lVert h_\ell\rVert\,\lVert h_{\ell+1}\rVert}\right)$$
across many validation sequences.
- Use the Kneedle algorithm to locate the “knee” in the curve $d(\ell)$ (where computation transitions from knowledge retrieval to reasoning) to set the thinker’s start layer, with its end layer obtained by mirroring.
- The intermediary block forms the thinker $T$.
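A minimal sketch of this selection procedure, assuming the standard angular-distance formula and substituting a simple max-distance-from-chord heuristic for the full Kneedle implementation (function names here are illustrative):

```python
import math

def angular_distance(u, v):
    """d = (1/pi) * arccos(cosine similarity) between consecutive-layer activations."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp for numerical safety
    return math.acos(cos) / math.pi

def find_knee(values):
    """Simplified stand-in for Kneedle: index with maximum vertical distance
    from the straight chord joining the curve's first and last points."""
    n = len(values)
    first, last = values[0], values[-1]
    best_i, best_gap = 0, float("-inf")
    for i, y in enumerate(values):
        chord = first + (last - first) * i / (n - 1)
        gap = abs(y - chord)
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i
```

Averaging `angular_distance` per layer over validation sequences yields the curve $d(\ell)$; `find_knee` then picks out the layer index where the curve bends, which ETD uses as the thinker boundary.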
4. Adaptive Depth via ACT Routing
Rather than using a fixed recursion depth $k$, ETD supports adaptive per-token recursion using Adaptive Computation Time (ACT):
- At each iteration $i$, a router head computes a halting probability $p_i$.
- The cumulative halting score $\sum_{j \le i} p_j$ determines stopping: recursion halts for a token once this score reaches a threshold $1 - \varepsilon$ (for small $\varepsilon$), or upon reaching a cap $N_{\max}$ on the number of iterations.
- The final latent state at the halting step is provided to the decoder.
- Router parameters are learned end-to-end via standard token prediction loss.
This mechanism allows dynamic allocation of computation: demanding reasoning tasks elicit deeper recursion, while simple tokens incur minimal compute.
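The halting loop above can be sketched as follows. The `router` argument is a placeholder callable standing in for the learned router head, and the threshold and cap defaults are illustrative, not the paper's settings:

```python
def act_recursion(h0, thinker, router, eps=0.01, max_steps=8):
    """Apply the tied thinker until the cumulative halting probability
    reaches 1 - eps, or until a hard cap is hit.

    Returns (final_state, steps_used)."""
    h, cum = h0, 0.0
    for step in range(1, max_steps + 1):
        h = thinker(h)
        cum += router(h)          # halting probability for this iteration
        if cum >= 1.0 - eps:      # cumulative score crossed the threshold
            break
    return h, step
```

A router that emits large halting probabilities for "easy" tokens ends recursion after one or two passes, while small probabilities let hard tokens run to the cap, which is the compute-allocation behavior described above.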
5. Empirical Results and Benchmark Analysis
Extensive experiments on the OLMo-2 1B base using the ETD architecture show:
| Task Type | Baseline Accuracy | ETD Accuracy | Relative Gain |
|---|---|---|---|
| GSM8K (math) | 44.05 | 56.56 | +28.4% |
| MATH | 4.57 | 6.22 | +36.0% |
| Reading Comprehension | — | — | +12.1% |
| Multi-disciplinary | — | — | +12.4% |
| Commonsense | — | — | +4.8% |
| BBH | — | — | +5.3% |
| Factual Knowledge | — | — | ±1% (negligible) |
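The relative-gain column follows directly from the accuracy pairs; for GSM8K, for example:

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (new - baseline) / baseline

# Reported GSM8K accuracies for OLMo-2 1B: 44.05 (baseline) -> 56.56 (ETD).
print(round(relative_gain(44.05, 56.56), 1))  # 28.4
```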
Ablation studies reveal that recursive processing confined to the reasoning-relevant thinker block provides a superior compute-accuracy trade-off versus looping all layers or middle-only recursion. Alternative start points for the recursive block were evaluated (e.g., 1–4(*2)–11 up to 11–4(*2)–1); optimal gains consistently occurred for the selection identified by angular distance (“7–4(*2)–5”).
The adaptive ACT strategy surpasses even maximal fixed-depth recursion on demanding reasoning tasks (e.g., DROP, OpenBookQA) while using fewer iterations on average. On tasks with diminishing marginal returns, ACT halts early and conserves computation.
6. Architectural Constraints and Compute Analysis
ETD preserves architectural and parameter fidelity:
- No additional parameters are introduced, except the router head for ACT.
- The encoder, thinker, and decoder blocks reuse the base model’s weights unchanged; only the thinker’s parameters are tied across recursive passes.
- Inference cost per token scales linearly with $k$; for OLMo-2 1B, the effective number of layer applications rises from 16 to 32 as $k$ increases from 1 to 5.
- ACT strategy typically averages 3–6 recursive steps, reducing compute overhead for easy tokens.
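Under the 7–4(*k)–5 split, the per-token cost in layer applications can be computed directly (a sketch counting layer passes only; actual FLOPs also depend on sequence length and model width):

```python
def effective_layers(l_enc: int, l_think: int, l_dec: int, k: int) -> int:
    """Layer applications per token under ETD with recursion depth k."""
    return l_enc + k * l_think + l_dec

# 7-4(*k)-5 split of the 16-layer OLMo-2 1B:
assert effective_layers(7, 4, 5, 1) == 16   # k = 1 matches the base model
assert effective_layers(7, 4, 5, 5) == 32   # cost doubles at k = 5
```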
To reproduce ETD, one requires access to the base model’s mid-training protocol and checkpoints, as well as the capability to tie parameters across the recursive thinker subset and insert the ACT router. This suggests potential for broad backward compatibility with existing transformer-based LLMs, contingent on training infrastructure.
7. Interpretation and Implications
The empirical and methodological findings characterize ETD as a mechanism for augmenting LLM reasoning without scaling parameters or training data. The recursive latent thinking focuses compute where it most contributes to performance, as determined by layer-wise interpretability and dynamic routing. A plausible implication is that future advances in LLM reasoning may further benefit from selective recursion and computation routing, rather than uniform scaling of architectures. These results demonstrate that in-the-loop amplification of reasoning, as realized by ETD, shifts the paradigm for test-time reasoning in large-scale neural LLMs (Koishekenov et al., 8 Oct 2025).