
Decoder-Only Transformer Models

Updated 12 February 2026
  • Decoder-only Transformer models are neural architectures that use causal self-attention and feed-forward layers to autoregressively predict tokens.
  • They integrate enhancements like rotary position embeddings, dynamic layer selection, and context compression to optimize performance and efficiency.
  • Scaling laws and Turing-completeness results highlight their theoretical universality and practical impact in state-of-the-art language generation.

A decoder-only Transformer LLM is a neural architecture that autoregressively predicts tokens in a sequence using a stack of self-attention and feed-forward layers, each operating under a causal (left-to-right) constraint. This architecture underpins the state-of-the-art generation capabilities of modern LLMs such as GPT, OPT, and their derivatives across diverse domains. The following sections review core architectural principles, mathematical properties, inference and efficiency optimizations, scaling behaviors, and research directions, drawing extensively from recent arXiv literature.

1. Core Architecture and Mathematical Formalism

A standard decoder-only Transformer consists of L sequentially stacked blocks, each containing multi-head self-attention (MHSA) with causal masking, a feed-forward sublayer (often a SwiGLU or other gated variant), and normalization layers (typically RMSNorm or LayerNorm in a pre-norm configuration). The model processes an input sequence x_{1:T} as follows:

  • The initial embedding e(x_{1:T}) encodes tokens and positions (usually via rotary position encodings in modern LLMs).
  • Each decoder block d^l updates hidden states using

h^l = d^l(h^{l-1}), \quad \text{where} \quad h^0 = e(x_{1:T}).

  • Only left-context (tokens x_{<t}) is available at position t, enforced by a causal mask with M_{ij}^{\text{causal}} = 0 if j \leq i and -\infty otherwise.

At the output, a softmax classifier predicts the next token:

p(x_t \mid x_{<t}) = \mathrm{softmax}(W^\top h_{t-1}^{L}).

This uni-directional setup ensures token generation is strictly autoregressive. The input and output embedding matrices are typically tied for parameter efficiency (Zhang et al., 30 Oct 2025).
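The causal masking and softmax steps above can be sketched as a minimal single-head attention pass. This is an illustrative NumPy-only sketch (function and variable names are our own); real implementations use multiple heads, output projections, and fused kernels.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (T, d) sequence (toy sketch)."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)                     # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where j > i
    scores = np.where(mask, -np.inf, scores)          # M_ij = -inf for future tokens
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because of the mask, position 0 can attend only to itself, and every position t mixes values only from positions ≤ t.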

Recent works adopt enhancement strategies:

  • Rotary position embeddings facilitate context-length extrapolation and more stable optimization for long-range dependencies.
  • SwiGLU or similar gated feed-forward networks improve expressivity and gradient flow.
  • Grouped-query and flash attention variants are used for efficiency at scale.
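As one illustration of these enhancements, rotary position embeddings rotate pairs of channels by position-dependent angles, so query-key dot products depend on relative offsets. A minimal sketch follows (`apply_rope` is a hypothetical helper; the channel-pairing convention varies between implementations):

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to a (T, d) tensor, d even.

    Channels are paired (first half with second half here); each pair is
    rotated by an angle proportional to the token position.
    """
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation rates
    angles = np.outer(np.arange(T), freqs)      # (T, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Two useful properties fall out directly: position 0 is left unchanged (zero rotation), and rotations preserve vector norms.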

2. Theoretical Expressivity and Universality

Under the assumptions of arbitrary precision, rational weights, and hard (non-softmax) attention, a single-layer, single-head decoder-only transformer can simulate any RNN and is thus Turing complete, as formalized in (Roberts, 2023). The core proof relies on embedding both the input and the RNN hidden state into the high-dimensional transformer state, using attention to select relevant context and a feed-forward network for state updates. The required property is d_{\text{model}} > 2 d_{\text{embed}}, which allows disjoint storage of the input and the hidden state.

The decoder-only model is structurally equivalent to a causal B machine (Wang): a single-tape Turing machine that can read any prior cell but only append new outputs. This formal result indicates that, in principle, even restricted to causal self-attention, the architecture is computationally universal provided sufficient dimension and model depth. In practice, the need for parameter efficiency and single-pass computation motivates deep, multi-head stacks instead of recursive construction.

3. Scaling Laws and Model Capacity

Empirical investigations demonstrate that decoder-only LLMs exhibit classic power-law and Chinchilla-style scaling in loss as a function of parameter and data size (Caillaut et al., 2024, Zhang et al., 30 Oct 2025). The cross-entropy loss L on test data follows

L(N) = \alpha N^{-p} + \beta,

where N is the parameter count. For joint scaling with data,

L(N, D) = E + \frac{a}{N^\alpha} + \frac{b}{D^\beta},

where D is the number of training examples.
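For illustration, the single-variable law can be fitted by ordinary least squares in log space once the irreducible loss β is fixed, since log(L − β) is then linear in log N. The constants below are synthetic, not values from the cited papers; real fits use measured losses and typically estimate β jointly (e.g. via nonlinear least squares).

```python
import numpy as np

# Hypothetical, noiseless data generated from L(N) = alpha * N**-p + beta.
alpha_true, p_true, beta = 406.4, 0.34, 1.69   # illustrative constants only
N = np.array([1e7, 1e8, 1e9, 1e10])            # parameter counts
L = alpha_true * N ** -p_true + beta           # "observed" losses

# With beta known: log(L - beta) = log(alpha) - p * log(N), a linear fit.
slope, intercept = np.polyfit(np.log(N), np.log(L - beta), 1)
p_hat, alpha_hat = -slope, np.exp(intercept)
```

On noiseless synthetic data the fit recovers the generating exponent exactly; with real measurements one would also report confidence intervals.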

Critically, for both general and multilingual machine translation tasks, scaling exponents are nearly identical to those of encoder-decoder LMs. However, scaling laws do not reliably extrapolate beyond the largest observed model/data regime or transfer across domains without adaptation (Caillaut et al., 2024).

Empirical studies find that depth or width increases contribute similarly to loss reduction for fixed FLOP budgets, but width increases yield greater throughput on GPUs, whereas depth increases may be preferable for certain parallelization constraints.

4. Inference, Efficiency, and Architectural Variants

4.1 Memory and Speed Optimizations

The critical efficiency challenge for decoder-only models is the quadratic time and memory complexity of self-attention, particularly at long context lengths. To address this:

  • YOCO ("You Only Cache Once"): Splits the stack into a self-decoder (with efficient attention such as sliding-window or gated retention) and a cross-decoder that attends exclusively to the global key-value cache generated by the self-decoder (Sun et al., 2024). This reduces the KV-cache requirement from \mathcal{O}(L N d) to \mathcal{O}(N d) and dramatically accelerates prefill, scaling to million-token contexts with high retrieval accuracy.
  • Dynamic Layer Selection: Dynamic inference methods such as layer skipping and early exiting can dramatically reduce inference cost (Glavas et al., 2024). Layer skipping, especially with uniform schedules or per-sequence allocation via an oracle controller, preserves model accuracy while using only a fraction of the layers. Hidden-state–based controllers offer little advantage over token-agnostic ones in current systems.
  • Context Compression: Dodo introduces dynamic contextual compression, retaining only selected "nugget" hidden states past a designated layer (Qin et al., 2023). This method achieves nearly lossless autoencoding at 10–20× compression and substantially reduces both memory and compute costs for long-context applications.
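A uniform layer-skipping schedule of the kind studied in (Glavas et al., 2024) can be sketched as follows. This is a hypothetical helper, not the paper's implementation: skipped blocks simply act as identities on the residual stream, which is why pre-norm residual stacks degrade gracefully under skipping.

```python
import numpy as np

def forward_with_layer_skipping(h, layers, keep_ratio=0.5):
    """Run a stack of decoder blocks, executing only a uniform subset.

    `layers` is a list of callables mapping hidden states to a residual
    update; blocks not in the uniform schedule are skipped entirely.
    """
    L = len(layers)
    n_keep = max(1, round(L * keep_ratio))
    # evenly spaced layer indices to execute (always includes first and last)
    kept = set(np.linspace(0, L - 1, n_keep).round().astype(int))
    for i, layer in enumerate(layers):
        if i in kept:
            h = h + layer(h)   # pre-norm residual update
    return h
```

With `keep_ratio=0.5` the forward pass executes half the depth; a per-sequence controller would instead choose `keep_ratio` per input.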

4.2 Parameter and Architectural Compression

Variants designed for smaller/faster inference include:

  • ParallelGPT (p-gpt): Splits layer stacks into parallel subcomponents with learnable fusion, trading small parameter increases for parallelization and dynamic inference flexibility.
  • LinearlyCompressedGPT (lc-gpt) and ConvCompressedGPT (cc-gpt): Interleave progressive hidden-size halving (via linear/conv layers) between block groups, yielding 36% parameter reduction and ∼20% training speedup with minimal perplexity degradation (Suresh et al., 2024).
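The progressive hidden-size halving of lc-gpt can be sketched as linear down-projections interleaved between groups of blocks. This is a hypothetical structure for illustration: in the actual model the projection matrices are learned during training, not sampled as done here.

```python
import numpy as np

def compressed_stack(h, block_groups, rng=None):
    """Sketch of lc-gpt-style progressive compression (illustrative only).

    Between consecutive groups of decoder blocks, a linear projection
    halves the hidden size, shrinking every later block's parameters.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    for i, group in enumerate(block_groups):
        for block in group:
            h = block(h)
        if i < len(block_groups) - 1:                   # no projection after last group
            d = h.shape[-1]
            W = rng.normal(scale=d ** -0.5, size=(d, d // 2))  # learned in practice
            h = h @ W                                   # halve the hidden dimension
    return h
```

Each halving reduces the per-token activation memory and the weight count of all subsequent blocks, at the cost of a potential information bottleneck.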

4.3 Improved Masking and Position Encoding

Standard causal masking enforces attention to all prior tokens, even when unnecessary, resulting in attention sinks and limitations in absolute position encoding. The StableMask technique introduces upper-triangular "pseudo-attention" slots with a decaying mask ratio to both resolve the disproportional-attention issue and recover absolute position indices in a parameter-free, extrapolation-friendly manner (Yin et al., 2024).

5. Training Schemes, Tuning, and Representation Learning

Decoder-only transformers are universally trained via the causal language modeling objective:

\mathcal{L}_{\text{causal}}(\theta) = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}; \theta)

Pretraining with large-scale, diverse data in this regime enables both strong in-context learning and zero/few-shot transfer (Zhang et al., 30 Oct 2025).
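The causal objective above reduces to a mean next-token negative log-likelihood, which can be computed directly from logits. A minimal, numerically stabilized sketch (function name is our own):

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Mean negative log-likelihood of observed next tokens.

    logits:  (T, V) unnormalized scores over the vocabulary per position
    targets: (T,)   integer ids of the tokens actually observed
    """
    # log-softmax with max subtraction for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick out the log-probability assigned to each target token
    return -log_probs[np.arange(len(targets)), targets].mean()
```

As a sanity check, uniform logits over a vocabulary of size V give a loss of log V regardless of the targets.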

  • Instruction Tuning: Fine-tuning on instruction datasets (e.g., FLAN) using standard next-token loss further closes the quality gap with encoder-decoder models on downstream tasks. Optional bidirectional input attention (+BiAttn) improves few/zero-shot performance but is not intrinsic to the decoder-only paradigm.
  • Universal Text Encoders via LLM2Vec: Decoder-only LMs may be "unlocked" as strong universal encoders using a three-step procedure: (1) re-enable bidirectional attention, (2) masked next-token prediction (MNTP), and (3) unsupervised contrastive learning (SimCSE) (BehnamGhader et al., 2024). This transformation yields state-of-the-art performance on text embedding benchmarks, outperforming typical encoder-only models even without supervised adaptation.
  • Uncertainty Calibration and Selective Generation: Final-layer extension architectures such as SDM-LMs augment base LMs with a small SDM head (1-D convolutional + linear), quantifying similarity, distance, and magnitude of activations relative to training support vectors (Schmaltz, 30 Oct 2025). This approach admits a substantially higher fraction of in-distribution prompts while maintaining high accuracy (\geq 0.95), sharply reducing over-abstention compared to baseline contrastive fine-tuning.
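The unsupervised contrastive step of LLM2Vec follows the SimCSE recipe: two dropout-perturbed encodings of each sentence form a positive pair, and other in-batch encodings serve as negatives. A minimal sketch of the loss (function name and temperature value are illustrative):

```python
import numpy as np

def simcse_loss(z1, z2, temperature=0.05):
    """Unsupervised SimCSE-style contrastive loss (sketch).

    z1, z2: (B, d) embeddings of the same B sentences under two different
    dropout masks; matching rows are positives, all others negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sims = (z1 @ z2.T) / temperature                 # (B, B) scaled cosine sims
    sims = sims - sims.max(axis=1, keepdims=True)    # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                # cross-entropy on the diagonal
```

When the two views of each sentence coincide and are mutually orthogonal, the loss is driven toward zero.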

6. Adaptation, Robustness, and Data-Driven Interventions

Recent studies implement synthetic data interventions (SDI) to control undesirable behaviors such as sycophancy—model agreement with erroneous user claims resulting from RLHF fine-tuning (Wang, 2024). SDI augments training with paraphrased, adversarial, and authority-framed examples. Empirically, SDI increases accuracy (85% → 91%) and decreases sycophancy rates (7% → 5%), but also mildly lowers response helpfulness scores. The results validate that robust LLMs can be obtained purely via distributional augmentation without altering core Transformer layers.

Layer-skipping and early exit methods for dynamic depth exhibit different robustness characteristics: pre-trained decoder-only models are notably more stable under deterministic layer skipping than under early exit, as the former preserves the functional role of residual paths (Glavas et al., 2024). Sequence-level controllers for dynamic computation are orders of magnitude more effective than per-token gating with current architectures.

7. Trade-offs, Limitations, and Field-Wide Implications

  • Efficiency vs. Expressivity: Decoder-only models dominate the compute-optimal scaling frontier for pretraining losses, yet suffer from increased inference cost (~1.4–2× per-token FLOPs) and steeper attention locality decay, limiting context-length extrapolation relative to encoder-decoder models (Zhang et al., 30 Oct 2025).
  • Model Design: The optimal configuration of depth vs. width and data size for a target FLOP budget is highly task- and domain-dependent; width scaling is preferable when maximizing GPU throughput in the low multiple regime (Caillaut et al., 2024).
  • Architectural Simplicity: The conceptual simplicity and homogeneous stack of decoder-only models facilitate adaptation, progressive dynamic-routing, compression, and integration of efficient attention mechanisms, explaining their adoption as the de facto LLM backbone.
  • Theoretical Guarantees: The B-machine equivalence and Turing completeness of the architecture underscore its universality; however, practical limitations such as the inability to freely overwrite past memory, and potential parameter inefficiencies, may motivate future designs that hybridize causal and editable memory paradigms (Roberts, 2023).

Table: Summary of Optimization and Extension Strategies

| Technique | Efficiency Impact | Preservation of Accuracy |
| --- | --- | --- |
| YOCO | Reduces KV cache by ~1/L | Matches or slightly outperforms standard Transformer on downstream tasks |
| Layer Skipping (oracle ULS) | Reduces average depth to 23% | No ROUGE drop with per-sequence control |
| Context Compression (Dodo) | 10–20× context compression | ~99% BLEU in autoencoding |
| p-gpt / lc-gpt / cc-gpt | ~36% parameter reduction | <2% perplexity increase |
| StableMask | Fixes attention sinks, improves extrapolation | Reduces downstream PPL, recovers absolute position |
| LLM2Vec | Unlocks encoding power | New SOTA on public MTEB benchmarks |
| SDM-LM | Lowers abstentions, improves calibration | Maintains ≥ 0.95 admitted accuracy |
| SDI (sycophancy) | Reduces sycophancy | Increases factual accuracy |
