
Decoder-Only Transformer Models

Updated 12 February 2026
  • Decoder-only Transformer models are neural architectures that use causal self-attention and feed-forward layers to autoregressively predict tokens.
  • They integrate enhancements like rotary position embeddings, dynamic layer selection, and context compression to optimize performance and efficiency.
  • Scaling laws and Turing-completeness results highlight their theoretical universality and practical impact in state-of-the-art language generation.

A decoder-only Transformer LLM is a neural architecture that autoregressively predicts tokens in a sequence using a stack of self-attention and feed-forward layers, each operating under a causal (left-to-right) constraint. This architecture underpins the state-of-the-art generation capabilities of modern LLMs such as GPT, OPT, and their derivatives across diverse domains. The following sections review core architectural principles, mathematical properties, inference and efficiency optimizations, scaling behaviors, and research directions, drawing extensively from recent arXiv literature.

1. Core Architecture and Mathematical Formalism

A standard decoder-only Transformer consists of L sequentially stacked blocks, each containing multi-head self-attention (MHSA) with causal masking, a feed-forward sublayer (often a SwiGLU or other gated variant), and normalization layers (typically RMSNorm or LayerNorm in a pre-norm configuration). The model processes an input sequence x_{1:T} as follows:

  • The initial embedding e(x_{1:T}) encodes tokens and positions (usually via rotary position encodings in modern LLMs).
  • Each decoder block d^l updates hidden states using

h^l = d^l(h^{l-1}), \quad \text{where} \quad h^0 = e(x_{1:T}).

  • Only left-context (tokens x_{<t}) is available at position t, enforced by a causal mask with M_{ij}^{\text{causal}} = 0 if j \leq i and -\infty otherwise.

At the output, a softmax classifier predicts the next token:

p(x_t \mid x_{<t}) = \mathrm{softmax}(W^\top h_{t-1}^{L}).

This uni-directional setup ensures token generation is strictly autoregressive. The input and output embedding matrices are typically tied for parameter efficiency (Zhang et al., 30 Oct 2025).
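The causal masking and softmax steps above can be sketched as a minimal single-head attention pass. This is an illustrative NumPy-only sketch (function and variable names are our own); real implementations use multiple heads, output projections, and fused kernels.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (T, d) sequence (toy sketch)."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)                     # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where j > i
    scores = np.where(mask, -np.inf, scores)          # M_ij = -inf for future tokens
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because of the mask, position 0 can attend only to itself, and every position t mixes values only from positions ≤ t.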

Recent works adopt enhancement strategies:

  • Rotary position embeddings facilitate context-length extrapolation and more stable optimization for long-range dependencies.
  • SwiGLU or similar gated feed-forward networks improve expressivity and gradient flow.
  • Grouped-query and flash attention variants are used for efficiency at scale.
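As one illustration of these enhancements, rotary position embeddings rotate pairs of channels by position-dependent angles, so query-key dot products depend on relative offsets. A minimal sketch follows (`apply_rope` is a hypothetical helper; the channel-pairing convention varies between implementations):

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to a (T, d) tensor, d even.

    Channels are paired (first half with second half here); each pair is
    rotated by an angle proportional to the token position.
    """
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation rates
    angles = np.outer(np.arange(T), freqs)      # (T, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Two useful properties fall out directly: position 0 is left unchanged (zero rotation), and rotations preserve vector norms.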

2. Theoretical Expressivity and Universality

Under the assumptions of arbitrary precision, rational weights, and hard (non-softmax) attention, a single-layer, single-head decoder-only transformer can simulate any RNN and is thus Turing complete, as formalized in (Roberts, 2023). The core proof relies on embedding both the input and the RNN hidden state into the high-dimensional transformer state, using attention to select relevant context and a feed-forward network for state updates. The required property is d_{\text{model}} > 2 d_{\text{embed}}, which allows disjoint storage of the input and the hidden state.

The decoder-only model is structurally equivalent to a causal B machine (Wang): a single-tape Turing machine that can read any prior cell but only append new outputs. This formal result indicates that, in principle, even restricted to causal self-attention, the architecture is computationally universal provided sufficient dimension and model depth. In practice, the need for parameter efficiency and single-pass computation motivates deep, multi-head stacks instead of recursive construction.

3. Scaling Laws and Model Capacity

Empirical investigations demonstrate that decoder-only LLMs exhibit classic power-law and Chinchilla-style scaling in loss as a function of parameter and data size (Caillaut et al., 2024, Zhang et al., 30 Oct 2025). The cross-entropy loss L on test data follows

L(N) = \alpha N^{-p} + \beta,

where N is the parameter count. For joint scaling with data,

L(N, D) = E + \frac{a}{N^\alpha} + \frac{b}{D^\beta},

where D is the number of training examples.
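For illustration, the single-variable law can be fitted by ordinary least squares in log space once the irreducible loss β is fixed, since log(L − β) is then linear in log N. The constants below are synthetic, not values from the cited papers; real fits use measured losses and typically estimate β jointly (e.g. via nonlinear least squares).

```python
import numpy as np

# Hypothetical, noiseless data generated from L(N) = alpha * N**-p + beta.
alpha_true, p_true, beta = 406.4, 0.34, 1.69   # illustrative constants only
N = np.array([1e7, 1e8, 1e9, 1e10])            # parameter counts
L = alpha_true * N ** -p_true + beta           # "observed" losses

# With beta known: log(L - beta) = log(alpha) - p * log(N), a linear fit.
slope, intercept = np.polyfit(np.log(N), np.log(L - beta), 1)
p_hat, alpha_hat = -slope, np.exp(intercept)
```

On noiseless synthetic data the fit recovers the generating exponent exactly; with real measurements one would also report confidence intervals.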

Critically, for both general and multilingual machine translation tasks, scaling exponents are nearly identical to those of encoder-decoder LMs. However, scaling laws do not reliably extrapolate beyond the largest observed model/data regime or transfer across domains without adaptation (Caillaut et al., 2024).

Empirical studies find that depth or width increases contribute similarly to loss reduction for fixed FLOP budgets, but width increases yield greater throughput on GPUs, whereas depth increases may be preferable for certain parallelization constraints.

4. Inference, Efficiency, and Architectural Variants

4.1 Memory and Speed Optimizations

The critical efficiency challenge for decoder-only models is the quadratic time and memory complexity of self-attention, particularly at long context lengths. To address this:

  • YOCO ("You Only Cache Once"): Splits the stack into a self-decoder (with efficient attention such as sliding-window or gated retention) and a cross-decoder that attends exclusively to the global key-value cache generated by the self-decoder (Sun et al., 2024). This reduces the KV-cache requirement from \mathcal{O}(L N d) to \mathcal{O}(N d) and dramatically accelerates prefill, scaling to million-token contexts with high retrieval accuracy.
  • Dynamic Layer Selection: Dynamic inference methods such as layer skipping and early exiting can dramatically reduce inference cost (Glavas et al., 2024). Layer skipping, especially with uniform schedules or per-sequence allocation via an oracle controller, preserves model accuracy while using only a fraction of the layers. Hidden-state–based controllers offer little advantage over token-agnostic ones in current systems.
  • Context Compression: Dodo introduces dynamic contextual compression, retaining only selected "nugget" hidden states past a designated layer (Qin et al., 2023). This method achieves nearly lossless autoencoding at 10–20× compression and substantially reduces both memory and compute costs for long-context applications.
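A uniform layer-skipping schedule of the kind studied in (Glavas et al., 2024) can be sketched as follows. This is a hypothetical helper, not the paper's implementation: skipped blocks simply act as identities on the residual stream, which is why pre-norm residual stacks degrade gracefully under skipping.

```python
import numpy as np

def forward_with_layer_skipping(h, layers, keep_ratio=0.5):
    """Run a stack of decoder blocks, executing only a uniform subset.

    `layers` is a list of callables mapping hidden states to a residual
    update; blocks not in the uniform schedule are skipped entirely.
    """
    L = len(layers)
    n_keep = max(1, round(L * keep_ratio))
    # evenly spaced layer indices to execute (always includes first and last)
    kept = set(np.linspace(0, L - 1, n_keep).round().astype(int))
    for i, layer in enumerate(layers):
        if i in kept:
            h = h + layer(h)   # pre-norm residual update
    return h
```

With `keep_ratio=0.5` the forward pass executes half the depth; a per-sequence controller would instead choose `keep_ratio` per input.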

4.2 Parameter and Architectural Compression

Variants designed for smaller/faster inference include:

  • ParallelGPT (p-gpt): Splits layer stacks into parallel subcomponents with learnable fusion, trading small parameter increases for parallelization and dynamic inference flexibility.
  • LinearlyCompressedGPT (lc-gpt) and ConvCompressedGPT (cc-gpt): Interleave progressive hidden-size halving (via linear/conv layers) between block groups, yielding 36% parameter reduction and ∼20% training speedup with minimal perplexity degradation (Suresh et al., 2024).
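The progressive hidden-size halving of lc-gpt can be sketched as linear down-projections interleaved between groups of blocks. This is a hypothetical structure for illustration: in the actual model the projection matrices are learned during training, not sampled as done here.

```python
import numpy as np

def compressed_stack(h, block_groups, rng=None):
    """Sketch of lc-gpt-style progressive compression (illustrative only).

    Between consecutive groups of decoder blocks, a linear projection
    halves the hidden size, shrinking every later block's parameters.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    for i, group in enumerate(block_groups):
        for block in group:
            h = block(h)
        if i < len(block_groups) - 1:                   # no projection after last group
            d = h.shape[-1]
            W = rng.normal(scale=d ** -0.5, size=(d, d // 2))  # learned in practice
            h = h @ W                                   # halve the hidden dimension
    return h
```

Each halving reduces the per-token activation memory and the weight count of all subsequent blocks, at the cost of a potential information bottleneck.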

4.3 Improved Masking and Position Encoding

Standard causal masking enforces attention to all prior tokens, even when unnecessary, resulting in attention sinks and limitations in absolute position encoding. The StableMask technique introduces upper-triangular "pseudo-attention" slots with a decaying mask ratio to both resolve the disproportional-attention issue and recover absolute position indices in a parameter-free, extrapolation-friendly manner (Yin et al., 2024).

5. Training Schemes, Tuning, and Representation Learning

Decoder-only transformers are universally trained via the causal language modeling objective:

\mathcal{L}_{\text{causal}}(\theta) = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}; \theta)

Pretraining with large-scale, diverse data in this regime enables both strong in-context learning and zero/few-shot transfer (Zhang et al., 30 Oct 2025).
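The causal objective above reduces to a mean next-token negative log-likelihood, which can be computed directly from logits. A minimal, numerically stabilized sketch (function name is our own):

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Mean negative log-likelihood of observed next tokens.

    logits:  (T, V) unnormalized scores over the vocabulary per position
    targets: (T,)   integer ids of the tokens actually observed
    """
    # log-softmax with max subtraction for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick out the log-probability assigned to each target token
    return -log_probs[np.arange(len(targets)), targets].mean()
```

As a sanity check, uniform logits over a vocabulary of size V give a loss of log V regardless of the targets.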

  • Instruction Tuning: Fine-tuning on instruction datasets (e.g., FLAN) using standard next-token loss further closes the quality gap with encoder-decoder models on downstream tasks. Optional bidirectional input attention (+BiAttn) improves few/zero-shot performance but is not intrinsic to the decoder-only paradigm.
  • Universal Text Encoders via LLM2Vec: Decoder-only LMs may be "unlocked" as strong universal encoders using a three-step procedure: (1) re-enable bidirectional attention, (2) masked next-token prediction (MNTP), and (3) unsupervised contrastive learning (SimCSE) (BehnamGhader et al., 2024). This transformation yields state-of-the-art performance on text embedding benchmarks, outperforming typical encoder-only models even without supervised adaptation.
  • Uncertainty Calibration and Selective Generation: Final-layer extension architectures such as SDM-LMs augment base LMs with a small SDM head (1-D convolutional + linear), quantifying similarity, distance, and magnitude of activations relative to training support vectors (Schmaltz, 30 Oct 2025). This approach admits a substantially higher fraction of in-distribution prompts while maintaining high accuracy (\geq 0.95), sharply reducing over-abstention compared to baseline contrastive fine-tuning.
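The unsupervised contrastive step of LLM2Vec follows the SimCSE recipe: two dropout-perturbed encodings of each sentence form a positive pair, and other in-batch encodings serve as negatives. A minimal sketch of the loss (function name and temperature value are illustrative):

```python
import numpy as np

def simcse_loss(z1, z2, temperature=0.05):
    """Unsupervised SimCSE-style contrastive loss (sketch).

    z1, z2: (B, d) embeddings of the same B sentences under two different
    dropout masks; matching rows are positives, all others negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sims = (z1 @ z2.T) / temperature                 # (B, B) scaled cosine sims
    sims = sims - sims.max(axis=1, keepdims=True)    # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                # cross-entropy on the diagonal
```

When the two views of each sentence coincide and are mutually orthogonal, the loss is driven toward zero.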

6. Adaptation, Robustness, and Data-Driven Interventions

Recent studies implement synthetic data interventions (SDI) to control undesirable behaviors such as sycophancy—model agreement with erroneous user claims resulting from RLHF fine-tuning (Wang, 2024). SDI augments training with paraphrased, adversarial, and authority-framed examples. Empirically, SDI increases accuracy (85% → 91%) and decreases sycophancy rates (7% → 5%), but also mildly lowers response helpfulness scores. The results validate that robust LLMs can be obtained purely via distributional augmentation without altering core Transformer layers.

Layer-skipping and early exit methods for dynamic depth exhibit different robustness characteristics: pre-trained decoder-only models are notably more stable under deterministic layer skipping than under early exit, as the former preserves the functional role of residual paths (Glavas et al., 2024). Sequence-level controllers for dynamic computation are orders of magnitude more effective than per-token gating with current architectures.

7. Trade-offs, Limitations, and Field-Wide Implications

  • Efficiency vs. Expressivity: Decoder-only models dominate the compute-optimal scaling frontier for pretraining losses, yet suffer from increased inference cost (~1.4–2× per-token FLOPs) and steeper attention locality decay, limiting context-length extrapolation relative to encoder-decoder models (Zhang et al., 30 Oct 2025).
  • Model Design: The optimal configuration of depth vs. width and data size for a target FLOP budget is highly task- and domain-dependent; width scaling is preferable when maximizing GPU throughput in the low multiple regime (Caillaut et al., 2024).
  • Architectural Simplicity: The conceptual simplicity and homogeneous stack of decoder-only models facilitate adaptation, progressive dynamic-routing, compression, and integration of efficient attention mechanisms, explaining their adoption as the de facto LLM backbone.
  • Theoretical Guarantees: The B-machine equivalence and Turing completeness of the architecture underscore its universality; however, practical limitations such as the inability to freely overwrite past memory, and potential parameter inefficiencies, may motivate future designs that hybridize causal and editable memory paradigms (Roberts, 2023).

Table: Summary of Optimization and Extension Strategies

| Technique | Efficiency Impact | Preservation of Accuracy |
| --- | --- | --- |
| YOCO | Reduces KV cache by ~1/L | Matches or slightly outperforms standard Transformer on downstream tasks |
| Layer Skipping (oracle ULS) | Reduces average depth to 23% | No ROUGE drop with per-sequence control |
| Context Compression (Dodo) | 10–20× context compression | ~99% BLEU in autoencoding |
| p-gpt / lc-gpt / cc-gpt | ~36% parameter reduction | <2% perplexity increase |
| StableMask | Fixes attention sinks, improves extrapolation | Reduces downstream PPL, recovers absolute position |
| LLM2Vec | Unlocks encoding power | New SOTA on public MTEB benchmarks |
| SDM-LM | Lowers abstentions, improves calibration | Maintains ≥ 0.95 admitted accuracy |
| SDI (sycophancy) | Reduces sycophancy | Increases factual accuracy |
