Decoder-Only Transformers
- Decoder-Only Transformers are autoregressive models that employ causal self-attention and feedforward layers without an encoder, enabling efficient token-by-token generation.
- Advances like StableMask, YOCO, and dynamic layer selection optimize computation and enable scaling to million-token contexts with reduced inference costs.
- Research demonstrates these models achieve Turing completeness and are adaptable to multilingual and multimodal tasks, offering robust performance across diverse applications.
Decoder-only Transformers are autoregressive sequence models built entirely from stacks of causal self-attention and feedforward layers, omitting any encoder stage or bidirectional attention. This architecture, first exemplified by GPT, has become the dominant paradigm for LLMs and is rapidly extending to domains such as vision, control, and structured prediction. Their unidirectional, token-by-token generation aligns naturally with next-token prediction objectives, supporting highly efficient pretraining and inference workflows. Recent research explores their theoretical properties, architectural variants, computational optimizations, and application-driven adaptations.
1. Architectural Principles and Mathematical Formulation
A decoder-only Transformer processes an input sequence with a stack of identical blocks; each block comprises multi-head masked self-attention (with causal/upper-triangular masking) followed by a residual feed-forward sublayer. For a layer with input $X \in \mathbb{R}^{n \times d}$, the self-attention scores are $A = QK^\top/\sqrt{d_k}$ with $Q = XW_Q$, $K = XW_K$, $V = XW_V$, where $W_Q, W_K, W_V$ are learned linear projections of the input embeddings, $A$ is the attention score matrix, and masking enforces causality ($A_{ij} = -\infty$ for $j > i$). With relative positional encoding (RPE), $A$ is modified either additively (e.g., $A_{ij} + b_{i-j}$, as in ALiBi) or multiplicatively (e.g., position-dependent rotations of queries and keys, as in RoPE). The output of each layer is $H = X + \mathrm{softmax}(A)\,V W_O$ followed by $X' = H + \mathrm{FFN}(H)$, with layer normalization applied around each sublayer.
This structure is applied identically across language, vision, and structured prediction, with tokenization and embedding design adapting to each modality (Yin et al., 7 Feb 2024, Fujitake, 2023, Pang et al., 2 Dec 2024).
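As a concrete illustration of the block structure formalized above, the following is a minimal PyTorch sketch of a single decoder block with causal masking; the single-head simplification, pre-norm placement, and all dimension choices are illustrative assumptions rather than the configuration of any cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Single-head masked self-attention: each position attends only to itself and earlier positions."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (batch, seq, d_model)
        n, d = x.shape[1], x.shape[2]
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / d ** 0.5               # A = QK^T / sqrt(d)
        mask = torch.triu(torch.ones(n, n, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))          # A_ij = -inf for j > i
        return self.o_proj(F.softmax(scores, dim=-1) @ v)

class DecoderBlock(nn.Module):
    """Pre-norm decoder block: residual attention sublayer, then residual feed-forward sublayer."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.attn = CausalSelfAttention(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))    # H = X + Attn(LN(X))
        x = x + self.ffn(self.ln2(x))     # X' = H + FFN(LN(H))
        return x

# Usage: stacking such blocks over token embeddings gives the decoder-only backbone.
x = torch.randn(2, 16, 64)                           # (batch, seq, d_model)
print(DecoderBlock(d_model=64, d_ff=256)(x).shape)   # torch.Size([2, 16, 64])
```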
2. Theoretical Properties and Universality
Decoder-only Transformers have been shown to be Turing complete under reasonable assumptions, such as hardmax attention and sufficient model width to provide "dead space" for encoding both the RNN hidden state and the input simultaneously. Specifically, a single-layer, single-head decoder-only transformer of sufficient width can simulate any rational-weight RNN or Wang B-machine; that is, the architecture's autoregressive constraint (left-to-right, write-once, no overwrites) is as expressive as unrestricted sequential computation (Roberts, 2023).
Formally, input/output equivalence is possible due to:
- Shared input/output embedding
- Linear agglomeration for packing multiple state vectors
- Cascade feed-forward networks for B-machine transition and halting logic
However, the practical challenge is learning the correct transition functions in a high-dimensional, non-compressible embedding space.
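The shared input/output embedding in the list above corresponds to standard weight tying between the token-embedding table and the output projection, which lets a generated token be re-embedded identically when it is fed back as input on the next step. A minimal sketch, with illustrative sizes not taken from Roberts (2023):

```python
import torch.nn as nn

vocab_size, d_model = 32000, 512                     # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)        # input embedding
lm_head = nn.Linear(d_model, vocab_size, bias=False) # output projection
lm_head.weight = embedding.weight                    # tie input and output embeddings

# With tied weights, the row used to score a token at the output is the same vector
# used to embed that token when it is fed back as input on the next autoregressive step.
```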
3. Advances in Masking, Position Encoding, and Extrapolation
Standard causal masking and RPE introduce two fundamental limitations: "attention sinks," because softmax normalization forces each query to distribute a full unit of attention over the visible context even when no token is relevant, and loss of absolute positional information, since relative encodings alone cannot distinguish positions in sequences of identical tokens. To address these:
- StableMask replaces the standard mask with a refined one that places decaying pseudo-attention values (controlled by a decay parameter) in the masked positions; these pseudo-scores share the softmax normalization without contributing to the output, and the resulting monotonic mask ratio encodes absolute position within the visible context (see the sketch below).
- Empirical results on WikiText-103, MiniPile, and downstream tasks show StableMask achieves consistent perplexity improvements and more robust extrapolation (Yin et al., 7 Feb 2024).
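A minimal sketch of the mechanism described above: decaying pseudo-scores occupy the masked positions and share the softmax normalization, but are zeroed before aggregating values. This illustrates the general idea rather than the exact StableMask formulation of Yin et al. (7 Feb 2024); the fixed decay value and the distance-based pseudo-score shape are assumptions.

```python
import torch
import torch.nn.functional as F

def pseudo_masked_attention(q, k, v, decay: float = 0.9):
    """Causal attention where masked (future) positions carry decaying pseudo-scores.

    The pseudo-scores enter the softmax denominator, absorbing probability mass that
    plain causal masking would force onto visible tokens, but they are zeroed before
    aggregating values, so no future content leaks into the output.
    """
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5                    # (n, n)
    future = torch.triu(torch.ones(n, n), diagonal=1).bool()

    # Pseudo-scores decay with distance past the diagonal (an illustrative choice).
    dist = torch.arange(n)[None, :] - torch.arange(n)[:, None]      # j - i
    pseudo = torch.log(torch.tensor(decay)) * dist.clamp(min=1).float()
    scores = torch.where(future, pseudo, scores)

    probs = F.softmax(scores, dim=-1)
    probs = probs.masked_fill(future, 0.0)    # only visible tokens contribute values
    # Rows no longer sum to 1: later query positions have fewer pseudo slots, so the
    # mass retained by visible tokens varies monotonically with position, encoding
    # absolute position within the visible context.
    return probs @ v

q = k = v = torch.randn(8, 16)
print(pseudo_masked_attention(q, k, v).shape)   # torch.Size([8, 16])
```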
Relative positional encoding (RPE; ALiBi, RoPE, etc.) remains standard, yet decoupled masking/position innovations enhance universality and extrapolative generalization.
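For reference, here is a minimal sketch of RoPE, the multiplicative relative-position modification mentioned above, using the split-half pairing convention common in open implementations; this is a generic illustration, not code from any paper cited here.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to a (seq, d) tensor of queries or keys (d even).

    Pairs of dimensions are rotated by position-dependent angles, so the dot product
    between a rotated query at position i and a rotated key at position j depends only
    on the offset i - j (a multiplicative relative position encoding).
    """
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (d/2,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Attention scores computed from rotated queries/keys depend only on relative offsets.
q, k = rope(torch.randn(8, 16)), rope(torch.randn(8, 16))
print((q @ k.T).shape)   # torch.Size([8, 8])
```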
4. Computational Optimization and Scaling
Autoregressive Transformers' core computational bottleneck is the quadratic cost of computing full self-attention over long sequences. Recent contributions target this with:
- Transformer-VQ: Exact softmax-based attention in linear time, using vector-quantized keys and a cache of compressed value statistics, reducing compute and memory from quadratic to linear in sequence length for a fixed codebook size and block length (Lingle, 2023).
- YOCO: Splitting the $L$ blocks into a self-decoder followed by a cross-decoder, caching global key/value pairs only once and reusing them via cross-attention in the later blocks, shrinking the key-value cache from one copy per layer to a single shared copy and supporting efficient prefill and million-token contexts with little accuracy loss (Sun et al., 8 May 2024).
- Dynamic Layer Selection: Uniform layer skipping and oracle-based per-sequence depth allocation; up to 76.2% of the time, tree-based or layer-skipped architectures match or outperform linear stacks, while requiring as little as 23.3% of the full inference cost with minimal accuracy degradation (D'Istria et al., 11 Nov 2024, Glavas et al., 26 Oct 2024).
- Token Omission Via Attention (TOVA): Greedy cache compression that discards the least-attended key-value pair under each query, achieving near-baseline performance with only $1/8$ of the original cache size and correspondingly higher decoding throughput (see the cache-eviction sketch after this list) (Oren et al., 11 Jan 2024).
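The cache-eviction sketch referenced in the TOVA item: a single-head, fixed-budget key-value cache that, at each decoding step, drops the cached entry receiving the lowest attention weight from the current query. Function and variable names are illustrative; this is a simplified rendering of the greedy policy, not the implementation of Oren et al. (11 Jan 2024).

```python
import torch
import torch.nn.functional as F

def decode_step_with_bounded_cache(q_t, k_t, v_t, k_cache, v_cache, budget: int):
    """One autoregressive step with a bounded KV cache.

    q_t, k_t, v_t: (d,) query/key/value for the current token.
    k_cache, v_cache: (m, d) cached keys/values from previous tokens.
    If the cache would exceed `budget`, greedily evict the entry that receives
    the lowest attention weight under the current query.
    """
    d = q_t.shape[-1]
    k_all = torch.cat([k_cache, k_t[None, :]], dim=0)       # (m + 1, d)
    v_all = torch.cat([v_cache, v_t[None, :]], dim=0)

    attn = F.softmax(k_all @ q_t / d ** 0.5, dim=0)          # (m + 1,)
    out = attn @ v_all                                       # (d,) attention output

    if k_all.shape[0] > budget:
        drop = torch.argmin(attn)                            # least-attended entry
        keep = torch.ones(k_all.shape[0], dtype=torch.bool)
        keep[drop] = False
        k_all, v_all = k_all[keep], v_all[keep]
    return out, k_all, v_all

# Usage: simulate 20 decoding steps with a budget of 8 cached tokens.
d, budget = 32, 8
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(20):
    q_t, k_t, v_t = torch.randn(d), torch.randn(d), torch.randn(d)
    out, k_cache, v_cache = decode_step_with_bounded_cache(q_t, k_t, v_t, k_cache, v_cache, budget)
print(k_cache.shape)   # torch.Size([8, 32]) -- the cache never exceeds the budget
```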
5. Structural Variants and Specialized Architectures
Recent works depart from monolithic decoder stacks to explore:
- TreeCoders: Arranging decoder blocks in a $k$-ary tree, with each token routed by learned classifiers down a single root-to-leaf path, so per-token depth grows logarithmically rather than linearly with the total number of blocks and only a small fraction of the network is activated per sequence (see the routing sketch after this list) (D'Istria et al., 11 Nov 2024).
- Compressed-depth Models: LinearGPT and ConvGPT compress the hidden-state dimension every few layers, keeping most representation power near the input and reducing parameter count and memory by 36% with negligible loss (Suresh et al., 22 Apr 2024).
- One-Layer Equivalences: A one-layer, single-head decoder-only transformer is formally equivalent to a two-layer RNN, enabling efficient abstract interpretation for robustness verification (ARC-Tran) and suggesting RNN-style reasoning for memory and efficiency analyses (Zhang et al., 27 May 2024).
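The routing sketch referenced in the TreeCoders item: a binary tree of decoder blocks in which a learned classifier at each internal node selects one child, so only a root-to-leaf path is executed. The binary branching, per-sequence (rather than per-token) routing, and mean-pooled router input are simplifying assumptions, not the exact TreeCoders design.

```python
import torch
import torch.nn as nn

class TreeRoutedDecoder(nn.Module):
    """A complete binary tree of decoder blocks; each input follows one root-to-leaf path,
    so only depth + 1 blocks are executed instead of the full stack."""
    def __init__(self, d_model: int, depth: int, block_factory):
        super().__init__()
        n_nodes = 2 ** (depth + 1) - 1                               # heap-indexed complete tree
        self.blocks = nn.ModuleList([block_factory() for _ in range(n_nodes)])
        self.routers = nn.ModuleList([nn.Linear(d_model, 2) for _ in range(2 ** depth - 1)])
        self.depth = depth

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (batch, seq, d_model)
        node = 0
        for _ in range(self.depth + 1):
            x = self.blocks[node](x)
            if 2 * node + 1 >= len(self.blocks):                     # leaf reached
                break
            # Route on the mean hidden state: pick the left (0) or right (1) child.
            choice = self.routers[node](x.mean(dim=(0, 1))).argmax().item()
            node = 2 * node + 1 + choice
        return x

# Usage, reusing the DecoderBlock from the Section 1 sketch:
tree = TreeRoutedDecoder(d_model=64, depth=3, block_factory=lambda: DecoderBlock(64, 256))
print(tree(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64]); 4 of 15 blocks executed
```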
6. Multimodal and Application-Driven Extensions
Decoder-only transformers are increasingly adapted for non-text tasks:
- DTrOCR: Decoder-only vision transformer for optical character recognition, concatenating patch embeddings and generated tokens in a flat sequence, leveraging GPT-2 initialization and masked self-attention for both visual and linguistic context, yielding state-of-the-art accuracy on English/Chinese handwritten, printed, and scene text tasks (see the sketch after this list) (Fujitake, 2023).
- RandAR: Visual generation in arbitrary token orders using interleaved position-instruction tokens, training on random permutations and supporting random-order generation, inpainting, outpainting, and parallel decoding with linear speedups and no quality loss (Pang et al., 2 Dec 2024).
- WiFiGPT: Application of standard LLaMA-based decoder-only Transformers, fine-tuned as next-token numeric regressors for WiFi-CSI/FTM/RSSI telemetry, achieving centimeter-level location accuracy with zero architectural modifications—demonstrating the architecture's expressivity and robustness in continuous regression domains (Bhatia et al., 16 May 2025).
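The sketch referenced in the DTrOCR item, illustrating the adaptation pattern these systems share: non-text inputs are embedded and concatenated with text-token embeddings into one flat sequence for a causal decoder stack. The patch-embedding layer, the sizes, and the reuse of the DecoderBlock from the Section 1 sketch are illustrative assumptions, not the DTrOCR configuration.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch to d_model."""
    def __init__(self, d_model: int, patch: int = 16, channels: int = 3):
        super().__init__()
        self.proj = nn.Conv2d(channels, d_model, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:            # img: (batch, C, H, W)
        return self.proj(img).flatten(2).transpose(1, 2)              # (batch, n_patches, d_model)

d_model, vocab = 64, 1000
patch_embed = PatchEmbed(d_model)
tok_embed = nn.Embedding(vocab, d_model)
decoder = nn.Sequential(*[DecoderBlock(d_model, 4 * d_model) for _ in range(2)])  # Section 1 sketch
lm_head = nn.Linear(d_model, vocab, bias=False)

img = torch.randn(1, 3, 32, 32)                         # -> 4 patches of 16x16
tokens = torch.randint(0, vocab, (1, 5))                # previously generated text tokens
seq = torch.cat([patch_embed(img), tok_embed(tokens)], dim=1)   # one flat causal sequence
logits = lm_head(decoder(seq))                          # next-token logits conditioned on image + text
print(logits.shape)                                     # torch.Size([1, 9, 1000])
```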
7. Interpretations, Limitations, and Emerging Directions
Decoder-only Transformers unify autoregressive sequence modeling with scalable, highly parallel architectures, yet several design frontiers remain:
- Absolute Position Encoding: StableMask and other mask-ratio innovations signal a shift toward richer position signals without resorting to absolute embedding tables (Yin et al., 7 Feb 2024).
- Memory and Throughput: Efficient key-value caching (YOCO, Transformer-VQ, TOVA), dynamic computation, and tree routing collectively enable million-token scaling and deployment in memory-constrained environments (Lingle, 2023, Sun et al., 8 May 2024, Oren et al., 11 Jan 2024, D'Istria et al., 11 Nov 2024).
- Robustness and Verification: RNN equivalence and abstract interpretation open the door to certifying robustness under unrestricted, length-varying input perturbations (Zhang et al., 27 May 2024).
- Sequence Adaptivity: Dynamic layer skipping and per-sample routing provide a path to compute-efficient, context- and task-adaptive inference (Glavas et al., 26 Oct 2024).
A plausible implication is that as sequence lengths, domains, and computational heterogeneity expand, decoder-only architectures will increasingly incorporate mask and routing innovations, continual memory compression, and hybrid computation strategies, while remaining rooted in the autoregressive self-attention backbone.
Key references:
- (Yin et al., 7 Feb 2024) (StableMask: causal masking and position encoding)
- (D'Istria et al., 11 Nov 2024) (TreeCoders: tree-structured decoders)
- (Roberts, 2023) (Turing completeness of decoder-only Transformers)
- (Sun et al., 8 May 2024) (YOCO: scalable caching)
- (Glavas et al., 26 Oct 2024) (Dynamic layer selection)
- (Lingle, 2023) (Transformer-VQ: linear-time exact attention)
- (Suresh et al., 22 Apr 2024) (Compressed-depth architectures)
- (Oren et al., 11 Jan 2024) (Transformers as multi-state RNNs and cache compression)
- (Fujitake, 2023) (DTrOCR: vision decoder-only OCR)
- (Pang et al., 2 Dec 2024) (RandAR: random-order visual generation)
- (Bhatia et al., 16 May 2025) (WiFiGPT: decoder-only regression for wireless telemetry)
- (Zhang et al., 27 May 2024) (RNN equivalence and certified robustness)