Decoder-Only Transformer
- Decoder-only transformers are neural architectures with unidirectional masked self-attention and feed-forward layers that generate outputs autoregressively.
- They use causal masking to ensure each token only attends to previous tokens, enabling Turing-completeness and streamlined sequential decision-making.
- These models underpin leading systems like GPT-x and are applied in language generation, vision OCR, speech recognition, and recommendation tasks.
A decoder-only transformer is a neural architecture comprising a stack of unidirectional (causally masked) self-attention and feed-forward layers, without any separate encoder or cross-attention mechanisms. At each timestep, the model processes a growing prefix of tokens, updating internal “key” and “value” caches for self-attention and generating outputs autoregressively. This paradigm underlies models such as GPT-x, DecoderTrack, and variants used extensively in language, vision, speech, recommendation, and sequential decision-making domains.
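The prefix-processing loop described above can be sketched in a few lines of NumPy. Everything here is a toy stand-in (random frozen weights, a single attention head, greedy argmax sampling, names like `decode` and `W_out` chosen for illustration); it only shows the mechanics of appending to the KV cache and generating autoregressively.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 50  # toy sizes; real models use thousands of dimensions

# Frozen random matrices standing in for trained weights (hypothetical).
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
W_out = rng.standard_normal((d, vocab)) / np.sqrt(d)
embed = rng.standard_normal((vocab, d))

def decode(prompt_ids, n_new):
    """Greedy autoregressive decoding with a growing KV cache."""
    ids = list(prompt_ids)
    n_prompt = len(ids)
    k_cache, v_cache = [], []            # the growing "hidden state"
    for step in range(n_prompt + n_new - 1):
        x = embed[ids[step]]             # embed the token at this position
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        k_cache.append(k)                # append-only cache: causal by construction
        v_cache.append(v)
        K, V = np.stack(k_cache), np.stack(v_cache)
        w = np.exp(q @ K.T / np.sqrt(d))
        out = (w / w.sum()) @ V          # softmax attention over the prefix
        logits = out @ W_out
        if step >= n_prompt - 1:         # past the prompt: emit a new token
            ids.append(int(np.argmax(logits)))
    return ids

out = decode([1, 2, 3], n_new=4)
print(len(out))  # 7 (3 prompt tokens + 4 generated)
```

Note that causality comes for free: each step's query only ever sees keys and values already in the cache.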
1. Formal Definition and Theoretical Foundations
A decoder-only transformer consists of $N$ identical blocks, each comprising masked multi-head self-attention and position-wise feed-forward sublayers, linked by residual connections and layer normalization. The masking ensures that each position $i$ can attend only to positions $j \le i$. For token embeddings $X \in \mathbb{R}^{n \times d}$, the attention mechanism is defined:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V,$$

where $Q = XW_Q$, $K = XW_K$, $V = XW_V$, and $M$ is the causal mask ($M_{ij} = 0$ for $j \le i$ and $-\infty$ otherwise).
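A minimal single-head NumPy sketch of this masked attention (unbatched, with illustrative weight matrices passed in explicitly) makes the causal mask concrete:

```python
import numpy as np

def causal_attention(X, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d_k) + M) V, with M = -inf strictly above the diagonal."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    n = X.shape[0]
    M = np.triu(np.full((n, n), -np.inf), k=1)  # forbid attending to future positions
    A = np.exp(scores + M)                      # exp(-inf) = 0: masked entries vanish
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
Y = causal_attention(X, Wq, Wk, Wv)
# Causality check: the first 3 output rows are unchanged if later tokens are removed.
print(np.allclose(Y[:3], causal_attention(X[:3], Wq, Wk, Wv)))  # True
```

The final check is the defining property of the causal mask: outputs for a prefix do not depend on tokens to its right.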
Roberts proved Turing-completeness of the decoder-only transformer under mild rational-weight and position-encoding assumptions, casting it as a causal “B-machine” capable of universal computation with an appropriate embedding and hard-attention mechanism (Roberts, 2023). For a single block, the self-attention calculation can be reformulated as a two-layer RNN, with one recurrence for the log-sum-exp accumulator and another for the weighted running sum (Zhang et al., 2024):

$$L_i = \log\!\left(e^{L_{i-1}} + e^{q^\top k_i}\right), \qquad a_i = \left(1 - \sigma\!\left(q^\top k_i - L_{i-1}\right)\right) a_{i-1} + \sigma\!\left(q^\top k_i - L_{i-1}\right) v_i,$$

where $\sigma$ is the logistic sigmoid and $a_n$ (the attention output) is given by $a_n = \sum_i e^{q^\top k_i} v_i \big/ \sum_i e^{q^\top k_i}$. More generally, the architecture is an instance of an unbounded Multi-State RNN, with the key-value (KV) cache serving as the growing hidden state (Oren et al., 2024).
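The recurrent view of softmax attention can be checked numerically: a gated running average driven by a log-sum-exp accumulator reproduces ordinary attention exactly. This is a self-contained verification with random toy vectors, not code from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 5
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# RNN form: one recurrence for the log-sum-exp L_i, one for the running average a_i.
L, a = q @ K[0], V[0].copy()
for i in range(1, n):
    g = sigmoid(q @ K[i] - L)      # gate = share of attention mass the new token gets
    a = (1 - g) * a + g * V[i]
    L = np.logaddexp(L, q @ K[i])  # update the log-sum-exp accumulator

# Reference: ordinary softmax attention over the full prefix.
s = q @ K.T
w = np.exp(s - s.max())
ref = (w / w.sum()) @ V
print(np.allclose(a, ref))  # True
```

Because the state $(L, a)$ has fixed size per query, this is exactly the constant-state-per-step recurrence that the RNN correspondence exploits.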
2. Architectural Variants and Masking Schemes
Several architectural innovations and refinements have been proposed:
- Standard Decoder-Only Stack: Repeats masked multi-head self-attention followed by a position-wise FFN for $N$ blocks, using learned or relative positional encodings (RoPE, ALiBi, T5 relative bias).
- Parallel and Compressed Variants: ParallelGPT splits the stack into parallel branches; LinearGPT and ConvGPT compress the width of the layers linearly or via convolution, yielding smaller, faster models with comparable generation quality (Suresh et al., 2024).
- StableMask: The standard causal softmax forces non-zero attention on every prior token and a row sum of exactly $1$, even when attending to the prefix is undesirable. StableMask relaxes this constraint using pseudo-attention slots, allowing attention mass to be dumped outside the actual prefix, breaking right-stochasticity and permitting universal absolute position encoding. This modification alleviates excessive attention bias and supports efficient extrapolation (Yin et al., 2024). The attention weights take the form

$$A_{ij} = \frac{\mathbb{1}[j \le i]\, e^{s_{ij}}}{\sum_{k \le i} e^{s_{ik}} + \sum_{k > i} e^{p_{ik}}},$$

where $\mathbb{1}[j \le i]$ is the causal indicator and $p_{ik}$ defines the pseudo slots.
- Memory-Efficient Architectures: YOCO decouples the stack into a memory-efficient self-decoder (constant-size KV cache) and a cross-decoder (global cross-attention on a single cached memory), achieving memory linear in sequence length ($O(N)$ rather than $O(NL)$ across layers) while recovering full expressivity and global context (Sun et al., 2024).
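The pseudo-slot idea behind StableMask can be sketched with a toy normalizer that reserves softmax mass outside the real prefix. This is a schematic illustration of the mechanism, not the paper's exact parameterization; `pseudo` here is simply a matrix of illustrative pseudo-slot scores.

```python
import numpy as np

def stablemask_attention(scores, pseudo):
    """Each row normalizes over real prefix scores PLUS pseudo-slot scores for the
    masked future positions, so a row may keep less than unit attention mass."""
    n = scores.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))          # 1[j <= i]
    num = np.where(causal, np.exp(scores), 0.0)
    pseudo_mass = np.where(~causal, np.exp(pseudo), 0.0).sum(-1, keepdims=True)
    denom = num.sum(-1, keepdims=True) + pseudo_mass
    return num / denom   # surplus mass is "parked" in the pseudo slots

rng = np.random.default_rng(2)
s = rng.standard_normal((4, 4))
A = stablemask_attention(s, pseudo=np.full((4, 4), -1.0))
print(A.sum(-1))  # rows sum to < 1, except the last row (no masked slots remain)
```

Early rows, which have many masked positions, shed the most mass; this is the sense in which the right-stochasticity of standard causal softmax is broken.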
3. Computational and Memory Characteristics
In standard form, the main computational bottleneck is quadratic: $O(N^2 L d)$ attention compute, plus $O(N L d)$ KV-cache memory, for sequence length $N$, $L$ layers, and model dimension $d$.
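A back-of-envelope calculation makes these scalings tangible. The configuration below uses GPT-2-small-like dimensions; the constant factors (two matmuls of roughly $2N^2d$ FLOPs each per layer, fp16 keys and values) are simplified illustrative accounting, not exact profiler numbers.

```python
# Rough attention cost for a GPT-2-small-like config (illustrative accounting).
N, L, d = 2048, 12, 768                 # sequence length, layers, model width

attn_flops = 2 * 2 * N * N * d * L      # QK^T plus AV: ~2*N^2*d FLOPs each, per layer
kv_cache_bytes = 2 * N * d * L * 2      # keys + values, fp16 (2 bytes), all layers

print(f"attention FLOPs ~{attn_flops / 1e9:.1f} GFLOP")   # ~154.6 GFLOP
print(f"KV cache ~{kv_cache_bytes / 2**20:.0f} MiB")      # ~72 MiB
```

Doubling $N$ quadruples the attention FLOPs but only doubles the cache, which is why compute and memory mitigation strategies differ.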
Methods for mitigation:
- KV Cache Compression: TOVA dynamically evicts the key-value pair in the cache with the lowest attention score, permitting a bounded cache size and transforming the model into a finite Multi-State RNN. TOVA closely matches full-context model performance across language modeling, summarization, QA, and generation benchmarks at $1/8$ the memory footprint, yielding throughput improvements of $4\times$ or more (Oren et al., 2024).
- Vector-Quantized Attention: Transformer-VQ factorizes dense self-attention using a vector-quantized key codebook, permitting blockwise linear-time ($O(N)$) computation while maintaining high generation quality (Lingle, 2023).
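A single step of TOVA-style eviction can be sketched as follows. This is a schematic single-head toy (the function name `tova_step` and renormalization after eviction are illustrative choices, not the paper's exact procedure): append the new KV pair, then drop the least-attended entry when the cache overflows.

```python
import numpy as np

def tova_step(q, k_cache, v_cache, k_new, v_new, max_size):
    """One decoding step with TOVA-style eviction: append the new KV pair, and if
    the cache exceeds max_size, drop the entry with the lowest attention weight."""
    k_cache = np.vstack([k_cache, k_new])
    v_cache = np.vstack([v_cache, v_new])
    s = q @ k_cache.T
    attn = np.exp(s - s.max())
    attn /= attn.sum()
    if len(k_cache) > max_size:
        drop = int(np.argmin(attn))              # least-attended token leaves the state
        keep = np.arange(len(k_cache)) != drop
        k_cache, v_cache = k_cache[keep], v_cache[keep]
        attn = attn[keep] / attn[keep].sum()     # renormalize over survivors
    return attn @ v_cache, k_cache, v_cache

rng = np.random.default_rng(3)
d = 8
k_cache = rng.standard_normal((4, d))
v_cache = rng.standard_normal((4, d))
out, k_cache, v_cache = tova_step(rng.standard_normal(d), k_cache, v_cache,
                                  rng.standard_normal(d), rng.standard_normal(d),
                                  max_size=4)
print(k_cache.shape)  # (4, 8): the multi-state size stays bounded
```

The bounded cache is exactly the "finite Multi-State RNN" reading: the hidden state stops growing once `max_size` is reached.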
4. Applications Across Domains
Decoder-only transformers have become the backbone in a wide array of domains:
- Language Modeling and Generation: The GPT-x series, and the dominant trend in modern LLMs, use decoder-only stacks for state-of-the-art performance in domain-general generation, code synthesis, long-context reasoning, and retrieval-augmented generation.
- Vision and Multimodal: DTrOCR and GraDeT-HTR demonstrate decoder-only architectures applied to vision (OCR, handwritten text recognition) by mapping images to patch tokens, using the transformer to autoregressively produce character or grapheme sequences; these approaches outperform previous encoder-decoder paradigms in English, Chinese, and Bengali (Fujitake, 2023; Hasan et al., 2025).
- Speech: Unified spoken LLMs discretize acoustic features into tokens and train decoder-only stacks for speech recognition, translation, and captioning, with losses tailored to the noise/discreteness of speech units (e.g., Smoothed Label Distillation, SLD) (Chen et al., 2023).
- Recommendation Systems: CADET applies a decoder-only transformer to sequential CTR prediction with major engineering advances (context-conditioned heads, gated attention, timestamp-based RoPE, and custom masking) to serve high-throughput online recommendation at LinkedIn, outperforming hybrid DLRM+encoder benchmarks (Pardoe et al., 2026).
- Computer Vision Detection and Tracking: D²ETR removes the encoder from DETR, using decoder-only cross-scale attention for efficient object detection competitive with or exceeding encoder-decoder variants at a fraction of the cost (Lin et al., 2022). DecoderTrack uses a decoder-only transformer for multi-object tracking with further architectural and memory optimizations [(Pan et al., 2023)*; see note].
- Simultaneous Translation: The Decoder-only Streaming Transformer manipulates source/target prefix concatenation and customized positional encoding to achieve competitive BLEU/latency trade-offs in translation streaming via its Streaming Self-Attention (SSA) mechanism (Guo et al., 2024).
| Domain | Application | Key Techniques / Benefits |
|---|---|---|
| Language | LLMs, code gen | GPT-2/3/4, autoregressive prediction |
| Vision | OCR, detection | Patch tokenization, decoder-only object queries, D²ETR |
| Speech | ASR, multitask | Discrete tokenization, task-adaptive loss (SLD) |
| Recommendation | CTR prediction | Context-tower heads, timestamp RoPE, high-throughput kernels |
| Multimodal | OCR, HTR, translation | End-to-end image/patch to text, graph-aware tokenization |
*(Pan et al., 2023): Summary info based on reference to DecoderTrack, abstract, and context notes.
5. Advanced Inference and Training Strategies
The decoder-only paradigm supports a range of inference improvements:
- Direct Multi-Token Decoding (DMTD): Exploits the specialization of late transformer layers (“decoding layers”) to produce multiple tokens per forward pass, amortizing the early/middle layers across several tokens. This achieves a substantial decoding speedup with minor quality loss, parameterized by cycle length and decoding-layer count (Luo et al., 2025).
- Streaming and Policy-Learning: DST for simultaneous translation continually appends new source tokens to the prefix buffer, using separate source/target positional encodings to avoid recomputation, and leverages the SSA mechanism to dynamically decide “READ” vs “WRITE” at each step (see BLEU vs average lagging, (Guo et al., 2024)).
- Robustness Certification: ARC-Tran leverages the equivalence between a one-layer decoder-only transformer and a two-layer RNN to enable interval-bound propagation and abstract interpretation for certifying robustness to arbitrary input perturbations (Zhang et al., 2024). This approach subsumes prior LSTM-centric verification methods and handles non-length-preserving string transformations.
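The amortization argument behind DMTD can be made concrete with a hypothetical cost model (this accounting is an assumption for illustration, not the paper's measured speedup): early layers run once per cycle of `cycle` tokens, while the last `decoding_layers` run for every token.

```python
# Hypothetical per-token cost model for DMTD-style multi-token decoding.
# Assumption (illustrative only): layer cost is uniform, early/middle layers run
# once per cycle of `cycle` tokens, decoding layers run once per token.
def dmtd_speedup(total_layers, decoding_layers, cycle):
    per_token = (total_layers - decoding_layers) / cycle + decoding_layers
    return total_layers / per_token

print(round(dmtd_speedup(total_layers=32, decoding_layers=8, cycle=4), 2))  # 2.29
```

Under this model, longer cycles or fewer decoding layers raise the speedup, at the cost of generating more tokens without refreshing the early-layer representations.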
6. Limitations, Open Challenges, and Future Directions
While decoder-only transformers are theoretically universal and empirically dominant, limitations remain:
- Memory and Context Bottlenecks: Unbounded KV caches cannot be realized in hardware, so practical deployments require well-designed cache compression, windowing, or architectural variants (see TOVA, YOCO).
- Attention and Masking Biases: Causal softmax mandates nonzero attention to earlier tokens, causing spurious allocation especially at short prefixes; masking or pseudo-mass (e.g., StableMask) can relieve this bias.
- Representation Universality: Standard RoPE and ALiBi encode only relative position, limiting universality on position-critical tasks; architectural remedies such as pseudo-slots (StableMask) or timestamp-based RoPE address this.
- Specialization Effects: Layer specialization means late layers dominate next-token prediction; inference strategies such as DMTD leverage this for faster decoding at a controllable quality-speed trade-off.
- Parameter Efficiency vs. Turing Universality: While formal models are Turing-complete, practical architectures are vastly overparameterized relative to the theoretical minimum; closing this gap is an ongoing research direction (Roberts, 2023).
Potential avenues include learned or dynamic memory agents, further compressive or content-adaptive attention mechanisms, improved absolute position modeling, and rigorous understanding of the interface between architecture, optimization, and in-situ algorithmic generalization.
References
- Roberts (2023). "How Powerful are Decoder-Only Transformer Neural Models?"
- Oren et al. (2024). "Transformers are Multi-State RNNs"
- Zhang et al. (2024). "A One-Layer Decoder-Only Transformer is a Two-Layer RNN: With an Application to Certified Robustness"
- Yin et al. (2024). "StableMask: Refining Causal Masking in Decoder-only Transformer"
- Sun et al. (2024). "You Only Cache Once: Decoder-Decoder Architectures for LLMs"
- Suresh et al. (2024). "Towards smaller, faster decoder-only transformers: Architectural variants and their implications"
- Lingle (2023). "Transformer-VQ: Linear-Time Transformers via Vector Quantization"
- Guo et al. (2024). "Decoder-only Streaming Transformer for Simultaneous Translation"
- Pardoe et al. (2026). "CADET: Context-Conditioned Ads CTR Prediction With a Decoder-Only Transformer"
- Fujitake (2023). "DTrOCR: Decoder-only Transformer for Optical Character Recognition"
- Hasan et al. (2025). "GraDeT-HTR: A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer"
- Chen et al. (2023). "Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR"