Decoder-Only Transformers
- Decoder-only Transformers are a neural architecture defined by a single stack of masked self-attention and feed-forward layers operating in a strictly autoregressive, left-to-right manner.
- They dispense with explicit encoder modules and instead enforce causal masking, which confines each token’s attention to preceding tokens only, ensuring strict temporal order.
- Variants like p-gpt, YOCO, and Transformer-VQ enhance efficiency and memory management, enabling scalability for diverse, high-performance generative tasks.
A decoder-only Transformer is a neural architecture that generates output sequences by stacking layers of masked multi-head self-attention and position-wise feed-forward networks, operating in a strictly left-to-right autoregressive fashion. The model omits any explicit encoder subnetwork and all cross-attention modules. Instead, all information propagation, context integration, and representation transformation are executed within a single monolithic stack of decoder blocks, each employing a causal (lower-triangular) attention mask that restricts every token to attending only over its preceding context. This causal structure underpins the autoregressive language-modeling regime foundational to GPT-like LLMs and to diverse generative tasks in natural language processing, vision, and beyond.
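The left-to-right autoregressive regime described above reduces to a simple loop: predict a distribution over the next token from the prefix generated so far, pick a token, append it, repeat. A minimal sketch, in which `toy_lm` is a hypothetical stand-in for the full decoder stack (a real model would return logits from the stacked blocks):

```python
import numpy as np

def toy_lm(tokens, vocab_size=16, seed=0):
    """Hypothetical stand-in for a decoder-only LM: maps a token prefix
    to next-token logits. A real model would run the decoder stack."""
    rng = np.random.default_rng(seed + len(tokens))
    return rng.normal(size=vocab_size)

def generate(prompt, steps, eos=0):
    """Strictly left-to-right autoregressive decoding: each new token is
    predicted from the tokens emitted so far, then appended."""
    tokens = list(prompt)
    for _ in range(steps):
        logits = toy_lm(tokens)
        nxt = int(np.argmax(logits))  # greedy choice; sampling also common
        tokens.append(nxt)
        if nxt == eos:                # stop early on end-of-sequence token
            break
    return tokens

out = generate([3, 7], steps=5)
```

The essential point is that each iteration conditions only on already-generated tokens, which is exactly the constraint the causal mask enforces inside the network.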
1. Architectural Foundations and Variants
The canonical decoder-only Transformer builds upon the original Transformer decoder module: each block contains masked multi-head self-attention (MHSA) and a position-wise feed-forward network (FFN), with layer normalization and residual connections. The input to each layer consists of a sum of token and positional embeddings. The core difference from an encoder-decoder architecture is the absence of a cross-attention sub-layer and the exclusive use of lower-triangular masks within self-attention, enforcing strict causality.
Formally, for an input sequence of token embeddings $X = (x_1, \dots, x_n) \in \mathbb{R}^{n \times d}$, each block computes

$$\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V, \qquad Q = XW_Q,\; K = XW_K,\; V = XW_V,$$

where $M$ is a causal mask ($M_{ij} = 0$ for $j \le i$, $M_{ij} = -\infty$ otherwise, for each position $i$). Outputs of all heads are concatenated and projected, followed by a residual pathway and processed through the FFN:

$$h' = \mathrm{LayerNorm}\big(X + \mathrm{MHSA}(X)\big), \qquad h'' = \mathrm{LayerNorm}\big(h' + \mathrm{FFN}(h')\big).$$
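The block just described can be sketched directly in NumPy. A single attention head stands in for the multi-head case (the multi-head version concatenates several such maps before the output projection), and all weights here are illustrative random matrices:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def decoder_block(X, Wq, Wk, Wv, Wo, W1, W2):
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # causal (lower-triangular) mask: position i may only see j <= i
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf
    attn = softmax(scores) @ V @ Wo
    h = layer_norm(X + attn)             # residual + norm
    ffn = np.maximum(h @ W1, 0.0) @ W2   # position-wise FFN (ReLU)
    return layer_norm(h + ffn)           # second residual + norm

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 16
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
W1, W2 = rng.normal(size=(d, d_ff)) * 0.1, rng.normal(size=(d_ff, d)) * 0.1
Y = decoder_block(X, Wq, Wk, Wv, Wo, W1, W2)

# causality check: perturbing only the last token leaves earlier outputs intact
X2 = X.copy(); X2[-1] += 1.0
Y2 = decoder_block(X2, Wq, Wk, Wv, Wo, W1, W2)
```

Because every operation after attention is position-wise, the mask alone guarantees that output $i$ depends only on inputs $1, \dots, i$.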
The design is highly modular, facilitating stacks of such blocks. Several architectural variants further adapt and extend the standard block:
- Parallel GPT (p-gpt): Runs two parallel streams of sub-decoders and merges their outputs by learned weights.
- Linear and Conv Compressed GPT: Use intermediate linear or convolutional downsampling between block groups to reduce parameter count and sequence length, respectively.
- Decoder–Decoder Architectures (YOCO): Split the stack into self-decoder and cross-decoder halves, the latter attending only to a global cache of representations, thus reducing KV-cache proliferation (Sun et al., 2024).
These variants maintain generation quality while improving efficiency, memory, or parallelism (Suresh et al., 2024, Sun et al., 2024).
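As one illustration of the compressed variants above, downsampling the sequence between block groups can be sketched as a strided 1-D convolution over the sequence axis; kernel size, stride, and dimensions here are illustrative assumptions, not the published Conv Compressed GPT configuration:

```python
import numpy as np

def conv_downsample(X, W, stride=2):
    """Strided 1-D convolution over the sequence axis, shortening the
    sequence between decoder block groups (sketch; parameters are
    illustrative)."""
    n, d = X.shape
    k = W.shape[0]                # W: (kernel, d_in, d_out)
    out = []
    for start in range(0, n - k + 1, stride):
        window = X[start:start + k]                 # (k, d_in)
        out.append(np.einsum('kd,kde->e', window, W))
    return np.stack(out)

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))       # sequence of 8 positions, width 4
W = rng.normal(size=(2, 4, 6))    # kernel 2, d_in 4, d_out 6
Y = conv_downsample(X, W)         # sequence length halved: (4, 6)
```

Later block groups then attend over a sequence half as long, which is where the quadratic attention savings come from.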
2. Theoretical Expressivity and Universality
Decoder-only Transformers are proven to be Turing complete under rational weight and infinite-precision assumptions. The key elements of the proof are:
- The architecture can simulate any rational-weight RNN by encoding both prior state and new input tokens in parallel via attention and FFN sub-networks.
- With an appropriate choice of embedding and model dimension (determined by the embedding rank), all operations required for universal computation (state transition, input presentation, and halting detection) can be realized in a single-layer, single-head decoder with residual connections and hardmax attention.
- The architecture is computationally analogous to Hao Wang’s causal B-machine which reads any of its own history but only writes forward sequentially (Roberts, 2023).
However, the proof assumes hardmax attention and infinite precision, and does not address the efficiency or trainability under practical constraints.
3. Efficiency, Memory, and Dynamic Computation
Decoder-only Transformers are computationally costly at large context lengths due to the quadratic scaling of attention and the linear scaling of KV-cache with sequence length and layer count.
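The linear KV-cache growth can be made concrete with a back-of-the-envelope calculation; the configuration below uses illustrative, LLaMA-7B-like numbers (32 layers, 32 heads of dimension 128, fp16):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len,
                   batch=1, dtype_bytes=2):
    """Each layer stores one key and one value vector per head per token,
    so the cache grows linearly in both sequence length and layer count."""
    per_token = 2 * n_layers * n_heads * head_dim * dtype_bytes  # K and V
    return batch * seq_len * per_token

# Illustrative 7B-class config at a 4k context:
gb = kv_cache_bytes(32, 32, 128, 4096) / 2**30  # → 2.0 GiB
```

At 2.0 GiB for a single 4k-token sequence, a 128k context would require 64 GiB per sequence, which is why the cache-reduction techniques below matter.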
Modern research addresses these costs in multiple ways:
- Dynamic Layer Skipping and Early Exiting: Empirical studies show that uniform per-sequence layer skipping is significantly more robust than early exiting. Lightweight per-token controllers (e.g. single linear layers with Gumbel-Softmax gates) cannot dynamically exploit token-level difficulty due to the inaccessibility of relevant information in hidden states. Oracle per-sequence allocation achieves full-model quality while using only ≈23% of layers on average (Glavas et al., 2024).
- YOCO and Single-Shot Caching: Methods such as YOCO eliminate per-layer, per-token KV-caching in the latter half of the model by projecting the output of an initial self-decoder into a set of global caches for all remaining cross-attention layers. This reduces memory and substantially speeds up prefill, especially at very long context (Sun et al., 2024).
- Key-Value Cache Compression: Techniques such as TOVA (Token Omission Via Attention) compress the KV-cache by pruning low-attention states in a training-free, token-optimal manner, converting the transformer into a bounded multi-state RNN and achieving 4–5× memory savings with negligible drop in generative performance (Oren et al., 2024).
- Linear-Time Self-Attention: Transformer-VQ replaces pairwise attention with blockwise vector-quantized softmax attention and codebook-based compressive summaries, achieving true linear throughput beyond 100k tokens with maintained quality (Lingle, 2023).
- Direct Multi-Token Decoding (DMTD): Exploits the natural layer specialization in trained LLMs: after one full encoding pass, only the late "decoding" layers are run for blocks of output tokens, reducing the average number of layers per token and approximately doubling throughput with minimal accuracy loss (Luo et al., 2025).
Combinations of these strategies yield inference-time scaling suitable for billion-parameter models and context lengths exceeding 500k tokens.
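A minimal sketch of training-free cache compression in the spirit of TOVA: when the cache exceeds a fixed budget, drop the cached state that received the lowest attention weight at the current decoding step, so the cache behaves as the bounded state of a multi-state RNN. The published policy may differ in detail:

```python
import numpy as np

def prune_cache(keys, values, attn_weights, max_states):
    """Token omission via attention (sketch): evict the cached state with
    the lowest attention weight once the cache exceeds its budget."""
    if keys.shape[0] <= max_states:
        return keys, values                  # within budget: no-op
    drop = int(np.argmin(attn_weights))      # least-attended cached state
    keep = np.ones(keys.shape[0], dtype=bool)
    keep[drop] = False
    return keys[keep], values[keep]

keys = np.arange(10.0).reshape(5, 2)         # toy cache of 5 states
values = keys + 100.0
attn = np.array([0.30, 0.20, 0.05, 0.25, 0.20])
k2, v2 = prune_cache(keys, values, attn, max_states=4)
```

Applied at every decoding step, this keeps memory constant in sequence length while retaining the states the model actually attends to.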
4. Foundations in Recurrent and Memory-Augmented Networks
Decoder-only Transformers can be interpreted as unbounded multi-state RNNs: each layer’s key-value cache acts as a growing state matrix, storing the representations of all past tokens. When the cache is bounded (via truncation or dynamic compression), the network transitions to a bounded multi-state RNN. This view illuminates the tradeoff between computational universality and tractable hardware implementation (Oren et al., 2024).
Further, the causal masking and autoregressive forward path induces a strict temporal ordering that aligns closely with classical left-to-right RNN sequence models, but with exponentially larger effective state due to multi-headed self-attention. This vantage aligns decoder-only Transformers with the theoretical lineage of B-machines and universal sequential computation (Roberts, 2023).
5. Applications Across Modalities and Domains
Although originally developed for text, decoder-only architectures now underpin state-of-the-art models in domains including:
- Natural Language Generation and Summarization: Used in OPT, LLaMA, and instruction-tuned variants for NLG, with empirical studies confirming robustness to aggressive layer-skipping (Glavas et al., 2024).
- Vision: Decoder-only Visual Generation in random order (RandAR) enables AR image generation, inpainting, outpainting, and scalable parallel decoding, outperforming raster-order models in both speed and certain capability metrics (Pang et al., 2024).
- Structured and Multimodal Data: Causal transformers synthesize privacy-preserving and utility-preserving synthetic EHRs (SynEHRgy) by autoregressively modeling highly structured, mixed-type records using fine-grained tokenization (Karami et al., 2024); similarly, DTrOCR achieves state-of-the-art OCR as a pure decoder-only network (Fujitake, 2023).
- Machine Translation: The Decoder-Only Streaming Transformer for SiMT enables state-of-the-art simultaneous translation using a streaming self-attention module and dual positional streams for source and target, demonstrating strong BLEU and latency metrics (Guo et al., 2024).
- Tracking and Structured Reasoning: Decoder-only, DETR-style architectures (e.g., DecoderTrack, DecoderTrack+) for object tracking employ only a decoder block over convolutionally extracted features, with optional external memory to track long-term cues and varying attention refinements (Pan et al., 2023).
- Scalable Conditional Computation: TreeCoders generalize the linear stack to a sparse k-ary tree of decoder blocks with externally trained selectors, exponentially expanding model width while ensuring that only the blocks along a single root-to-leaf path are active per run, yielding improved conditional computation and scaling (D'Istria et al., 2024).
6. Limitations, Expressivity, and Recent Extensions
While decoder-only Transformers are universal under idealized assumptions, in realistic hardware and data regimes several limitations and open challenges persist:
- Expressivity Constraints: For certain sequence-to-token functions requiring super-quadratic computation or arbitrary prefix recomputation, encoder-only models with full self-attention can achieve tasks of strictly greater complexity (e.g., Count3), whereas decoder-only models cannot, unless depth increases with input length. Thus, the function classes expressible by decoder-only and encoder-only architectures are fundamentally incomparable (Ewer et al., 2024).
- Causal Masking Limitations: Standard causal masking and RPE-based positional encoding restrict the ability to represent absolute position, impairing universal approximation for position-sensitive functions. StableMask, a parameter-free modified masking mechanism, circumvents this restriction and empirically yields improved perplexity, scaling, and extrapolation (Yin et al., 2024).
- Per-token Adaptive Computation: Lightweight per-token skip controllers are unable to use hidden state information beyond simple average skip rates (Glavas et al., 2024).
- Memory and Efficiency Bottlenecks: Despite improvements, the need to retain large caches and execute multi-block prefill can still dominate decoding costs at trillion-token scale or high batch throughput.
- Hardware Implementation Tradeoffs: Specialized variants (YOCO, Transformer-VQ, TOVA) and parallel computation schemes (p-gpt, Conv/LinearGPT, TreeCoders) each require architectural and training adaptations. Their performance may be platform or batch-size-dependent, and scaling to the largest LLMs continues to expose new bottlenecks (Suresh et al., 2024, Sun et al., 2024, Lingle, 2023).
Research is active on further dynamic computation strategies, blockwise/conditional execution, memory system optimizations, attention kernel modifications, and integration with retrieval and multimodal modules.
7. Future Directions and Open Research
Decoder-only Transformers remain central to the evolution of LLMs and autoregressive sequence modeling. Ongoing and anticipated areas of exploration include:
- Advanced dynamic inference and sparse activation schemes: Tree-structured (TreeCoders) and mixture-of-experts (MoE) routing, per-sequence conditional computation, and dynamic layer activation (D'Istria et al., 2024, Glavas et al., 2024).
- Further memory efficiency gains: Enhanced compressive KV-caching, learned token selection, and integration with streaming and chunk-wise approaches.
- Universal and multimodal models: Extending vision–language integration within a pure causal decoding regime, exploiting patch and vector-quantized tokenization (Karami et al., 2024, Pang et al., 2024).
- Overcoming limitations in absolute position encoding and extrapolation: Parameter-free masking, hybrid positional encodings (StableMask, adaptive APE/RPE blends) (Yin et al., 2024).
- Expressivity and theoretical boundaries: Deeper understanding of architectural trade-offs, function class boundaries, and the minimal sufficient mechanisms for universal computation and super-quadratic token-level operations (Ewer et al., 2024).
- Hardware/software co-design: Kernel fusion, batch/sequence parallelism, and cache optimization tailored to emerging accelerator architectures (Sun et al., 2024, Suresh et al., 2024).
Decoder-only models continue to unify and generalize a wide spectrum of generative modeling techniques. Their practical and theoretical boundaries are, and will remain, a focus of intense investigation, informing both the scalable engineering of next-generation LLMs and the deeper theory of sequence computation.