Decoder-Only Transformer Architectures
- Decoder-only Transformer architectures are neural sequence models that utilize masked self-attention and feed-forward layers to generate outputs in an autoregressive manner.
- They enhance efficiency with methods such as key/value caching, dynamic layer selection, and sparse activation, which improve both training and inference performance.
- These models are applied in language modeling, code generation, vision tasks, and streaming translation while balancing trade-offs between scalability and long-context reasoning.
A decoder-only Transformer is a neural sequence model in which all input tokens are processed by a stack of masked self-attention layers and position-wise feed-forward networks under an autoregressive causal constraint: each position attends only to earlier positions in the sequence. The architecture originated as a simplification of the standard encoder–decoder Transformer and now unifies tasks such as language modeling, code generation, vision, and streaming translation within a single autoregressive stack. More sophisticated variants introduce architectural innovations that improve computational efficiency, scalability, and flexibility across model families.
1. Architectural Foundations
The canonical decoder-only Transformer comprises $L$ identical blocks, each with multi-head self-attention (masked for causality) and a feed-forward network, plus LayerNorm/residual connections. Input tokens are embedded, augmented with positional or relative encodings, and propagated sequentially through the stack. Causal masking enforces that each query at position $i$ attends only to keys/values at positions $j \le i$, implemented by masking the strict upper triangle of the attention-score matrix (setting those entries to $-\infty$) so that the softmax assigns them zero weight (Ewer et al., 2024, Zhang et al., 30 Oct 2025).
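As a concrete illustration of the masking just described, the following minimal PyTorch sketch (generic code, not taken from any of the cited works) builds the additive causal mask and applies it before the softmax:

```python
import torch

def causal_attention_weights(q, k):
    """q, k: (batch, heads, n, d_head); returns (batch, heads, n, n) causal weights."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    n = scores.size(-1)
    # Entries strictly above the diagonal become -inf, so the softmax assigns
    # them zero weight: position i can only attend to positions j <= i.
    mask = torch.triu(torch.full((n, n), float("-inf"), device=scores.device), diagonal=1)
    return torch.softmax(scores + mask, dim=-1)
```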
Each layer computes attention for token $t$ as

$$\mathrm{Attn}(x_t) \;=\; \sum_{s \le t} \frac{\exp\!\big(q_t^{\top} k_s/\sqrt{d}\big)}{\sum_{s' \le t} \exp\!\big(q_t^{\top} k_{s'}/\sqrt{d}\big)}\, v_s, \qquad q_t = W_Q x_t,\;\; k_s = W_K x_s,\;\; v_s = W_V x_s.$$
Key/Value caching enables fast autoregressive inference: at generation step $t$, the model reuses the cached prefix keys and values $K_{<t}, V_{<t}$; only $q_t, k_t, v_t$ and the new query's attention over the cached prefix are computed. The cost of generating each new token is $O\!\big(L\,(n\,d + d^{2})\big)$ for sequence length $n$, embedding dimension $d$, and depth $L$ (Ewer et al., 2024, Zhang et al., 30 Oct 2025).
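The cached decoding step can be made concrete with a minimal single-head sketch (hypothetical projection matrices `W_q`, `W_k`, `W_v`; no batching or multi-head logic), showing that only the newest token's projections are computed while the prefix keys/values are reused:

```python
import torch

def decode_step(x_t, cache, W_q, W_k, W_v):
    """One autoregressive step with a key/value cache.
    x_t: (d,) embedding of the newest token; cache holds 'K', 'V' of shape (t-1, d)
    for the prefix; W_q, W_k, W_v: (d, d) projection matrices."""
    q_t, k_t, v_t = x_t @ W_q, x_t @ W_k, x_t @ W_v
    cache["K"] = torch.cat([cache["K"], k_t[None]], dim=0)  # append; never recompute the prefix
    cache["V"] = torch.cat([cache["V"], v_t[None]], dim=0)
    scores = cache["K"] @ q_t / q_t.size(-1) ** 0.5          # (t,) scores over the prefix
    attn = torch.softmax(scores, dim=-1)
    return attn @ cache["V"], cache                          # new context vector, updated cache

# Usage: initialize cache = {"K": torch.empty(0, d), "V": torch.empty(0, d)}.
```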
Pretraining employs causal language modeling (maximizing $\sum_{t}\log p_\theta(x_t \mid x_{<t})$ over all positions $t$). Position information is injected via absolute, rotary, or relative encodings; modern decoder stacks employ RMSNorm, SwiGLU activations, and weight tying for stable scaling.
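The causal language-modeling objective is typically computed by shifting targets one position to the left, as in this generic sketch (logits from any decoder-only model; not specific to the cited works):

```python
import torch.nn.functional as F

def causal_lm_loss(logits, input_ids):
    """logits: (B, T, vocab); input_ids: (B, T).
    Position t predicts token t+1, so targets are the inputs shifted left by one."""
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```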
2. Expressive Power, Universality, and Limitations
Decoder-only architectures excel at efficient autoregressive modeling, but their expressivity is constrained by causal accessibility. Theoretical lower bounds show that on certain algorithmic tasks, the required depth grows with the input length $n$ when width is fixed and causal masking is used, i.e., global table-based computations are not efficiently implementable (Ewer et al., 2024). Encoder-only stacks (ENTP), in contrast, admit constant-depth solutions for such problems, achieving superior length generalization and sample efficiency on addition, in-context learning, and algorithmic reasoning tasks.
In the context of universal approximation, vanilla causal masking combined with relative position encoding fails to encode absolute positions, resulting in non-universality for position-critical functions. StableMask refines the mask by introducing strictly upper-triangular pseudo-attention slots that soak up surplus attention mass, transforming the regular masked softmax into a form that encodes an absolute positional scalar per token, achieving universal approximation of sequence-to-sequence mappings dependent on position (Yin et al., 2024).
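The mechanism can be illustrated with a simplified sketch: replacing the $-\infty$ mask with a finite pseudo score lets masked slots absorb part of the softmax mass, and since row $i$ has $n-1-i$ such slots, the residual mass on real tokens varies with absolute position. This is only a schematic of the idea; the actual StableMask uses a specific decaying pseudo-mask rather than the single constant score assumed here.

```python
import torch

def soft_masked_attention(scores, pseudo_score=0.0):
    """Simplified illustration of pseudo-attention slots (not the exact
    StableMask formulation). Future positions receive a finite pseudo score
    instead of -inf, so they soak up part of the softmax mass; the amount
    absorbed depends on the row index, encoding absolute position."""
    n = scores.size(-1)
    causal = torch.ones(n, n, device=scores.device).tril().bool()
    mixed = torch.where(causal, scores, torch.full_like(scores, pseudo_score))
    weights = torch.softmax(mixed, dim=-1)
    return weights * causal  # pseudo slots are dropped; each row now sums to < 1
```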
Decoder-only models are thus well suited to (1) highly efficient streaming generation and (2) hardware- or memory-limited deployments, but they are limited when flexible global access or bidirectionality is essential.
3. Scaling Laws, Efficiency, and Model Variants
Empirical studies show that compute-optimal scaling exponents for test perplexity of decoder-only LLMs closely track those of encoder–decoder models; decoder-only architectures yield lower perplexity at a fixed parameter count, whereas encoder–decoder models attain similar quality with 20–30% less compute (Zhang et al., 30 Oct 2025).
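For orientation, compute-optimal analyses of this kind are usually fit with a Chinchilla-style parametric loss surface; the generic ansatz is shown below, and it is an assumption here that the cited study uses this exact parameterization:

$$\mathcal{L}(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6\,N D,$$

where $N$ is the parameter count, $D$ the number of training tokens, $C$ the training FLOPs, and $E$, $A$, $B$, $\alpha$, $\beta$ are fit separately per architecture family.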
Several innovations optimize decoder-only architectures:
- Parallel Towers (ParallelGPT): Stacks decoder blocks into parallel paths with an increased embedding dimension and sums their weighted output streams, enabling synchronous training and dynamic pruning of towers at inference (Suresh et al., 2024); a minimal sketch appears at the end of this subsection.
- Linear/Convolutional Compression (LinearGPT/ConvGPT): Reduce hidden dimension, and for ConvGPT, also pool sequence length via strided convolution after every few blocks. Achieves 36% parameter reduction and up to 20% accelerated training while maintaining near-baseline language modeling loss (Suresh et al., 2024).
- TreeCoders: Organize decoder blocks as a $k$-ary tree. Each token routes through transformer nodes via learned selectors, so only the blocks along a single root-to-leaf path are activated per token, reducing per-token compute from linear to logarithmic in the number of blocks and enabling sparse activation and parallelism (D'Istria et al., 2024).
- Transformer-VQ: Attains exact softmax-based dense self-attention in linear time per layer via vector-quantized keys and a block-recurrent stateful cache, scaling to very long sequences with substantial speedups at long contexts (Lingle, 2023).
- YOCO: Splits the decoder into a self-decoder stack (efficient attention, constant-size KV cache) and a cross-decoder (global attention over a single cached prefix), sharply reducing prefill memory and latency for 1M-token contexts while preserving full-context modeling capacity (Sun et al., 2024).
These modifications enhance resource efficiency, long-context scalability, distributed deployment, and sparse activation, while retaining or improving downstream performance in code, language, and vision domains.
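As referenced above, a minimal sketch of the parallel-towers idea follows (hypothetical class and parameter names; a conceptual approximation rather than the authors' implementation). A `TransformerEncoderLayer` with a causal mask behaves like a decoder-only block (masked self-attention plus feed-forward), which keeps the sketch short:

```python
import torch
import torch.nn as nn

class ParallelTowers(nn.Module):
    """Several shallow decoder stacks run in parallel on the same input;
    their outputs are combined with learned weights, and individual towers
    can be dropped at inference time."""
    def __init__(self, d_model=512, n_heads=8, depth_per_tower=4, n_towers=2):
        super().__init__()
        self.towers = nn.ModuleList([
            nn.ModuleList([
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                for _ in range(depth_per_tower)])
            for _ in range(n_towers)])
        self.tower_weights = nn.Parameter(torch.zeros(n_towers))

    def forward(self, x, active_towers=None):
        n = x.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf"), device=x.device), diagonal=1)
        active = active_towers if active_towers is not None else range(len(self.towers))
        weights = torch.softmax(self.tower_weights, dim=0)
        out = torch.zeros_like(x)
        for i in active:
            h = x
            for block in self.towers[i]:
                h = block(h, src_mask=causal)   # masked self-attention + FFN
            out = out + weights[i] * h          # weighted sum of tower outputs
        return out
```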
4. Dynamic Computation and Layer Selection
Dynamic layer selection aims to minimize inference cost by adapting model depth per token or per input sequence:
- Early Exiting: Run only the first $k$ layers and exit when a controller signals sufficiency. Can degrade hidden-state similarity and accuracy, particularly for shallow exits (Glavas et al., 2024).
- Layer Skipping (Uniform/Random): Execute a fixed number of layers, distributed uniformly or randomly throughout the stack (see the sketch below). Uniform skipping preserves hidden-state representations, with markedly higher cosine similarity to the full model than the roughly 0.7 observed for early exit; random skipping is deleterious if early layers are omitted.
- Per-sequence Oracle: Allocate model depth per input by solving a knapsack-style allocation problem that maximizes quality at a target average cost. Oracle allocation matches full-model performance while using only a fraction of the layers on average (Glavas et al., 2024).
Learned token-wise controllers fail to exploit hidden state information for skip decisions; token-agnostic skip probabilities yield equivalent results. Dynamic computation is most promising when allocation is sequence-aware, motivating efficient prompt-based routing.
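A minimal sketch contrasting the two static strategies above, assuming a generic list of decoder blocks that each map a hidden-state tensor to a tensor (hypothetical helper, not the cited paper's code):

```python
import torch.nn as nn

def run_with_budget(blocks: nn.ModuleList, x, budget: int, strategy: str = "uniform"):
    """Execute only `budget` of the available decoder blocks.
    - "early_exit": run the first `budget` blocks and stop.
    - "uniform":    run `budget` blocks spread evenly through the stack,
                    which tends to preserve hidden-state similarity better."""
    depth = len(blocks)
    if strategy == "early_exit":
        chosen = range(budget)
    elif strategy == "uniform":
        stride = depth / budget
        chosen = sorted({int(i * stride) for i in range(budget)})
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    for idx in chosen:
        x = blocks[idx](x)
    return x
```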
5. Multimodal, Vision, and Streaming Extensions
Decoder-only architectures generalize beyond pure language modeling:
- Optical Character Recognition (DTrOCR): Eliminates the vision encoder entirely: patch embeddings of images and text tokens are concatenated and processed as a single sequence through a GPT-2-style decoder stack, outperforming encoder–decoder and masked-LM baselines on scene, handwriting, and Chinese text recognition (Fujitake, 2023). The approach leverages generative language pretraining for cross-modal reasoning; a minimal sketch of the input construction follows this list.
- Streaming Translation (DST): Jointly processes source/target prefix tokens in a unified decoder-only stack equipped with distinct source/target position encodings and a Streaming Self-Attention (SSA) block. SSA computes a read/write policy determining sufficiency of the source prefix for generation, achieving state-of-the-art BLEU and average lagging in SiMT benchmarks (Guo et al., 2024).
- YOCO/Multimodal: YOCO allows multimodal extensions by stacking multiple self-decoders feeding a unified cross-decoder (Sun et al., 2024).
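A minimal sketch of the DTrOCR-style input construction referenced above (hypothetical class and parameter names; applying a single causal mask over the full concatenated sequence is an illustrative simplification, not necessarily the paper's exact scheme):

```python
import torch
import torch.nn as nn

class DecoderOnlyOCR(nn.Module):
    """Flattened image patches are linearly embedded, prepended to text-token
    embeddings, and the whole sequence flows through one causal decoder stack."""
    def __init__(self, vocab_size=50257, d_model=768, n_heads=12, depth=12,
                 patch_dim=16 * 16 * 3):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)   # stands in for a vision encoder
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(depth)])                        # causal mask turns these into decoder blocks
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, text_ids):
        # patches: (B, P, patch_dim); text_ids: (B, T)
        x = torch.cat([self.patch_embed(patches), self.token_embed(text_ids)], dim=1)
        n = x.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf"), device=x.device), diagonal=1)
        for block in self.blocks:
            x = block(x, src_mask=causal)
        return self.lm_head(x[:, patches.size(1):])        # logits over the text positions only
```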
Table: Empirical BLEU comparison on De→En SiMT (WMT15) (Guo et al., 2024)
| Method | Avg Lagging (AL) | BLEU |
|---|---|---|
| Wait-k | 3.85 | 26.86 |
| HMT | 4.74 | 30.29 |
| DST | 4.72 | 30.55 |
6. Mathematical Foundations and Interpretability
The Dual Filter framework formalizes the decoder-only Transformer as an iterative solution to a causal nonlinear MMSE prediction problem for hidden Markov models (Chang et al., 1 May 2025). By recasting sequential prediction as a backward stochastic optimal control problem, the approach yields a fixed-point equation on the space of probability measures; each decoder layer approximates one Picard iteration, paralleling masked self-attention, residual connections, and normalization. Correspondences in position encoding, normalization, and dimensionality make the equivalence precise. Numerical experiments with realistic hyperparameters demonstrate that the dual filter matches transformer inference at all scales.
TreeCoders further contribute to interpretability: learned hard routing provides coarse specialization of token paths, while sparse activation and distributed mapping enable branch-level analysis of that specialization (D'Istria et al., 2024).
7. Practical Applications, Trade-offs, and Future Directions
Decoder-only Transformers dominate large-scale language modeling owing to their autoregressive efficiency, unified architecture, and streamlined token-wise computation. However, extrapolation beyond the training context length leads to performance degradation ("locality decay"), and inference/training throughput can fall behind encoder–decoder models by a factor of two or more at scale (Zhang et al., 30 Oct 2025).
Trade-offs are notable:
- Autoregressive inference and KV cache reuse favor decoder-only for generation, but limit global context and efficiency for tasks requiring long-range or bidirectional modeling.
- Architectural variants (YOCO, VQ, compression, Tree, parallel towers) provide avenues for efficient adaptation, streaming, longer contexts, and modularity.
- Bidirectional prompt attention and hybrid modules improve finetuning and downstream generalization.
Recommendations for ongoing research:
- Hybrid designs incorporating lightweight encoders for prompts and deep decoders for generation.
- Further study of positional encoding, sparse attention mechanisms, and mixture objectives to extend context length capability.
- Examination of dynamic computation routing, multimodal stacks, and universal approximation strategies.
- Scaling beyond 8 B parameters and empirically validating compute-optimal frontiers.
Decoder-only Transformers remain central, both as practical deployed models and as theoretical exemplars of causal sequence modeling. Their continual evolution is marked by innovations bridging efficiency, expressivity, and adaptability across computational regimes and domains.