
Decoder-Only Transformer Architectures

Updated 27 January 2026
  • Decoder-only Transformer architectures are neural sequence models that utilize masked self-attention and feed-forward layers to generate outputs in an autoregressive manner.
  • They enhance efficiency with methods such as key/value caching, dynamic layer selection, and sparse activation, which improve both training and inference performance.
  • These models are applied in language modeling, code generation, vision tasks, and streaming translation while balancing trade-offs between scalability and long-context reasoning.

A decoder-only Transformer is a neural sequence model in which all input tokens are processed by a stack of masked self-attention layers and position-wise feed-forward networks under an autoregressive causal constraint: each position attends only to earlier positions in the sequence. The architecture originated as a simplification of the standard encoder–decoder Transformer and now subsumes language modeling, code generation, vision tasks, and streaming translation within a single autoregressive stack. Recent variants introduce architectural innovations to improve computational efficiency, scalability, and flexibility across model families.

1. Architectural Foundations

The canonical decoder-only Transformer comprises $L$ identical blocks, each with multi-head self-attention (masked for causality) and a feed-forward network, plus LayerNorm and residual connections. Input tokens $x_{1:T}$ are embedded, augmented with positional or relative encodings, and propagated through the stack. Causal masking enforces that the query at position $j$ attends only over keys/values from positions $i \leq j$, implemented by setting the upper triangle of the attention score matrix to $-\infty$ before the softmax (Ewer et al., 2024, Zhang et al., 30 Oct 2025).

Each layer $\ell$ and head $h$ computes attention for token $j$ as:

$$A_j^{(\ell,h)} = \sum_{i=1}^{j} \operatorname{softmax}_i\!\left(\frac{Q_j^{(\ell,h)} \left(K_i^{(\ell,h)}\right)^{\top}}{\sqrt{d}}\right) V_i^{(\ell,h)}$$
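
As a minimal sketch (NumPy, single head, no learned projections), the masked attention above can be computed as:

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head masked self-attention: position j attends only to i <= j."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) raw scores
    scores[np.triu_indices(T, k=1)] = -np.inf          # causal mask: block future keys
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return w @ v                                       # A_j = sum_i softmax_i(...) V_i

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
out = causal_attention(q, k, v)
print(out.shape)  # (5, 8)
```

Because position 0 can attend only to itself, its output equals `v[0]` exactly, which is a quick sanity check for the mask.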

Key/value caching enables fast autoregressive inference: at generation step $t$, the model reuses the cached $K_{1:t-1}, V_{1:t-1}$; only $Q_t$ and its attention over the prefix are computed. The cost of generating each new token is $O(nDL)$ for sequence length $n$, embedding dimension $D$, and depth $L$ (Ewer et al., 2024, Zhang et al., 30 Oct 2025).
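
A toy illustration of the cache, assuming a single head and no learned projections: only the new query is processed per step, while keys and values accumulate across steps:

```python
import numpy as np

class KVCache:
    """Append-only key/value cache for one attention head (a minimal sketch)."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, q_t, k_t, v_t):
        # Append this step's key/value, then attend q_t over the whole prefix.
        self.K = np.vstack([self.K, k_t])
        self.V = np.vstack([self.V, v_t])
        scores = self.K @ q_t / np.sqrt(len(q_t))      # (t,) — only the new query is computed
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.V

cache = KVCache(d=8)
rng = np.random.default_rng(0)
for _ in range(4):                                     # one call per generated token
    out = cache.step(*(rng.standard_normal(8) for _ in range(3)))
print(cache.K.shape)  # (4, 8)
```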

Pretraining employs causal language modeling (maximizing $\sum_t \log p_\theta(x_t \mid x_{<t})$ over all positions $t$). Position information is injected via absolute, rotary, or relative encodings; modern decoder stacks employ RMSNorm, SwiGLU activations, and weight tying for stable scaling.
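
The causal LM objective can be sketched as follows; `clm_loss` is a hypothetical helper that shifts targets left by one position and averages the negative log-likelihood:

```python
import numpy as np

def clm_loss(logits, tokens):
    """Causal LM objective: minimize -sum_t log p(x_t | x_{<t}).
    logits[t] predicts token t+1, so targets are tokens shifted left by one."""
    shift_logits, targets = logits[:-1], tokens[1:]
    # Log-softmax over the vocabulary dimension.
    logp = shift_logits - np.log(np.exp(shift_logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 100))   # (T, vocab)
tokens = rng.integers(0, 100, size=6)
print(clm_loss(logits, tokens))
```

With all-zero logits (a uniform predictive distribution), the loss is exactly `log(vocab_size)`, a useful baseline when debugging training setups.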

2. Expressive Power, Universality, and Limitations

Decoder-only architectures excel at efficient autoregressive modeling, but their expressivity is constrained by causal accessibility. Theoretical lower bounds show that on certain algorithmic tasks, the required depth grows with input length when width is fixed and causal masking is used; that is, global table-based computations are not efficiently implementable (Ewer et al., 2024). Encoder-only stacks (ENTP), in contrast, admit constant-depth solutions for such problems, achieving superior length generalization and sample efficiency on addition, in-context learning, and algorithmic reasoning.

In the context of universal approximation, vanilla causal masking combined with relative position encoding fails to encode absolute positions, resulting in non-universality for position-critical functions. StableMask refines the mask by introducing strictly upper-triangular pseudo-attention slots that soak up surplus attention mass, transforming the regular masked softmax into a form that encodes an absolute positional scalar per token, achieving universal approximation of sequence-to-sequence mappings dependent on position (Yin et al., 2024).
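
A stylized NumPy sketch of the idea (the constant `pseudo_score` stands in for StableMask's actual pseudo-attention parameterization, which this does not reproduce): giving future slots a finite score instead of $-\infty$ lets softmax mass leak into them, so the residual mass on real tokens depends on absolute position:

```python
import numpy as np

def stablemask_attention(scores, v, pseudo_score=0.0):
    """StableMask-style sketch: future (upper-triangular) slots receive a
    finite pseudo score rather than -inf, so the softmax mass on real
    tokens varies with absolute position. Illustrative only."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    s = np.where(future, pseudo_score, scores)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    real = np.where(future, 0.0, w)                    # zero out pseudo slots for the value sum
    real_mass = real.sum(axis=-1)                      # position-dependent scalar per token
    return real @ v, real_mass

v = np.eye(4)
out, mass = stablemask_attention(np.zeros((4, 4)), v)
print(mass)  # with uniform scores: [0.25 0.5 0.75 1.], i.e. mass encodes position
```

With uniform scores, token $j$ keeps mass $(j{+}1)/T$ on real positions, so each token carries an absolute positional scalar even under relative encodings.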

Decoder-only models are thus well suited to (1) highly efficient streaming generation and (2) hardware- or memory-limited deployments, but are limited when flexible global access or bidirectionality is essential.

3. Scaling Laws, Efficiency, and Model Variants

Empirical studies show that compute-optimal scaling exponents for test perplexity with decoder-only LLMs closely track those of encoder–decoder models; decoder-only architectures yield lower perplexity at fixed parameter count, whereas encoder–decoder models attain similar quality with 20–30% less compute (Zhang et al., 30 Oct 2025).

Several innovations optimize decoder-only architectures:

  • Parallel Towers (ParallelGPT): Stack decoder blocks into parallel paths, increase embedding dimension, sum weighted output streams. Enables synchronous training and dynamic pruning of towers for inference (Suresh et al., 2024).
  • Linear/Convolutional Compression (LinearGPT/ConvGPT): Reduce the hidden dimension and, for ConvGPT, also pool the sequence length via strided convolution after every few blocks. Achieves roughly 36% parameter reduction and up to 20% faster training while maintaining near-baseline language modeling loss (Suresh et al., 2024).
  • TreeCoders: Serialize decoder blocks as a $k$-ary tree. Each token routes through a logarithmic number of transformer nodes via learned selectors, reducing per-token compute and enabling sparse activation and parallelism (D'Istria et al., 2024).
  • Transformer-VQ: Attains exact softmax-based dense self-attention in linear time per layer via vector-quantized keys and a block-recurrent stateful cache, scaling to very long sequences with substantial speedups at long contexts (Lingle, 2023).
  • YOCO: Splits the decoder into a self-decoder stack (efficient attention, constant-size KV cache) and a cross-decoder (global attention over a single cached prefix), substantially reducing prefill memory and latency for million-token contexts while preserving full-context modeling capacity (Sun et al., 2024).
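
To make the parallel-towers idea concrete, here is a hypothetical sketch (toy linear maps stand in for decoder-block stacks; the weights and pruning interface are illustrative, not ParallelGPT's actual API):

```python
import numpy as np

class ParallelTowers:
    """ParallelGPT-style sketch: run independent decoder towers in parallel
    and sum their outputs with learned weights; towers can be dropped at
    inference by restricting the active set. Illustrative only."""
    def __init__(self, towers, weights):
        self.towers = towers
        self.w = np.asarray(weights, dtype=float)

    def __call__(self, x, active=None):
        active = range(len(self.towers)) if active is None else active
        return sum(self.w[i] * self.towers[i](x) for i in active)

# Toy "towers": linear maps standing in for stacks of decoder blocks.
rng = np.random.default_rng(0)
mats = [rng.standard_normal((8, 8)) for _ in range(3)]
model = ParallelTowers([lambda x, M=M: x @ M for M in mats], [0.5, 0.3, 0.2])
x = rng.standard_normal((4, 8))
full = model(x)
pruned = model(x, active=[0, 1])       # dynamically prune tower 2 at inference
print(full.shape, pruned.shape)
```

Because the combination is a weighted sum, dropping a tower removes exactly its weighted contribution, which is what makes post-hoc pruning cheap.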

These modifications enhance resource efficiency, long-context scalability, distributed deployment, and sparse activation, while retaining or improving downstream performance in code, language, and vision domains.

4. Dynamic Computation and Layer Selection

Dynamic layer selection aims to minimize inference cost by adapting model depth per token or per input sequence:

  • Early Exiting: Run only the first $k$ layers, exiting when a controller signals sufficiency. This can degrade hidden-state similarity and accuracy, particularly for shallow exits (Glavas et al., 2024).
  • Layer Skipping (Uniform/Random): Execute a fixed number $k$ of layers, distributed uniformly or randomly throughout the stack. Uniform skipping best preserves hidden-state representations (cosine similarity near 1, versus roughly 0.7 for early exit); random skipping is harmful if early layers are omitted.
  • Per-sequence Oracle: Allocate model depth per input sequence, solving a knapsack-style allocation to maximize quality at a target average cost. Oracle allocation matches full-model performance while executing only a fraction of the layers on average (Glavas et al., 2024).
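
A minimal sketch contrasting early exit with uniform skipping, using toy layers (the selection rule is illustrative, not the papers' exact procedures):

```python
import numpy as np

def run_with_skipping(x, layers, budget, mode="uniform"):
    """Execute only `budget` of the layers. 'early_exit' keeps the first
    `budget` layers; 'uniform' spreads the executed layers evenly through
    the stack. A sketch of the strategies compared by Glavas et al."""
    L = len(layers)
    if mode == "early_exit":
        keep = range(budget)
    else:  # uniform skipping
        keep = np.linspace(0, L - 1, budget).round().astype(int)
    for i in keep:
        x = layers[i](x)
    return x

# Toy "layers": layer i adds i, so the output records which layers ran.
layers = [lambda x, i=i: x + i for i in range(12)]
out_uniform = run_with_skipping(np.zeros(1), layers, budget=4)
out_early = run_with_skipping(np.zeros(1), layers, budget=4, mode="early_exit")
print(out_uniform, out_early)
```

Uniform skipping still touches late layers (here indices 0, 4, 7, 11), which is the intuition for why it preserves hidden-state geometry better than truncating the stack.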

Learned token-wise controllers fail to exploit hidden state information for skip decisions; token-agnostic skip probabilities yield equivalent results. Dynamic computation is most promising when allocation is sequence-aware, motivating efficient prompt-based routing.

5. Multimodal, Vision, and Streaming Extensions

Decoder-only architectures generalize beyond pure language modeling:

  • Optical Character Recognition (DTrOCR): Eliminates the vision encoder—patch embeddings of images and text tokens are concatenated and processed as a single sequence through a GPT-2-style decoder stack, outperforming encoder–decoder and masked-LM baselines on scene, handwriting, and Chinese text recognition (Fujitake, 2023). The approach leverages generative language pretraining for cross-modal reasoning.
  • Streaming Translation (DST): Jointly processes source/target prefix tokens in a unified decoder-only stack equipped with distinct source/target position encodings and a Streaming Self-Attention (SSA) block. SSA computes a read/write policy determining whether the source prefix suffices for generation, achieving state-of-the-art BLEU and average lagging on SiMT benchmarks (Guo et al., 2024).
  • YOCO/Multimodal: YOCO allows multimodal extensions by stacking multiple self-decoders feeding a unified cross-decoder (Sun et al., 2024).
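
For intuition, the classic wait-k schedule from the baseline row of the table below can be sketched as follows (`translate_one` is a hypothetical callback; DST instead learns this read/write policy via SSA):

```python
def wait_k_schedule(src_tokens, k, translate_one):
    """Wait-k read/write schedule: first READ k source tokens, then
    alternate WRITE/READ. `translate_one` produces one target token from
    the source prefix read so far, or None when translation is finished."""
    target, read = [], 0
    while True:
        if read < min(k + len(target), len(src_tokens)):
            read += 1                                  # READ: consume next source token
        else:
            tok = translate_one(src_tokens[:read], len(target))
            if tok is None:                            # end of translation
                return target
            target.append(tok)                         # WRITE: emit one target token

# Toy "translator": uppercase each source token, one-for-one.
src = ["guten", "morgen", "welt"]
out = wait_k_schedule(
    src, k=2,
    translate_one=lambda prefix, t: src[t].upper() if t < len(src) else None)
print(out)  # ['GUTEN', 'MORGEN', 'WELT']
```

With k=2, the first target token is emitted after two source tokens are read; a learned policy like SSA can instead adapt this lag per input.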

Table: Empirical BLEU comparison on De→En SiMT (WMT15) (Guo et al., 2024)

| Method | Avg Lagging (AL) | BLEU  |
|--------|------------------|-------|
| Wait-k | 3.85             | 26.86 |
| HMT    | 4.74             | 30.29 |
| DST    | 4.72             | 30.55 |

6. Mathematical Foundations and Interpretability

The Dual Filter framework formalizes the decoder-only Transformer as an iterative solution to a causal nonlinear prediction MMSE problem for HMMs (Chang et al., 1 May 2025). By recasting sequential prediction as a backward stochastic optimal control problem, the approach yields a fixed-point equation on the space of probability measures. Each decoder layer approximates a Picard iteration, directly paralleling masked self-attention, residual, and normalization. Position encoding, normalization, and dimension correspondences clarify the equivalence. Numerical experiments using realistic hyperparameters demonstrate that the dual filter matches transformer inference at all scales.
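
The layer-as-iteration analogy can be illustrated with a generic Picard (fixed-point) iteration; the contraction `math.cos` here is a toy stand-in for the dual filter's operator on probability measures:

```python
import math

def picard(F, x0, n):
    """Picard iteration x_{k+1} = F(x_k). In the dual-filter view, each
    decoder layer plays the role of one such iterate (a stylized analogy,
    not the paper's exact operator)."""
    x = x0
    for _ in range(n):
        x = F(x)
    return x

# cos is a contraction near its unique fixed point x* = cos(x*) ≈ 0.739085.
fp = picard(math.cos, 1.0, 50)
print(round(fp, 6))  # ≈ 0.739085
```

Stacking more layers corresponds to running more iterates, so depth buys proximity to the fixed point at the usual geometric contraction rate.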

TreeCoders further contribute interpretability: learned hard routing provides coarse specialization of token paths; sparse activation and distributed mapping enable branch-level analysis of specialization (D'Istria et al., 2024).

7. Practical Applications, Trade-offs, and Future Directions

Decoder-only Transformers dominate large-scale language modeling owing to their autoregressive efficiency, unified architecture, and streamlined token-wise computation. However, extrapolation beyond the training context length leads to performance degradation ("locality decay"), and inference/training throughput can fall behind encoder–decoder models at scale (Zhang et al., 30 Oct 2025).

Trade-offs are notable:

  • Autoregressive inference and KV cache reuse favor decoder-only for generation, but limit global context and efficiency for tasks requiring long-range or bidirectional modeling.
  • Architectural variants (YOCO, VQ, compression, Tree, parallel towers) provide avenues for efficient adaptation, streaming, longer contexts, and modularity.
  • Bidirectional prompt attention and hybrid modules improve finetuning and downstream generalization.

Recommendations for ongoing research:

  • Hybrid designs incorporating lightweight encoders for prompts and deep decoders for generation.
  • Further study of positional encoding, sparse attention mechanisms, and mixture objectives to extend context length capability.
  • Examination of dynamic computation routing, multimodal stacks, and universal approximation strategies.
  • Scaling beyond 8B parameters and empirically validating compute-optimal frontiers.

Decoder-only Transformers remain central, both as practical deployed models and as theoretical exemplars of causal sequence modeling. Their continual evolution is marked by innovations bridging efficiency, expressivity, and adaptability across computational regimes and domains.
