
Decoder-Only Transformer Architectures

Updated 27 January 2026
  • Decoder-only Transformer architectures are neural sequence models that utilize masked self-attention and feed-forward layers to generate outputs in an autoregressive manner.
  • They enhance efficiency with methods such as key/value caching, dynamic layer selection, and sparse activation, which improve both training and inference performance.
  • These models are applied in language modeling, code generation, vision tasks, and streaming translation while balancing trade-offs between scalability and long-context reasoning.

A decoder-only Transformer is a neural sequence model in which all input tokens are processed by a stack of masked self-attention layers and position-wise feed-forward networks under an autoregressive causal constraint: each position attends only to earlier positions in the sequence. The architecture originated as a simplification of the standard encoder–decoder Transformer and now subsumes language modeling, code generation, vision tasks, and streaming translation within a single unified autoregressive stack. More sophisticated variants introduce architectural innovations to improve computational efficiency, scalability, and flexibility across model families.

1. Architectural Foundations

The canonical decoder-only Transformer comprises $L$ identical blocks, each with multi-head self-attention (masked for causality) and a feed-forward network, plus LayerNorm and residual connections. Input tokens $x_{1:T}$ are embedded, augmented with positional or relative encodings, and propagated sequentially through the stack. Causal masking enforces that each query at position $j$ attends only over keys/values from positions $i \leq j$, implemented by setting the strictly upper-triangular entries of the attention score matrix to $-\infty$ before the softmax (equivalently, zeroing the corresponding attention weights) (Ewer et al., 2024, Zhang et al., 30 Oct 2025).

Each layer $\ell$ and head $h$ computes the attention output for token $j$ as:

$$A_j^{(\ell,h)} = \sum_{i=1}^{j} \operatorname{softmax}_i\!\left( \frac{Q_j^{(\ell,h)} \left(K_i^{(\ell,h)}\right)^{\top}}{\sqrt{d}} \right) V_i^{(\ell,h)}$$
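Written over a whole sequence with an explicit causal mask, the same computation looks as follows. This is a minimal single-head NumPy sketch in which the projections producing `Q`, `K`, `V` are assumed to have been applied already:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head masked self-attention over a full sequence.

    Q, K, V: arrays of shape (T, d) -- already-projected queries/keys/values.
    Returns an array of shape (T, d) where position j only mixes values
    from positions i <= j.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (T, T) raw attention scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)           # block attention to future positions
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V

# Toy usage with random inputs
rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
out = causal_attention(Q, K, V)   # out[j] depends only on positions 0..j
```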

Key/Value caching enables fast autoregressive inference: at generation step $t$, the model reuses the cached $K_{1:t-1}, V_{1:t-1}$; only $Q_t$ (along with $K_t, V_t$, which are appended to the cache) and attention over the prefix are computed. The complexity of generating each new token is $O(nDL)$ for sequence length $n$, embedding dimension $D$, and depth $L$ (Ewer et al., 2024, Zhang et al., 30 Oct 2025).
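A minimal NumPy sketch of one incremental decoding step with such a cache; the per-layer projection matrices `Wq`, `Wk`, `Wv` are assumed inputs rather than part of any specific codebase:

```python
import numpy as np

def decode_step(x_t, cache, Wq, Wk, Wv):
    """One autoregressive step of single-head attention with a KV cache.

    x_t:   (D,) embedding of the token at the current position t.
    cache: dict with 'K' and 'V' arrays of shape (t-1, d) for the prefix
           (initialize with np.empty((0, d)) before the first step).
    Returns the attention output for position t and the updated cache.
    """
    q_t, k_t, v_t = x_t @ Wq, x_t @ Wk, x_t @ Wv     # only the new token is projected
    K = np.vstack([cache["K"], k_t])                 # append the new key to the cache
    V = np.vstack([cache["V"], v_t])                 # append the new value to the cache
    d = q_t.shape[-1]
    scores = K @ q_t / np.sqrt(d)                    # (t,) scores; no mask needed, the
    weights = np.exp(scores - scores.max())          # cache holds only positions <= t
    weights /= weights.sum()
    out = weights @ V                                # (d,) attention output for position t
    return out, {"K": K, "V": V}
```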

Pretraining employs causal language modeling (maximizing $\sum_t \log P(x_t \mid x_{<t})$). Position information is injected via absolute, rotary, or relative encodings; modern decoder stacks employ RMSNorm, SwiGLU activations, and weight tying for stable scaling.
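The objective is ordinary next-token cross-entropy obtained by shifting the inputs by one position. A minimal PyTorch sketch, where `model` returning per-position logits is an assumption rather than a specific library API:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(model, token_ids):
    """Causal language-modeling loss: predict token t+1 from tokens <= t.

    token_ids: LongTensor of shape (batch, T).
    """
    inputs = token_ids[:, :-1]            # positions 1..T-1 serve as contexts
    targets = token_ids[:, 1:]            # each context predicts the next token
    logits = model(inputs)                # (batch, T-1, vocab_size), assumed model output
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten batch and time dimensions
        targets.reshape(-1),
    )
```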

2. Expressive Power, Universality, and Limitations

Decoder-only architectures excel at efficient autoregressive modeling, but their expressivity is constrained by causal accessibility. Theoretical lower bounds show that on certain algorithmic tasks (e.g., $\operatorname{Count3}$), the required depth scales as $\Omega(n)$ for $n$-length inputs under fixed width and causal masking; that is, global table-based computations are not efficiently implementable (Ewer et al., 2024). Encoder-only stacks (ENTP), in contrast, admit constant-depth solutions for such problems, achieving superior length generalization and sample efficiency on addition, in-context learning, and algorithmic-reasoning tasks.

In the context of universal approximation, vanilla causal masking combined with relative position encoding fails to encode absolute positions, resulting in non-universality for position-critical functions. StableMask refines the mask by introducing strictly upper-triangular pseudo-attention slots that soak up surplus attention mass, transforming the regular masked softmax into a form that encodes an absolute positional scalar per token, achieving universal approximation of sequence-to-sequence mappings dependent on position (Yin et al., 2024).
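A rough single-head sketch of the idea, as an illustrative simplification rather than StableMask's exact construction: masked slots receive finite pseudo-scores that enter the softmax normalization but contribute no value, so the total weight available to real positions varies with absolute position.

```python
import numpy as np

def pseudo_masked_attention(Q, K, V, pseudo_score=0.0):
    """Illustrative variant of causal attention with pseudo-attention slots.

    Masked (future) positions keep a finite pseudo-score instead of -inf.
    They enter the softmax denominator but their value contribution is zeroed,
    so the total weight on real positions depends on how many pseudo slots a
    row has, i.e. on the token's absolute position. This is a simplified
    sketch of the StableMask idea, not the paper's exact construction.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, pseudo_score, scores)          # pseudo-scores, not -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # normalize over real + pseudo slots
    weights = np.where(future, 0.0, weights)                 # pseudo slots contribute no value
    return weights @ V
```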

Decoder-only models are thus best suited to (1) highly efficient streaming generation and (2) hardware- or memory-limited deployments, but are limited when flexible global access or bidirectionality is essential.

3. Scaling Laws, Efficiency, and Model Variants

Empirical studies show that compute-optimal scaling exponents for test perplexity with decoder-only LLMs closely track encoder–decoder models: $P(C) \sim C^{-\alpha_C}$, $P(N) \sim N^{-\alpha_N}$. Decoder-only architectures yield lower perplexity at fixed parameter count, whereas encoder–decoder models attain similar quality with $\sim$20–30% less compute (Zhang et al., 30 Oct 2025).
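Such exponents are typically estimated by a linear fit in log–log space; a minimal NumPy sketch with made-up illustrative numbers (not data from the cited study):

```python
import numpy as np

# Hypothetical (compute, perplexity) pairs purely for illustration.
compute = np.array([1e18, 1e19, 1e20, 1e21])
perplexity = np.array([22.0, 16.5, 12.8, 10.1])

# P(C) ~ C^{-alpha_C}  =>  log P = -alpha_C * log C + const
slope, intercept = np.polyfit(np.log(compute), np.log(perplexity), deg=1)
alpha_C = -slope
print(f"estimated scaling exponent alpha_C ~ {alpha_C:.3f}")
```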

Several innovations optimize decoder-only architectures:

  • Parallel Towers (ParallelGPT): Stack decoder blocks into parallel paths, increase the embedding dimension, and sum the weighted output streams (see the sketch after this list). This enables synchronous training and dynamic pruning of towers at inference (Suresh et al., 2024).
  • Linear/Convolutional Compression (LinearGPT/ConvGPT): Reduce the hidden dimension and, for ConvGPT, also pool the sequence length via strided convolution after every few blocks. Achieves a $\sim$36% parameter reduction and up to $\sim$20% faster training while maintaining near-baseline language modeling loss (Suresh et al., 2024).
  • TreeCoders: Serialize decoder blocks as a $k$-ary tree. Each token routes through $h = O(\log_k N_\text{tot})$ transformer nodes via learned selectors, reducing per-token compute complexity from $O(N)$ to $O(\log N)$ and enabling sparse activation and parallelism (D'Istria et al., 2024).
  • Transformer-VQ: Attains exact softmax-based dense self-attention in $O(T)$ time per layer via vector-quantized keys and a block-recurrent stateful cache, scaling to sequences of length $T = 131$k with $>10\times$ speedups at long contexts (Lingle, 2023).
  • YOCO: Splits the decoder into a self-decoder stack (efficient attention, constant-size KV cache) and a cross-decoder (global attention over a single cached prefix), reducing prefill memory by $9.4\times$ and latency by $30\times$ for 1M context tokens, with preserved full-context modeling capacity (Sun et al., 2024).
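A schematic PyTorch sketch of the parallel-towers idea referenced above; the layer counts, widths, and weighted-sum combination rule here are illustrative assumptions rather than the published ParallelGPT configuration:

```python
import torch
import torch.nn as nn

class ParallelTowers(nn.Module):
    """Schematic parallel-towers decoder-only model: several shallow causal
    stacks process the same embedded sequence in parallel and their outputs
    are combined with learned weights. Towers can be pruned at inference."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8,
                 n_towers=4, layers_per_tower=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Each "tower" is a small stack of self-attention blocks; a causal mask
        # passed at forward time makes the encoder layers behave as a decoder.
        self.towers = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True, norm_first=True),
                num_layers=layers_per_tower,
            )
            for _ in range(n_towers)
        ])
        self.tower_weights = nn.Parameter(torch.ones(n_towers) / n_towers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        T = token_ids.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.embed(token_ids)
        outs = torch.stack([tower(h, mask=causal) for tower in self.towers])
        mixed = (self.tower_weights.view(-1, 1, 1, 1) * outs).sum(dim=0)
        return self.lm_head(mixed)   # (batch, T, vocab_size) next-token logits
```

Because every tower sees the same input and the combination is a simple weighted sum, individual towers can be dropped (their weights zeroed) at inference to trade quality for speed.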

These modifications enhance resource efficiency, long-context scalability, distributed deployment, and sparse activation, while retaining or improving downstream performance in code, language, and vision domains.

4. Dynamic Computation and Layer Selection

Dynamic layer selection aims to minimize inference cost by adapting model depth per token or per input sequence:

  • Early Exiting: Run only the first $L_E$ layers and exit when a controller signals sufficiency. This can degrade hidden-state similarity and accuracy, particularly for shallow exits (Glavas et al., 2024).
  • Layer Skipping (Uniform/Random): Execute a fixed number $c$ of layers, distributed uniformly or randomly throughout the stack (see the sketch after this list). Uniform skipping preserves hidden-state representations ($>0.9$ cosine similarity vs. $\sim$0.7 for early exit); random skipping is deleterious if early layers are omitted.
  • Per-sequence Oracle: Allocate model depth per input, solving an allocation knapsack to maximize quality at a target average cost. Oracle allocation matches full-model performance using only 23.3% of layers on average (Glavas et al., 2024).
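As referenced in the layer-skipping item above, uniform skipping simply executes an evenly spaced subset of blocks. A minimal PyTorch-style sketch, where the `blocks` list of decoder layers and their call signature are assumptions:

```python
import torch

def forward_with_uniform_skipping(blocks, h, causal_mask, budget):
    """Run only `budget` of the available decoder blocks, spaced uniformly
    through the stack, so early/middle/late layers are all represented.

    blocks:      list of decoder layers, each mapping (h, mask) -> h (assumed).
    h:           hidden states of shape (batch, T, d_model).
    budget:      number of layers to actually execute (c <= len(blocks)).
    """
    n = len(blocks)
    # Evenly spaced layer indices, e.g. n=24, budget=8 -> 0, 3, 6, ..., 23.
    keep = torch.linspace(0, n - 1, steps=budget).round().long().tolist()
    for idx in keep:
        h = blocks[idx](h, causal_mask)
    return h
```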

Learned token-wise controllers fail to exploit hidden state information for skip decisions; token-agnostic skip probabilities yield equivalent results. Dynamic computation is most promising when allocation is sequence-aware, motivating efficient prompt-based routing.

5. Multimodal, Vision, and Streaming Extensions

Decoder-only architectures generalize beyond pure language modeling:

  • Optical Character Recognition (DTrOCR): Eliminates the vision encoder; patch embeddings of images and text tokens are concatenated and processed as a single sequence through a GPT-2-style decoder stack, outperforming encoder–decoder and masked-LM baselines on scene, handwriting, and Chinese text recognition (Fujitake, 2023); see the sketch after the table below. The approach leverages generative language pretraining for cross-modal reasoning.
  • Streaming Translation (DST): Jointly processes source and target prefix tokens in a unified decoder-only stack equipped with distinct source/target position encodings and a Streaming Self-Attention (SSA) block. SSA computes a read/write policy $p_{i,j}$ that determines whether the source prefix suffices for generating the next target token, achieving state-of-the-art BLEU and average lagging on SiMT benchmarks (Guo et al., 2024).
  • YOCO/Multimodal: YOCO allows multimodal extensions by stacking multiple self-decoders feeding a unified cross-decoder (Sun et al., 2024).

Table: Empirical BLEU comparison on De→En SiMT (WMT15) (Guo et al., 2024)

| Method | Avg Lagging (AL) | BLEU |
|--------|------------------|------|
| Wait-k | 3.85 | 26.86 |
| HMT | 4.74 | 30.29 |
| DST | 4.72 | 30.55 |
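For the decoder-only OCR setup referenced above (DTrOCR), the key move is to turn image patches into a prefix of visual tokens that the causal decoder consumes before the text tokens. A schematic PyTorch sketch, where the patch size, dimensions, and module names are illustrative assumptions rather than the published implementation:

```python
import torch
import torch.nn as nn

class PatchPrefixDecoderInput(nn.Module):
    """Builds a single decoder input sequence: image-patch embeddings first,
    then text-token embeddings, so a causal LM can condition text generation
    on the image prefix without a separate vision encoder."""

    def __init__(self, vocab_size=50257, d_model=768, patch_size=16, in_chans=3):
        super().__init__()
        # Linear patchify: one projection per non-overlapping image patch.
        self.patch_embed = nn.Conv2d(in_chans, d_model,
                                     kernel_size=patch_size, stride=patch_size)
        self.token_embed = nn.Embedding(vocab_size, d_model)

    def forward(self, image, token_ids):
        # image: (batch, 3, H, W); token_ids: (batch, T_text)
        patches = self.patch_embed(image)                    # (batch, d, H/16, W/16)
        patches = patches.flatten(2).transpose(1, 2)         # (batch, T_img, d)
        text = self.token_embed(token_ids)                   # (batch, T_text, d)
        return torch.cat([patches, text], dim=1)             # one unified sequence
```

The concatenated sequence is then fed to an ordinary causal decoder stack; during generation the image patches act as a fixed prefix over which every text position can attend.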

6. Mathematical Foundations and Interpretability

The Dual Filter framework formalizes the decoder-only Transformer as an iterative solution to a causal nonlinear prediction MMSE problem for HMMs (Chang et al., 1 May 2025). By recasting sequential prediction as a backward stochastic optimal control problem, the approach yields a fixed-point equation on the space of probability measures. Each decoder layer approximates a Picard iteration, directly paralleling masked self-attention, residual, and normalization. Position encoding, normalization, and dimension correspondences clarify the equivalence. Numerical experiments using realistic hyperparameters demonstrate that the dual filter matches transformer inference at all scales.

TreeCoders further contribute interpretability: learned hard routing provides coarse specialization of token paths; sparse activation and distributed mapping enable branch-level analysis of specialization (D'Istria et al., 2024).

7. Practical Applications, Trade-offs, and Future Directions

Decoder-only Transformers dominate large-scale language modeling owing to their autoregressive efficiency, unified architecture, and streamlined token-wise computation. However, extrapolation beyond the training context length leads to performance degradation ("locality decay"), and inference/training throughput falls behind encoder–decoders by $2$–$3\times$ at scale (Zhang et al., 30 Oct 2025).

Trade-offs are notable:

  • Autoregressive inference and KV cache reuse favor decoder-only for generation, but limit global context and efficiency for tasks requiring long-range or bidirectional modeling.
  • Architectural variants (YOCO, VQ, compression, Tree, parallel towers) provide avenues for efficient adaptation, streaming, longer contexts, and modularity.
  • Bidirectional prompt attention and hybrid modules improve finetuning and downstream generalization.

Recommendations for ongoing research:

  • Hybrid designs incorporating lightweight encoders for prompts and deep decoders for generation.
  • Further study of positional encoding, sparse attention mechanisms, and mixture objectives to extend context length capability.
  • Examination of dynamic computation routing, multimodal stacks, and universal approximation strategies.
  • Scaling beyond 8B parameters and empirically validating compute-optimal frontiers.

Decoder-only Transformers remain central, both as practical deployed models and as theoretical exemplars of causal sequence modeling. Their continual evolution is marked by innovations bridging efficiency, expressivity, and adaptability across computational regimes and domains.
