
Transformer Decoder LLMs

Updated 28 April 2026
  • Transformer decoder LLMs are deep autoregressive models built with stacked masked self-attention and feed-forward layers for efficient sequence generation.
  • They use causal (masked) attention with key–value caching to enable scalable autoregressive language modeling and few-shot program synthesis.
  • Innovations like TOVA, DMTD, and architectural compression enhance efficiency, speed, and memory management while reducing computational bottlenecks.

A transformer decoder LLM is a neural architecture consisting of a deep stack of identical decoder blocks, each implementing masked multi-head self-attention and position-wise feed-forward transformations. Unlike encoder-decoder or encoder-only transformers, decoder-only LLMs operate in an autoregressive causal regime, where each token prediction depends only on preceding tokens. This class of architectures—including GPT, Llama, Mistral, and their variants—has become the dominant backbone for modern LLMs, supporting tasks from next-token prediction to few-shot program synthesis and language modeling. Research over the past several years has elucidated the formal computational properties of these models, their efficiency bottlenecks, architectural tradeoffs, and their behavior under various efficiency-driven or interpretability-driven modifications.

1. Architectural Principles and Formal Properties

Transformer decoder LLMs are constructed from $N$ identical blocks, each composed of masked multi-head self-attention and feed-forward sublayers, plus residuals and layer normalizations. At generation step $t$, the input is a concatenation of a prompt prefix $[x_1,\dots,x_m]$ and previously generated outputs $[y_1,\dots,y_{t-1}]$, embedded and passed through the stack. Layer $\ell$ computes, for position $i$:

  • Query/key/value projections: $q^\ell_i,\,k^\ell_i,\,v^\ell_i = \mathrm{LinearProj}(z^{\ell-1}_i)$
  • Causal attention: $\alpha^\ell_{ij} = \mathrm{softmax}\big((q^\ell_i \cdot (k^\ell_j)^\top)/\sqrt{d}\big)$ over $j \leq i$
  • Self-attention update: $o^\ell_i = \sum_{j \le i} \alpha^\ell_{ij}\, v^\ell_j$
  • Feed-forward and residual: $z^\ell_i = \mathrm{FFN}(\tilde z^\ell_i) + \tilde z^\ell_i$, where $\tilde z^\ell_i = o^\ell_i + z^{\ell-1}_i$ (layer normalizations applied around each sublayer)

After $N$ layers, the final-layer state at the current position is projected by an LM head to logits over the vocabulary $V$: $p(y_t \mid x_{1:m}, y_{<t}) = \mathrm{softmax}(W_{\mathrm{LM}}\, z^N_{m+t-1})$, with $y_t$ sampled or chosen greedily from this distribution.
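A minimal single-head NumPy sketch of one such block may help make the data flow concrete. It is a toy illustration only: the pre-norm placement, ReLU feed-forward, random weights, and dimensions below are assumptions for readability, not taken from any particular model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (no learned scale/shift for brevity).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def decoder_block(z, Wq, Wk, Wv, Wo, W1, W2):
    """One pre-norm decoder block: causal self-attention + position-wise FFN."""
    t, d = z.shape
    h = layer_norm(z)
    q, k, v = h @ Wq, h @ Wk, h @ Wv              # query/key/value projections
    scores = (q @ k.T) / np.sqrt(d)               # (t, t) attention logits
    causal_mask = np.triu(np.ones((t, t)), k=1).astype(bool)
    scores[causal_mask] = -np.inf                 # position i attends only to j <= i
    attn = softmax(scores) @ v                    # self-attention update o_i
    z = z + attn @ Wo                             # first residual connection
    h = layer_norm(z)
    ffn = np.maximum(h @ W1, 0.0) @ W2            # two-layer FFN with ReLU
    return z + ffn                                # second residual connection

# Toy usage: 5 positions, model width 16, FFN width 64, random weights.
rng = np.random.default_rng(0)
d, d_ff, t = 16, 64, 5
weights = [rng.normal(scale=0.1, size=s) for s in
           [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]]
z = rng.normal(size=(t, d))
print(decoder_block(z, *weights).shape)  # (5, 16)
```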

Fundamentally, decoder-only transformers are Turing-complete under standard idealizations (hardmax attention, infinite-precision weights, shared input/output embeddings, sufficient hidden size), capable of simulating arbitrary rational-weight RNNs and, by extension, universal computation (Roberts, 2023). Their architecture closely aligns with causal B-machines à la Hao Wang: reading only past tokens and appending outputs, enforcing a monotonic information flow.

2. Decoder-Only LLMs as Multi-State RNNs

Decoder-only transformers can be formally recast as unbounded multi-state recurrent neural networks (MSRNNs). The “hidden state” at each layer $\ell$ is the full stack of accumulated key–value matrices $(K^\ell_{1:t}, V^\ell_{1:t})$ for all steps so far. This state grows linearly with sequence length. The recurrence for MSRNNs generalizes single-state RNNs to matrix-valued states:

  • Update: $(H^\ell_t,\, z^\ell_t) = f^\ell(H^\ell_{t-1},\, z^{\ell-1}_t)$, with a matrix-valued state $H^\ell_t$ in place of a single hidden vector
  • For transformers: $H^\ell_t = (K^\ell_{1:t},\, V^\ell_{1:t})$, i.e., the cache obtained by appending the current key–value pair to the previous state

This mapping implies that any truncation, pruning, or compression of the key–value cache effectively converts the model into a bounded-memory MSRNN (Oren et al., 2024).
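The MSRNN view can be made concrete with a toy recurrence in which the “hidden state” is literally the growing stack of cached keys and values. The single-layer, single-head setup and random projections below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def msrnn_step(state, x_t, Wq, Wk, Wv):
    """One MSRNN step: the 'hidden state' is the stack of all past keys/values.

    state: (K, V), each of shape (t-1, d); grows by one row per step.
    x_t:   current token representation, shape (d,).
    """
    K, V = state
    q_t = x_t @ Wq
    K = np.vstack([K, x_t @ Wk])             # append new key  -> unbounded state
    V = np.vstack([V, x_t @ Wv])             # append new value
    alpha = softmax(q_t @ K.T / np.sqrt(len(x_t)))
    o_t = alpha @ V                          # attention output for this step
    return (K, V), o_t

rng = np.random.default_rng(1)
d = 8
Wq, Wk, Wv = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
state = (np.zeros((0, d)), np.zeros((0, d)))   # empty cache at t = 0
for t in range(6):
    state, o_t = msrnn_step(state, rng.normal(size=d), Wq, Wk, Wv)
    print(f"step {t}: cached keys/values = {state[0].shape[0]} rows")
```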

3. Efficiency Bottlenecks and Memory Compression

The major computational bottleneck in decoder-only LLMs is the ever-growing key–value (KV) cache—memory and compute scale with context length. Several approaches tackle this bottleneck:

Token Omission Via Attention (TOVA):

  • TOVA is a training-free, content-aware cache compression policy: at each step, drop the single token to which the current query pays the least attention (the argmin of the attention weights), and replace its cache row with the new token (a minimal code sketch of this rule follows this list).
  • Empirically, TOVA can shrink the cache to a small fraction of the full context length while keeping perplexity within 0.5 points of the full-cache model, enabling markedly higher decoding throughput and larger batch sizes at fixed memory (Oren et al., 2024).
  • TOVA’s policy is more effective than windowed truncation or static heuristics and reveals that LLMs, while trained as unbounded MSRNNs, can prune much of their “long-past” state without accuracy loss for most tasks.
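A minimal sketch of the eviction rule described above, under a hypothetical fixed cache budget (single head, random vectors standing in for real keys, values, and queries):

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def tova_update(K, V, k_new, v_new, q_new, budget):
    """Append the new token's key/value, then, if over budget, evict the
    cached token that receives the least attention from the current query."""
    K = np.vstack([K, k_new[None]])
    V = np.vstack([V, v_new[None]])
    if K.shape[0] > budget:
        attn = softmax(q_new @ K.T / np.sqrt(K.shape[1]))
        drop = int(np.argmin(attn))          # least-attended cache entry
        K = np.delete(K, drop, axis=0)
        V = np.delete(V, drop, axis=0)
    return K, V

rng = np.random.default_rng(2)
d, budget = 8, 4
K, V = np.zeros((0, d)), np.zeros((0, d))
for step in range(8):
    k_new, v_new, q_new = rng.normal(size=(3, d))
    K, V = tova_update(K, V, k_new, v_new, q_new, budget)
    print(f"step {step}: cache holds {K.shape[0]} tokens (budget {budget})")
```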

Other Cache Management Policies:

  • Fixed window: retain only the most recent $w$ tokens.
  • Window + initial tokens: keep the first few prompt tokens (attention sinks) together with the most recent window.
  • Attention-mass policy: retain the tokens with the highest cumulative attention mass (these three policies are sketched as index-selection rules below).
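The static policies above reduce to simple index-selection rules over cache positions; the sketch below illustrates them with made-up window, sink, and budget sizes.

```python
def fixed_window(n, w):
    """Keep only the most recent w of n cached positions."""
    return list(range(max(0, n - w), n))

def window_plus_sinks(n, w, s):
    """Keep the first s positions (attention sinks) plus the most recent window."""
    recent = range(max(s, n - w), n)
    return list(range(min(s, n))) + list(recent)

def top_attention_mass(cum_attn, k):
    """Keep the k positions with the highest cumulative attention mass."""
    order = sorted(range(len(cum_attn)), key=lambda i: cum_attn[i], reverse=True)
    return sorted(order[:k])

n = 12
print(fixed_window(n, 6))            # [6, 7, 8, 9, 10, 11]
print(window_plus_sinks(n, 6, 2))    # [0, 1, 6, 7, 8, 9, 10, 11]
cum = [0.9, 0.7, 0.1, 0.05, 0.3, 0.2, 0.4, 0.6, 0.15, 0.25, 0.8, 0.5]
print(top_attention_mass(cum, 6))    # indices of the 6 largest cumulative weights
```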

More aggressive compression yields larger speed and memory gains, but for tasks requiring long-range retrieval it also induces degradation; TOVA extends this efficiency–accuracy frontier.

4. Direct Multi-Token and Parallel Decoding

Autoregressive decoding is inherently sequential. However, several innovations enable efficient multi-token inference without quality loss:

Direct Multi-Token Decoding (DMTD):

  • DMTD leverages the layer-wise structure of decoder-only transformers: early layers encode context, middle layers reason, and late layers handle output prediction.
  • By reusing only the late “decoding” layers for each output token and refilling the early/mid-layer KV caches only once per cycle of several steps, the model emits multiple tokens per cycle (see the schematic after this list).
  • Empirically, Qwen3-4B with DMTD retains most of the vanilla model’s accuracy while delivering substantially higher throughput (Luo et al., 13 Oct 2025); larger models support longer cycles with minimal degradation.
  • DMTD does not require speculative verification or auxiliary models—efficiency is extracted via architectural specialization and cyclical, context-aware refilling.
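A control-flow schematic of the cyclical schedule referenced above. The `ToyModel` interface, layer split, and cycle length are invented for illustration and are not the paper's implementation.

```python
class ToyModel:
    """Stand-in for a decoder-only LM; real DMTD operates on actual layer stacks."""
    def run_all_layers(self, ids):            # full pass: refresh all KV caches
        return sum(ids) % 97
    def run_late_layers(self, ids, k):        # cheap pass over the last k layers only
        return (sum(ids) + k) % 97
    def lm_head_argmax(self, hidden):         # pick the next token id
        return hidden % 50

def dmtd_generate(model, prompt_ids, n_tokens, cycle=4, late_layers=8):
    """Cyclical DMTD-style schedule: one full pass per cycle, cheap late-layer
    passes in between (schematic; the split and cycle length are assumptions)."""
    out = list(prompt_ids)
    for step in range(n_tokens):
        if step % cycle == 0:
            hidden = model.run_all_layers(out)                 # refill early/mid KV
        else:
            hidden = model.run_late_layers(out, late_layers)   # reuse stale caches
        out.append(model.lm_head_argmax(hidden))
    return out[len(prompt_ids):]

print(dmtd_generate(ToyModel(), [3, 1, 4], n_tokens=6))
```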

Hidden Transfer and Tree Attention:

  • The hidden-transfer approach synthesizes future hidden states at an intermediate layer using learned projections, enabling multiple future tokens to be drafted in parallel through the remaining layers.
  • Tree attention bundles all candidate expansions into a single forward pass, with a mask enforcing ancestor-only dependencies (see the mask-construction sketch after this list). Greedy verification ensures lossless decoding: outputs always match the sequential model (Wu et al., 2024).
  • In experiments, tree-attention-based decoding yields an empirical 1.5–2.3× speedup on 7B–13B models, outperforming other single-model accelerators.
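The ancestor-only masking used by tree attention can be sketched as a small mask-construction routine; the candidate tree below is a made-up example, and attention to the committed prefix is omitted for brevity.

```python
import numpy as np

def tree_attention_mask(parent):
    """Boolean mask M where M[i, j] is True iff candidate j is an ancestor of i
    (or i itself), so each candidate attends only along its own branch.

    parent[i] is the index of node i's parent, or -1 for roots that hang
    directly off the committed prefix.
    """
    n = len(parent)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parent[j]
    return mask

# Hypothetical candidate tree: two first-token options, each with two children.
#   0 -> {2, 3},   1 -> {4, 5}
parent = [-1, -1, 0, 0, 1, 1]
print(tree_attention_mask(parent).astype(int))
```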

5. Structural and Architectural Variants

Numerous modifications of the base decoder-only transformer architecture exist, targeting efficiency and domain adaptation:

Architectural Compression:

  • ParallelGPT (p-gpt): Splits feature embeddings, processes them in two shorter parallel stacks, and fuses outputs. Permits block-level dropping for further speedup (Suresh et al., 2024).
  • LinearlyCompressedGPT (lc-gpt): Gradually halves the hidden dimensionality after every two layers via linear projections, yielding a 36% parameter reduction and 18% faster training (a shape-level sketch of this layout follows this list).
  • ConvCompressedGPT (cc-gpt): Uses 1D convolution (kernel size 3) in place of linear reduction for channel halving, facilitating inductive bias for sequences.
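A shape-level sketch of the linearly compressed layout mentioned above; the starting width, layer count, and floor width are illustrative assumptions rather than the paper's configuration.

```python
def lc_gpt_widths(d_model, n_layers, halve_every=2, floor=64):
    """Hidden width per layer when the dimensionality is halved (via a linear
    projection) after every `halve_every` layers, down to a floor width."""
    widths, d = [], d_model
    for layer in range(n_layers):
        if layer > 0 and layer % halve_every == 0:
            d = max(d // 2, floor)
        widths.append(d)
    return widths

# e.g. a hypothetical 12-layer model starting at width 768
print(lc_gpt_widths(768, 12))   # [768, 768, 384, 384, 192, 192, 96, 96, 64, ...]
```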

Augmented Blocks:

  • Conformer LLM: Adds shallow causal convolutions (e.g., stacked kernels of size 3 and 7) to supplement self-attention, boosting next-token likelihood and improving long-range dependency modeling in text and audio language modeling (Verma, 2023); a minimal causal-convolution sketch follows this list.
  • SONAR-LLM: Operates natively in a frozen sentence-embedding space, generating next-step embeddings and “decoding” them into tokens using a fixed decoder. This approach confers scaling benefits particularly for long-context efficiency and tasks with sentence-level granularity (Dragunov et al., 7 Aug 2025).
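A minimal sketch of the kind of causal 1-D (depthwise) convolution such augmented blocks add alongside attention; the kernel size follows the description above, while the zero left-padding and random weights are assumptions.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal depthwise 1-D convolution over the time axis: the output at
    position t uses only positions t-k+1 .. t (left padding keeps it causal)."""
    t, d = x.shape
    k = w.shape[0]
    x_pad = np.vstack([np.zeros((k - 1, d)), x])
    return np.stack([(x_pad[i:i + k] * w).sum(axis=0) for i in range(t)])

rng = np.random.default_rng(3)
x = rng.normal(size=(10, 16))                                # 10 positions, width 16
y = causal_conv1d(x, rng.normal(scale=0.2, size=(3, 16)))    # kernel size 3
print(y.shape)   # (10, 16): same length, no leakage from future positions
```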

Syntax Injection and Adapter Modules:

  • Gated Tree Cross-Attention (GTCA): Attaches a side-branch gated cross-attention to a frozen decoder checkpoint, reading precomputed constituency chunk memory. Training is staged to preserve backbone competence. With a token update mask and gating mechanism, GTCA improves syntactic robustness (BLiMP, CoLA) without harming MCQA/comprehension tasks (Gao et al., 23 Jan 2026); the gated cross-attention read is sketched after this list.
  • Trained Persistent Memory: Adapter modules can provide decoder-only models with persistent latent memory, interfacing solely via the self-attention path (KV prefix, parallel cross-attention, slot-based or associative memory). Strong inductive biases (explicit cross-attention, Hebbian update, slot-based writing) are necessary for retention at tight capacity (Jeong, 20 Mar 2026).
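A sketch of a gated cross-attention read of external memory, in the spirit of the side branch described above (single head; the zero-initialized gate and token update mask are shown schematically, and all dimensions are illustrative).

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def gated_cross_attention(z, memory, Wq, Wk, Wv, gate, update_mask):
    """Side-branch read of an external memory, added to the frozen stream.

    z:           (t, d) hidden states from the frozen decoder layer.
    memory:      (m, d) precomputed memory entries (e.g. chunk representations).
    gate:        scalar gate; tanh(0) = 0 leaves the backbone output unchanged.
    update_mask: (t,) 0/1 mask selecting which token positions receive the update.
    """
    q, k, v = z @ Wq, memory @ Wk, memory @ Wv
    attn = softmax(q @ k.T / np.sqrt(z.shape[1])) @ v     # (t, d) memory read
    return z + np.tanh(gate) * update_mask[:, None] * attn

rng = np.random.default_rng(4)
t, m, d = 6, 4, 16
Wq, Wk, Wv = (rng.normal(scale=0.2, size=(d, d)) for _ in range(3))
z, memory = rng.normal(size=(t, d)), rng.normal(size=(m, d))
mask = np.array([1, 1, 0, 1, 0, 1], dtype=float)
out = gated_cross_attention(z, memory, Wq, Wk, Wv, gate=0.0, update_mask=mask)
print(np.allclose(out, z))   # True: zero gate leaves the frozen stream unchanged
```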

6. Decoder-Only LLMs Beyond Generation: Dual-Use for Encoding

Although trained for unidirectional autoregressive decoding, decoder-only models have been shown to serve as rich contextual encoders with minimal adaptation:

  • LLM2Vec: By enabling bidirectional attention and introducing masked next-token prediction plus SimCSE-style unsupervised contrastive learning, LLM2Vec converts any decoder-only LLM into a competitive encoder for word- and sequence-level tasks. On the Massive Text Embedding Benchmark (MTEB), LLM2Vec-transformed models (including Mistral-7B) outperform encoder-only baselines and achieve new state-of-the-art results in unsupervised and supervised settings among public-data models (BehnamGhader et al., 2024); the core attention-mask change is sketched after this list.
  • For machine translation, the decoder stack can substitute for a conventional NMT encoder (with a full-attention mask and adaptation block). This yields strong representation learning, a 2.4–6.5× inference speedup, and a 75% reduction in KV cache size (Luo et al., 9 Mar 2025).
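The core LLM2Vec-style change, dropping the causal mask and pooling states into a text embedding, can be sketched with a toy single attention layer; masked next-token prediction and contrastive training are omitted, and the weights here are random placeholders.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def attention(h, Wq, Wk, Wv, causal):
    """Single attention layer; `causal` toggles the autoregressive mask."""
    t, d = h.shape
    scores = (h @ Wq) @ (h @ Wk).T / np.sqrt(d)
    if causal:
        scores[np.triu(np.ones((t, t), dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ (h @ Wv)

def text_embedding(h, Wq, Wk, Wv):
    """Decoder-as-encoder: reuse the same weights with full (bidirectional)
    attention and mean-pool over positions to get one vector per sequence."""
    return attention(h, Wq, Wk, Wv, causal=False).mean(axis=0)

rng = np.random.default_rng(5)
t, d = 7, 16
Wq, Wk, Wv = (rng.normal(scale=0.2, size=(d, d)) for _ in range(3))
h = rng.normal(size=(t, d))
print(attention(h, Wq, Wk, Wv, causal=True).shape)   # (7, 16): generation mode
print(text_embedding(h, Wq, Wk, Wv).shape)           # (16,):  encoding mode
```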

These results demonstrate that the “causal” behavior of the decoder is enforced by the attention mask rather than being inherent to the architecture, and that generative pretraining yields representations competitive for retrieval, classification, and downstream transfer.

7. Scaling Laws, Context Extrapolation, and Compute Efficiency

Decoder-only LLMs empirically exhibit robust power-law scaling across model families and tasks: perplexity improves as a power law of both parameter count and total training compute (Zhang et al., 30 Oct 2025); a toy illustration of fitting such an exponent appears after the list below. Notably:

  • When plotted versus compute budget, both encoder–decoder and decoder-only models display nearly identical scaling exponents—a reflection of architectural parity at large scale.
  • Context length extrapolation with rotary embeddings is feasible: decoder-only models maintain near-constant perplexity up to 2×–4× the training context window before performance degrades (Zhang et al., 30 Oct 2025).
  • After instruction tuning (e.g., FLAN), encoder–decoder and decoder-only LLMs achieve comparable downstream task performance at matched parameter counts, with RedLLM (encoder–decoder) often delivering greater inference efficiency (20–30% fewer FLOPs) (Zhang et al., 30 Oct 2025).
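As a toy illustration of what fitting a scaling exponent involves (referenced before the list above), the snippet below fits a power law to synthetic compute–loss pairs; the numbers are invented and unrelated to the cited study.

```python
import numpy as np

# Synthetic (compute, loss) observations roughly following loss = a * C^(-b).
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = np.array([3.9, 3.2, 2.65, 2.2, 1.85])

# Fit log(loss) = log(a) - b * log(C): the slope gives the scaling exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
print(f"estimated exponent b = {-slope:.3f}")
print(f"extrapolated loss at 1e23 FLOPs: {np.exp(intercept) * (1e23) ** slope:.2f}")
```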

Implications: Decoder-only transformers are not only computationally universal but also empirically competitive across generative, retrieval, translation, and encoding tasks, underscoring the ongoing convergence of traditionally distinct model families.

