Decoder-Only Large Language Models
- Decoder-only LLMs are transformer-based models that use causal self-attention to predict tokens autoregressively, enabling scalable and efficient generative language modeling.
- They leverage innovations such as dynamic inference techniques and architectural modifications to balance compute costs and performance in tasks like translation and code search.
- Despite their efficiency, decoder-only models face challenges with bidirectional context and complex reasoning, prompting research into hybrid and modular adaptations.
Decoder-only LLMs are a class of neural sequence models that use a stack of causal self-attention and feed-forward sublayers to model conditional distributions over sequences. Unlike encoder-decoder architectures, the decoder-only paradigm processes the entire input sequence as a single prompt and generates outputs autoregressively, predicting each token conditioned only on its leftward context. This architectural simplicity, coupled with scalability to billions of parameters and efficient inference, has enabled decoder-only LLMs to dominate recent progress in generative language modeling, in-context learning, and low-latency sequence generation.
1. Architectural Foundations
Decoder-only LLMs comprise a stack of identical transformer blocks, each containing masked (causal) multi-head self-attention, position-wise feed-forward networks (FFN), layer normalization, and residual connections. Input tokens are embedded and processed sequentially, with the model enforcing a causal attention mask that blocks each position $i$ from attending to any position $j > i$. The core operation within each block is scaled dot-product attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\big(QK^\top/\sqrt{d_k}\big)V$, where $Q$, $K$, and $V$ are query, key, and value projections derived from the previous layer's hidden states.
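A minimal sketch of this masked attention operation in PyTorch (single head, illustrative dimensions; not a production implementation):

```python
# Minimal single-head causal self-attention (illustrative dimensions, no dropout).
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # project hidden states
    scores = (q @ k.T) / math.sqrt(k.size(-1))           # scaled dot products
    # Causal mask: position i may not attend to any position j > i.
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                 # weighted sum of values

# Example: 8 tokens, d_model = d_head = 16.
x = torch.randn(8, 16)
proj = lambda: torch.randn(16, 16) * 0.1
out = causal_self_attention(x, proj(), proj(), proj())   # shape (8, 16)
```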
Model variants have introduced architectural modifications to improve efficiency and deployment. For example, ParallelGPT splits the decoder stack into two parallel "towers," merging their outputs with learned weights; LinearlyCompressedGPT reduces dimensionality between block groups via linear projections; ConvCompressedGPT replaces these projections with 1D convolutions to exploit local structure and reduce parameters. These modifications cut memory and compute at the cost of small performance drops, supporting on-device and low-resource settings (Suresh et al., 22 Apr 2024).
Table: Example Decoder-Only Model Configurations
| Model | Layers | Hidden Dim | Heads | Parameters |
|---|---|---|---|---|
| XGLM | 24 | 1,024 | 16 | 500M |
| LLaMA 2-7B | 32 | 4,096 | 32 | 7B |
| StarCoder | 40 | 6,144 | 48 | 15.5B |
2. Training Objectives and Sequence Modeling
Decoder-only LLMs are trained with the standard next-token prediction (causal language modeling) objective, $\mathcal{L} = -\sum_{t} \log p_\theta\big(y_t \mid [x; y_{<t}]\big)$, where $[x; y_{<t}]$ is the concatenated prompt (input plus previously generated tokens). The model learns to compute $p_\theta(y_t \mid [x; y_{<t}]) = \mathrm{softmax}(W h_t)$, where $h_t$ is the last hidden state at position $t$.
The causal masking ensures autoregressive inference, precluding explicit bidirectional context.
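A minimal sketch of this objective as usually implemented, with logits and labels shifted by one position (tensor names are illustrative):

```python
# Next-token prediction (causal LM) loss: each position predicts the following token.
import torch.nn.functional as F

def clm_loss(logits, input_ids):
    """logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)."""
    shift_logits = logits[:, :-1, :]          # predictions made at positions 0 .. T-2
    shift_labels = input_ids[:, 1:]           # targets are the next tokens 1 .. T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```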
Augmentations to the basic objective include interleaving span-masked language modeling or masked next-token prediction to improve robustness, particularly in code-focused LLMs (Chen et al., 29 Oct 2024).
3. Context Handling, Adaptation, and Task Performance
Context Handling
Decoder-only LLMs process the entire prompt as a sequence, with source and target concatenated for tasks such as machine translation. The model must "memorize" the alignment between input and output via token and positional embeddings, without explicit separation between encoding and decoding phases (M. et al., 12 Sep 2024).
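As a hedged illustration of this concatenated-prompt setup (the separator and token IDs below are hypothetical), source and target are packed into a single sequence, with the loss typically restricted to target positions:

```python
# Illustrative prompt packing for decoder-only translation fine-tuning.
# Loss is computed only on target tokens; source tokens are masked out with -100,
# the conventional ignore_index for cross-entropy in PyTorch-style trainers.
def pack_translation_example(src_ids, tgt_ids, sep_id, eos_id):
    input_ids = src_ids + [sep_id] + tgt_ids + [eos_id]
    labels = [-100] * (len(src_ids) + 1) + tgt_ids + [eos_id]
    return input_ids, labels

# Example with hypothetical token IDs:
inp, lab = pack_translation_example([12, 44, 9], [87, 3], sep_id=2, eos_id=1)
# inp = [12, 44, 9, 2, 87, 3, 1]; lab = [-100, -100, -100, -100, 87, 3, 1]
```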
While this enables highly flexible in-context learning (ICL), it also imposes limitations: models struggle with tasks requiring explicit bidirectional context, as in sequence labeling or time-dependent scientific prediction (e.g., PDE simulation). In such tasks, partial or full removal of the causal mask during fine-tuning can dramatically improve token-level prediction performance, matching or exceeding strong encoder-based baselines (Dukić et al., 25 Jan 2024). Data-centric wrappers (Parallel Flipping, Sequence Doubling) can similarly close the performance gap for autoregressive architectures on bidirectional tasks (García-de-Herreros et al., 6 Oct 2025).
Empirical Task Performance
- Machine Translation: In-context learning with decoder-only models such as XGLM-500M achieves BLEU 3–6, significantly lagging encoder-decoder models (mT5-base reaches BLEU 8–15, a 5–9 point gap). Fine-tuned LLaMA 2-7B achieves BLEU ≈ 13.5–14.0 (en→hi), still slightly behind mT5-base (14.14). The quality gap grows in many-to-one and many-to-many settings (M. et al., 12 Sep 2024).
- Sequence Labeling: Layer-wise masking removal in LLaMA2-7B yields F1 ≈ 92.0 on NER (CoNLL03), surpassing RoBERTa-large (F1=90.0). Selective unmasking in late or early layers matches or exceeds full unmasking (Dukić et al., 25 Jan 2024).
- Code Search: Fine-tuned decoder-only LLMs (e.g., CodeGemma) achieve up to 5.57% higher MRR than encoder-only UniXcoder on CSN and a 49.6% improvement in MAP on CoSQA⁺ after supervised contrastive fine-tuning (Chen et al., 29 Oct 2024); a generic sketch of such a contrastive objective follows this list.
- Causal Reasoning: Decoder-only models are brittle under distributional shifts and deep reasoning chains, suffering performance collapse on tasks requiring latent-space aggregation. Encoder-based models maintain stable logical invariants through bidirectional projections, and only extremely large decoder-only LLMs (e.g., GPT-5) reach parity (Roy et al., 11 Dec 2025).
- Scientific Cross-Modal Adaptation: Without architectural modification, decoder-only LLMs underperform (2–4× higher nRMSE than encoder-only) on PDEs. Post-hoc bidirectional context simulation (e.g. Sequence Doubling) recovers most of the lost performance (García-de-Herreros et al., 6 Oct 2025).
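The supervised contrastive fine-tuning mentioned in the code-search item above can be sketched as an in-batch InfoNCE-style objective over paired query and code embeddings; the exact formulation in (Chen et al., 29 Oct 2024) may differ, so treat this as a generic illustration:

```python
# Generic in-batch contrastive (InfoNCE-style) loss for query/code embeddings.
import torch
import torch.nn.functional as F

def info_nce(query_emb, code_emb, temperature=0.05):
    """query_emb, code_emb: (batch, dim); row i of each tensor forms a positive pair."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature              # cosine similarities, temperature-scaled
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)     # diagonal entries are the positives
```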
4. Scaling Laws, Efficiency, and Implementation Advances
Decoder-only LLMs exhibit scaling laws for cross-entropy loss $L$ as a function of non-embedding parameters $N$ and training set size $D$, following relations such as $L(N) = E + A\,N^{-\alpha}$ (single-variable) and $L(N, D) = E + A\,N^{-\alpha} + B\,D^{-\beta}$ (two-variable), with empirical exponents $\alpha$, $\beta$ fitted per setting and a loss floor $E$ approached for sufficiently large $N$ and $D$. Notably, these scaling laws do not generalize across domains or extrapolate reliably beyond the observed regime; careful re-fitting is required for new domains or larger models (Caillaut et al., 23 Sep 2024).
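A small helper for evaluating such a parametric law; the coefficients below are placeholders rather than fitted values from (Caillaut et al., 23 Sep 2024), which must be re-estimated per domain as noted above:

```python
# Evaluate a parametric scaling law L(N, D) = E + A*N**(-alpha) + B*D**(-beta).
# All coefficient values are placeholders for illustration only.
def scaling_law_loss(N, D, E=1.7, A=400.0, alpha=0.34, B=1.8e3, beta=0.28):
    return E + A * N ** -alpha + B * D ** -beta

# Example: a 1B-parameter model trained on 100B tokens (placeholder units).
print(scaling_law_loss(N=1e9, D=1e11))
```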
Depth and width scaling both yield similar per-unit-compute perplexity improvements, but width-scaling provides higher sample throughput on modern hardware (Caillaut et al., 23 Sep 2024). Compressed architectures enable 36%+ parameter reduction for on-device use; uniform scaling of decoder depth does not impair efficiency if dimensionality is adjusted in tandem (Suresh et al., 22 Apr 2024).
Pure autoregressive inference incurs a high per-token compute cost, but dynamic inference techniques such as layer skipping or early exit allow trading accuracy for speed: uniform layer skipping matches full-model accuracy while using as little as 23.3% of layers on average per sequence (Glavas et al., 26 Oct 2024). Direct Multi-Token Decoding (DMTD) increases throughput by up to roughly 2× by reusing the late decoding layers across several consecutive tokens, at minor accuracy cost (Luo et al., 13 Oct 2025). YOCO architectures cache key-value pairs only once, reducing memory and prefill latency by orders of magnitude and supporting million-token contexts with high throughput (Sun et al., 8 May 2024).
Table: Selected Efficiency Improvements
| Method | Speedup | Accuracy Loss |
|---|---|---|
| DMTD (τ=4) | 2.15× | 3.7% (vs vanilla) |
| YOCO (prefill) | 30× | ≤1 NLL vs base |
| Layer Skipping (23.3%) | – | None (oracle) |
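The layer-skipping entry above can be illustrated with a simple sketch that runs only a fixed fraction of decoder blocks per forward pass; the actual controller in (Glavas et al., 26 Oct 2024) is more sophisticated, so `blocks` and `keep_ratio` here are illustrative assumptions:

```python
# Illustrative uniform layer skipping: execute only every k-th decoder block.
# `blocks` is assumed to be a list of callables mapping hidden states to hidden
# states (e.g., transformer decoder layers); this is a deliberate simplification.
def forward_with_layer_skipping(hidden, blocks, keep_ratio=0.25):
    stride = max(1, round(1.0 / keep_ratio))   # keep roughly keep_ratio of the layers
    for i, block in enumerate(blocks):
        if i % stride == 0:
            hidden = block(hidden)             # execute this block
        # otherwise: skip the block and pass hidden states through unchanged
    return hidden
```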
5. Embedding, Retrieval, and Modality Adaptation
While decoder-only architectures were initially considered ill-suited to dense embeddings because of their unidirectional context, recent advances allow leveraging LLMs for embedding and retrieval tasks. Causal2Vec prepends a single "Contextual token," generated by an external lightweight BERT-style encoder, to each sequence; the final embedding concatenates the hidden states at the Contextual and EOS positions. This yields state-of-the-art results on MTEB while reducing sequence length and inference time by up to 85% and 82%, respectively, compared to methods that rely on in-context examples or input repetition (Lin et al., 31 Jul 2025). The approach requires neither architectural changes nor causal-mask removal, validating the viability of LLM-based embedding at competitive efficiency.
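A sketch of this dual-position pooling, assuming the Contextual token sits at position 0 and the EOS index is known; the actual Causal2Vec implementation may differ in details:

```python
# Sketch of dual-position pooling: concatenate the hidden states at the prepended
# Contextual-token position (index 0) and the final EOS position.
import torch

def contextual_eos_embedding(hidden_states, eos_index):
    """hidden_states: (seq_len, d_model) from the decoder's last layer."""
    contextual = hidden_states[0]            # prepended Contextual token
    eos = hidden_states[eos_index]           # end-of-sequence token
    return torch.cat([contextual, eos])      # (2 * d_model,) embedding
```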
6. Specialization, Modularity, and Multi-Task Adaptation
Parameter-efficient adaptation mechanisms extend the utility of decoder-only LLMs. LoRA adapters, Mixture-of-Experts (MoE) routing, and training strategies such as curriculum learning enable rapid multi-language or multi-domain code translation with minimal additional parameters. For instance, SteloCoder augments a StarCoder (15.5B) backbone with five per-language LoRA-expert branches (0.06% additional parameters per expert) and a gating network, achieving a >3.5 CodeBLEU improvement over previous leaderboard baselines (mean 73.76) on multi-language-to-Python translation at marginal compute and time cost (Pan et al., 2023).
Such modular adaptation mechanisms can generalize to a range of domains and tasks, with routing at inference enabling specialization without retraining the full backbone.
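A minimal sketch of per-expert LoRA branches combined with a gating network, in the spirit of this design; the layer shapes, rank, and top-1 routing rule are assumptions for illustration rather than the SteloCoder implementation:

```python
# Sketch: a frozen linear layer plus per-expert LoRA branches selected by a gate.
import torch
import torch.nn as nn

class LoRAMoELinear(nn.Module):
    def __init__(self, d_in, d_out, n_experts=5, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)                    # backbone projection
        for p in self.base.parameters():                      # keep backbone frozen
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.gate = nn.Linear(d_in, n_experts)                # lightweight router

    def forward(self, x):                                     # x: (batch, d_in)
        expert = self.gate(x).argmax(dim=-1)                  # top-1 expert per input
        a, b = self.A[expert], self.B[expert]                 # gather expert factors
        lora_out = torch.einsum("bi,bir,bro->bo", x, a, b)    # low-rank update
        return self.base(x) + lora_out
```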
7. Limitations and Future Developments
Decoder-only LLMs show limitations in tasks demanding robust bidirectional context, latent-state aggregation, or explicit multi-hop reasoning, as in causal reasoning or scientific simulation. While very large models can partially overcome these deficits, encoder or encoder-decoder architectures remain preferable in resource-constrained or distribution-shifted regimes (Roy et al., 11 Dec 2025).
Scaling laws require domain- and range-specific recalibration and do not guarantee reliable extrapolation across domains or languages (Caillaut et al., 23 Sep 2024). Open questions include the optimal design of dynamic computation controllers, further compression schemes, hybrid attention mechanisms, and integration of bidirectional context without violating autoregressive constraints.
Task-specific architectural modifications, such as explicit bidirectional layers for sequence labeling (Dukić et al., 25 Jan 2024) or prompt-level dynamic inference (Glavas et al., 26 Oct 2024), are likely directions for future research. Modular extensions (adapters, MoE, cross-modal inputs) and efficient context caching (YOCO) continue to advance scalability, efficiency, and adaptivity.
Overall, decoder-only LLMs represent a scalable and efficient backbone for generative and language modeling tasks, demonstrating strong adaptability with appropriate task-specific architectural and training strategies, but remain subject to architectural trade-offs in the context of bidirectional reasoning and robust generalization.