Finetuning Decoder-Only LLMs
- Finetuning decoder-only LLMs adapts autoregressive models to new domains and tasks through efficient parameter updates and optimized attention mechanisms.
- Attention mask manipulation and selective unmasking enhance performance on tasks like NER and code search by incorporating necessary bidirectional context.
- Parameter-efficient strategies such as LoRA, LNA, and DCFT reduce computational costs while maintaining or improving performance across modalities like speech, code, and semantic parsing.
Finetuning decoder-only LLMs involves adapting large-scale, transformer-based models—characterized by autoregressive, left-to-right token generation—to excel in specialized downstream tasks. Decoder-only architectures, originally engineered for next-token prediction, are now dominant in open and commercial LLMs due to their flexible scaling and broad applicability. Finetuning these models for new domains and modalities, while balancing resource constraints, parameter efficiency, and task generalization, has become a central focus in contemporary NLP research.
1. Architectures and Modalities in Decoder-Only Fine-Tuning
Decoder-only LLMs are defined by their causal, autoregressive generation mechanisms, in which each output token prediction depends only on previously generated tokens. Recent research extends their use far beyond conventional language modeling, targeting tasks such as speech-to-text translation, code search, semantic graph parsing, and text-to-image generation.
In the context of speech-to-text, approaches such as Speech-LLaMA adopt a deep integration paradigm, mapping continuous acoustic features—preprocessed using CTC compression and an audio encoder—directly into the semantic space of a pretrained LLM. This enables conditioning not only on textual prompts but also on rich, continuous speech embeddings, with the probability of generating a sequence defined in an autoregressive manner. End-to-end, from-scratch variants further simplify the pipeline, training a convolutional 2D encoder paired with a smaller decoder-only network with no explicit text encoder, demonstrating parameter efficiency and end-to-end learnability (Wu et al., 2023).
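This conditioning can be written as the usual autoregressive factorization, here with $s$ standing for the CTC-compressed acoustic embeddings and $p$ for the textual prompt (notation chosen for illustration rather than taken from the cited papers):

$$P(y \mid s, p) = \prod_{t=1}^{T} P\left(y_t \mid y_{<t},\, s,\, p\right)$$

Each output token thus attends to both modalities through the same decoder stack, without a separate text encoder.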
In code search, decoder-only LLMs are adapted—sometimes with causal masking replaced by bidirectional attention mechanisms via Masked Next Token Prediction (MNTP)—to serve embedding and retrieval tasks, demonstrating new structural versatility (Chen et al., 29 Oct 2024). For semantic parsing tasks (e.g., AMR), fine-tuned models such as LLaMA 3.2 with Grouped-Query Attention (GQA) and DeepSeek R1 with chain-of-thought pretraining achieve SMATCH F1 scores rivaling state-of-the-art parsers, confirming robust representational expressivity (Ho, 7 Aug 2025).
2. Parameter-Efficient Fine-Tuning Strategies
To circumvent the prohibitive resource requirements of full-model fine-tuning, parameter-efficient fine-tuning (PEFT) techniques have become standard. These approaches introduce small, trainable adapter modules or low-rank updates within the frozen pretrained backbone, drastically reducing the number of updated parameters while maintaining, and sometimes improving, downstream performance.
Low-Rank Adaptation (LoRA) and its quantized variant (QLoRA) inject rank-constrained updates into weight matrices (e.g., $W_q$, $W_k$, $W_v$, $W_o$) in the transformer's attention modules. Only the adapter factors $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ (often with rank $r \ll \min(d, k)$) are fine-tuned, with quantization (4-, 8-, or 16-bit precision) applied to further reduce the memory footprint. Systematic evaluation shows that increasing the PEFT rank and the quantization precision yields improvements on low-resource languages and harder tasks; for tasks such as MLQA and XLSUM, PEFT-finetuned LLaMA-2-7B and Mistral-7B approach or surpass much larger proprietary models (Aggarwal et al., 15 Jan 2024).
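As a minimal sketch of the low-rank update (a generic PyTorch module, not the implementation used in the cited work), a LoRA-augmented linear layer keeps the pretrained weight frozen and learns only the factors $A$ and $B$:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Example: wrap the query projection of one attention block (dimensions illustrative).
q_proj = nn.Linear(4096, 4096)
q_proj_lora = LoRALinear(q_proj, r=8, alpha=16)
out = q_proj_lora(torch.randn(2, 16, 4096))
print(out.shape)  # torch.Size([2, 16, 4096])
```

Only `A` and `B` receive gradients, so the optimizer state scales with the adapter size rather than the full model.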
LayerNorm and Attention (LNA) fine-tuning selectively updates only the normalization and multi-head self-attention submodules, further shrinking the set of adapted parameters. For speech translation, LNA outperforms LoRA, achieving BLEU scores of 37.1 (CoVoST 2) and 23.4 (FLEURS) with minimal compute (Huang et al., 3 Jul 2024).
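A rough sketch of LNA-style selective unfreezing, assuming a Hugging Face-style decoder whose parameter names contain identifiable substrings (the name filters below are assumptions and depend on the actual model):

```python
import torch.nn as nn

def apply_lna_finetuning(model: nn.Module,
                         trainable_keywords=("layernorm", "layer_norm", "self_attn")):
    """Freeze all parameters except those in LayerNorm or self-attention submodules,
    approximating LayerNorm-and-Attention (LNA) fine-tuning."""
    trainable, frozen = 0, 0
    for name, param in model.named_parameters():
        if any(key in name.lower() for key in trainable_keywords):
            param.requires_grad = True
            trainable += param.numel()
        else:
            param.requires_grad = False
            frozen += param.numel()
    print(f"trainable params: {trainable:,} | frozen params: {frozen:,}")
    return model
```

The same pattern extends to other selective schemes by changing the keyword filter.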
Deconvolution in Subspace (DCFT) generalizes LoRA by learning low-rank adjustments and then refining them with deconvolution (transposed convolution) for subspace feature reconstruction, reducing trainable parameters by up to 8x compared to LoRA, often with superior accuracy (Zhang et al., 3 Mar 2025).
| PEFT Technique | Adapted Parameters | Typical Rank/Setting | Notable Benefit |
|---|---|---|---|
| LoRA/QLoRA | Attention, FFN weights | Low rank $r \ll d$; 4-, 8-, or 16-bit quantization | Memory, compute efficiency |
| LNA | LayerNorm, attention only | N/A | Robust, outperforms LoRA (speech) |
| DCFT | Low-rank + deconv. features | Kernel size adjustable | Up to 8x fewer parameters |
3. Attention Mask Manipulation and Context Utilization
Decoder-only LLMs rely on causal masking, which restricts each token’s receptive field to its leftward context. For tasks demanding bidirectional context—such as sequence labeling or dense retrieval—new fine-tuning paradigms partially or selectively remove causal masking.
One method partitions decoder blocks into groups and experiments with different layer-wise unmasking patterns (encoded as binary vectors) to identify optimal tradeoffs; a schematic of this layer-group unmasking follows below. Selective, rather than full, unmasking yields F1 improvements on NER, aspect-based sentiment analysis, and trigger classification, often surpassing encoder-based and instruction-tuned baselines (Dukić et al., 25 Jan 2024). Improvements correlate with the Right-side Dependency Relations Ratio (RDRR); tasks with higher RDRR benefit more from rightward context. However, indiscriminate full unmasking can harm tasks with little right-context dependency.
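A minimal illustration of layer-wise selective unmasking, where a binary pattern over layer groups decides whether each layer sees a causal or a fully bidirectional mask (a generic sketch; the grouping and pattern search in the cited work may differ):

```python
import torch

def build_layer_masks(seq_len: int, num_layers: int, unmask_pattern: list):
    """Return one attention mask per layer: 1 in the pattern => bidirectional (all ones),
    0 => standard causal (lower-triangular) mask. The pattern indexes equally sized
    layer groups, e.g. [0, 0, 1, 1] unmasks the top half of the network."""
    group_size = num_layers // len(unmask_pattern)
    causal = torch.tril(torch.ones(seq_len, seq_len))
    bidirectional = torch.ones(seq_len, seq_len)
    masks = []
    for layer_idx in range(num_layers):
        group = min(layer_idx // group_size, len(unmask_pattern) - 1)
        masks.append(bidirectional if unmask_pattern[group] else causal)
    return masks

# Example: 8-layer decoder, unmask only the top two layer groups.
layer_masks = build_layer_masks(seq_len=16, num_layers=8, unmask_pattern=[0, 0, 1, 1])
print(sum(int(m.all()) for m in layer_masks), "of", len(layer_masks), "layers fully unmasked")
```

Searching over such binary patterns is what exposes the tradeoff between preserving the pretrained causal behavior and injecting rightward context.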
Switching to bidirectional attention is also crucial when adapting LLMs for embedding-based retrieval (as in code search). MNTP pre-training and supervised contrastive fine-tuning allow decoder-only models to approach or exceed the retrieval accuracy of encoder-only models (Chen et al., 29 Oct 2024).
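The supervised contrastive stage can be sketched as a standard in-batch InfoNCE objective over paired query and code embeddings (a generic formulation, not necessarily the exact loss used in the cited work):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              code_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: the i-th query should match the i-th code
    snippet; all other snippets in the batch serve as negatives."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Example with random embeddings standing in for pooled decoder hidden states.
loss = in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```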
4. Scaling Laws and Model Adaptation
Decoder-only models exhibit scaling laws in which test loss follows inverse power laws in model size $N$ or data size $D$, e.g. $L(N) \approx (N_c / N)^{\alpha_N}$, and more generally $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$. Scaling-law fits accurately predict test loss trends in regimes with sufficient model sizes and in-domain data, but extrapolation can become unreliable for model sizes beyond the training set or shifts in domain/language direction. Width-scaling (increasing hidden dimension) generally yields better hardware efficiency and throughput than depth-scaling (adding layers), though both offer similar accuracy gains per FLOP (Caillaut et al., 23 Sep 2024).
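For illustration, such a power-law-plus-offset fit of test loss against model size can be obtained with an ordinary least-squares routine; the measurements below are synthetic and serve only to show the procedure:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    """L(N) = a * N^(-alpha) + c: inverse power law with an irreducible-loss offset."""
    return a * np.power(n, -alpha) + c

# Synthetic (model size, test loss) pairs standing in for real measurements.
sizes = np.array([1e8, 3e8, 1e9, 3e9, 7e9])
losses = np.array([3.10, 2.85, 2.62, 2.45, 2.36])

params, _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 2.0), maxfev=10000)
a, alpha, c = params
print(f"fitted exponent alpha={alpha:.3f}, irreducible loss c={c:.2f}")
# Extrapolation beyond the fitted size range should be treated with caution.
print("predicted loss at 13B params:", power_law(1.3e10, *params))
```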
A plausible implication is that practitioners should determine the optimal scale by balancing model capacity, language/domain coverage, and available computation, with PEFT methods providing cost-effective pathways for adaptation even in resource-constrained settings.
5. Modality and Task-Specific Integration
Multimodal adaptation of decoder-only models leverages architectural innovations to align heterogeneous input representations. Continuous speech features, when processed via well-designed adapters or compressed through CTC, can be merged with text embeddings in the semantic space of the decoder (Wu et al., 2023, Huang et al., 3 Jul 2024). In text-to-image generation, adapter modules fuse LLM-derived semantic representations with diffusion model cross-attention modules, with only lightweight fusion weights trained, resulting in notable improvements in semantic control and generation quality (Dong et al., 6 Feb 2025).
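A schematic of the modality fusion step, assuming the acoustic encoder output has already been CTC-compressed; the projector architecture and the concatenation with text embeddings below are illustrative assumptions rather than the exact design of the cited systems:

```python
import torch
import torch.nn as nn

class SpeechToLLMAdapter(nn.Module):
    """Project compressed acoustic features into the decoder's embedding space
    so they can be concatenated with text-prompt embeddings."""

    def __init__(self, speech_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, T_speech, speech_dim) after CTC compression
        # prompt_embeds: (batch, T_prompt, llm_dim) from the LLM's embedding table
        speech_embeds = self.proj(speech_feats)
        # The decoder then attends over [prompt; speech] as a single prefix.
        return torch.cat([prompt_embeds, speech_embeds], dim=1)

adapter = SpeechToLLMAdapter()
prefix = adapter(torch.randn(2, 50, 512), torch.randn(2, 12, 4096))
print(prefix.shape)  # torch.Size([2, 62, 4096])
```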
For multilingual NMT, decoder-only models underperform when trained in a vanilla manner on parallel data, which is attributed to weak language-transfer capacity. The two-stage decoder-only (TDO) architecture mitigates this by splitting the process: stage one aligns source representations with the target language (excluding target tokens), and stage two fuses in the target tokens, optionally with an additional contrastive loss (InstruCL) imposed on translation-instruction representations. This yields substantial gains in zero-shot translation (up to +3.39 BLEU, +6.99 chrF++, +3.22 BERTScore, +4.81 COMET) (Qu et al., 3 Dec 2024).
6. Evaluation, Generalization, and Future Directions
AMR parsing experiments demonstrate that straightforward LoRA-based finetuning of decoder-only LLMs (with thoughtful architectural selection, e.g., GQA or chain-of-thought training) can match or closely approach specialized state-of-the-art parsers, as shown by SMATCH F1 scores (0.804 for LLaMA 3.2, matching APT + Silver and approaching Graphene Smatch at 0.854). Notably, different models trade off between semantic fidelity and structural validity, suggesting optimizations may be tailored to downstream graph depth and complexity (Ho, 7 Aug 2025).
Advanced embedding tasks reveal the significance of input modification over architectural overhaul: Causal2Vec, for example, prepends a Contextual token from a lightweight bidirectional encoder to the input, enabling each token to indirectly capture full context and improving both efficiency (sequence length reduced by up to 85%, inference time by up to 82%) and performance on benchmarks (Lin et al., 31 Jul 2025).
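A simplified sketch of that input modification: a lightweight bidirectional encoder pools the full input into a single Contextual embedding, which is prepended to the decoder's input embeddings (module sizes and mean pooling are assumptions for illustration; in the actual method the decoder input can also be shortened, which is omitted here):

```python
import torch
import torch.nn as nn

class ContextualTokenPrepender(nn.Module):
    """Compress the input with a small bidirectional encoder into one 'Contextual'
    embedding and prepend it to the decoder's input embeddings."""

    def __init__(self, dim: int = 768, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq, dim) embeddings of the input text.
        encoded = self.encoder(token_embeds)             # bidirectional self-attention
        contextual = encoded.mean(dim=1, keepdim=True)   # pool into one token (assumption)
        # Prepend so every later token can attend to the pooled context under a causal mask.
        return torch.cat([contextual, token_embeds], dim=1)

prepender = ContextualTokenPrepender()
out = prepender(torch.randn(4, 32, 768))
print(out.shape)  # torch.Size([4, 33, 768])
```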
A key pattern is the ongoing search for mechanisms that combine parameter efficiency, stable generalization across languages and modalities, and task-specific adaptation—all under the cost constraints posed by ever-larger LLMs. Future work includes dynamic adapter allocation, hybrid attention strategies, improved architectural transfer for zero-shot tasks, and robust PEFT for very large models.
In summary, finetuning decoder-only LLMs now encompasses a rich set of methodologies—parameter-efficient adaptation, attention manipulation, multimodal integration, scaling optimization, and embedding augmentation—enabling these architectures to excel and generalize across a growing spectrum of data regimes, languages, and downstream application domains.