Adapter-Based Finetuning

Updated 6 March 2026
  • Adapter-based finetuning is a transfer learning paradigm that injects compact, trainable modules into frozen pre-trained models, reducing the number of trainable parameters.
  • This method significantly lowers memory and computation costs while maintaining strong performance across NLP, speech, computer vision, and multimodal tasks.
  • It enables modular task-specific updates and mitigates catastrophic forgetting, offering scalable and adaptable fine-tuning for diverse domains.

Adapter-based finetuning is a parameter-efficient transfer learning paradigm that injects lightweight, trainable modules—adapters—into frozen, large-scale pre-trained models. Instead of updating all model parameters for each downstream task, only the adapter modules are trained, yielding substantial memory and computation savings while retaining, and often improving, task performance and generalization. Adapter-based finetuning has been adopted and rigorously evaluated across natural language processing, speech, computer vision, and multimodal domains, with evolving architectural variants and empirical insights.

1. Concept and Motivation

Adapter modules implement a compact two-layer bottleneck: an input is down-projected to a low-dimensional subspace, processed with a nonlinearity, then re-projected to the original dimension and added back via a residual connection. For a hidden state $h \in \mathbb{R}^d$, a typical adapter applies:

$$h' = h + W_\text{up}\, \sigma(W_\text{down}\, h)$$

where $W_\text{down} \in \mathbb{R}^{r \times d}$, $W_\text{up} \in \mathbb{R}^{d \times r}$, $r \ll d$, and $\sigma$ is a nonlinearity (e.g., ReLU, tanh). Only the adapter weights are updated during finetuning; all backbone parameters (e.g., Transformer blocks) remain frozen.
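
A minimal PyTorch sketch of this bottleneck (the module name, the choice of ReLU, and the near-zero initialization of the up-projection are illustrative conventions, not taken from any single cited paper):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Two-layer bottleneck adapter: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, d_model: int, r: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, r)  # W_down: d -> r
        self.up = nn.Linear(r, d_model)    # W_up:   r -> d
        self.act = nn.ReLU()
        # Initialize the up-projection near zero so the adapted model starts out
        # (almost) identical to the frozen backbone.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h' = h + W_up * sigma(W_down * h)
        return h + self.up(self.act(self.down(h)))
```

With d = 768 and r = 64, one such module adds roughly 2 × 768 × 64 ≈ 100K parameters per insertion point, a small fraction of a full Transformer layer.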

This design offers several advantages:

  • Parameter efficiency: only the adapter weights, a small fraction of the model, are trained and stored per task.
  • Lower memory and compute: optimizer state and gradients are needed only for the adapters, substantially reducing finetuning cost.
  • Modularity: task-specific knowledge is packaged in small, swappable modules that share one frozen backbone.
  • Stability: freezing the backbone mitigates catastrophic forgetting of generic pre-trained representations.

2. Core Architectures and Variants

The classical adapter, as popularized by Houlsby et al., is a two-layer serial bottleneck inserted after the self-attention and feed-forward sublayers in each Transformer block (Mundra et al., 2023). Numerous variants and extensions adjust the placement, internal structure, and routing of these modules.

In computer vision, vision-specific adapters integrate convolutions or multi-scale filters (e.g., Mona) (Yin et al., 2023), and block-specific designs with dynamic routing and prompt generators (e.g., Adapter-X) (Li et al., 2024) have demonstrated significant gains.
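
As a schematic of the Houlsby-style serial placement, the following sketch wraps a generic post-LayerNorm Transformer layer and reuses the BottleneckAdapter from the sketch above; the layer layout and hyperparameters are assumptions for illustration rather than the configuration of any cited variant:

```python
import torch.nn as nn

class AdaptedTransformerLayer(nn.Module):
    """Post-LN Transformer layer with a serial adapter after each sublayer."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072, r: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One adapter after the self-attention sublayer, one after the feed-forward
        # sublayer (BottleneckAdapter is the module defined in the sketch above).
        self.adapter_attn = BottleneckAdapter(d_model, r)
        self.adapter_ffn = BottleneckAdapter(d_model, r)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.adapter_attn(attn_out))
        x = self.norm2(x + self.adapter_ffn(self.ffn(x)))
        return x
```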

3. Insertion Policies and Freezing Strategies

Adapters are typically inserted after the self-attention and feed-forward sublayers of each Transformer block, following the serial placement described above.

All original backbone weights (attention, MLPs, embeddings, positional encodings) are frozen. Only the adapter weights (and sometimes task-specific head layers) are trainable. This strict freezing is central to memory and compute efficiency and prevents catastrophic shifts in generic representations (He et al., 2021, Eichenberg et al., 2021). Selective adapter freezing (SAFE) further improves memory/computation efficiency by dynamically freezing unimportant adapters during training using activation similarity metrics (CKA) (Son et al., 2024).
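
A sketch of this freezing discipline, together with a linear CKA score of the kind SAFE-style selection relies on (the keyword-based parameter matching and the helper names are assumptions for illustration):

```python
import torch

def freeze_backbone_except(model, trainable_keywords=("adapter", "head")):
    """Freeze every parameter whose name contains none of the given keywords."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)

def trainable_fraction(model):
    """Report the trainable share of parameters (typically a few percent with adapters)."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two activation matrices of shape (n_samples, features).
    A high score for activations before/after an adapter suggests the adapter
    changes little and is a candidate for freezing."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    return (x.T @ y).norm() ** 2 / ((x.T @ x).norm() * (y.T @ y).norm())
```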

4. Training Objectives, Optimization, and Resource Profiles

Adapter-based finetuning adopts the standard task loss (cross-entropy, CTC, MSE, contrastive, etc.) as in full fine-tuning (Layoun et al., 2022, Hsieh et al., 2022, Kim et al., 2024). The optimizer (typically Adam or AdamW) and schedules usually mirror full fine-tuning but employ higher learning rates for adapter parameters—often 5–10× main-model fine-tuning rates due to the reduced parameter count (Le et al., 2021).
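
For concreteness, a hedged sketch of such an optimizer setup; the learning-rate values and the name-based parameter grouping are assumptions for illustration, not settings reported in the cited papers:

```python
import torch

def build_adapter_optimizer(model, adapter_lr=1e-4, other_lr=1e-4, weight_decay=0.01):
    """AdamW over only the trainable parameters, with the adapter group given a
    learning rate several times a typical full fine-tuning rate (e.g. 1e-5 to 2e-5)."""
    adapter_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen backbone weights never enter the optimizer
        (adapter_params if "adapter" in name else other_params).append(param)
    return torch.optim.AdamW(
        [
            {"params": adapter_params, "lr": adapter_lr},
            {"params": other_params, "lr": other_lr},  # e.g. the task-specific head
        ],
        weight_decay=weight_decay,
    )
```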

Resource footprint is consistently lower than that of full fine-tuning: only a small fraction of parameters (often a few percent or less) is trained, so optimizer state, gradient storage, and per-task checkpoint size shrink accordingly.

Care must be taken with batch size and learning rate to maintain efficiency, especially in high-throughput or streaming applications (Bai et al., 2024, Hsieh et al., 2022).

5. Empirical Performance Across Domains

Language: Adapter-tuned models routinely match or slightly underperform full fine-tuning on large-scale NLU benchmarks (GLUE, SuperGLUE) within 0.5–2.0 points, but show marked superiority in low-resource (He et al., 2021, Chen et al., 2024), cross-lingual (Le et al., 2021), and multi-task settings (Gong et al., 3 Sep 2025, Son et al., 2024).

Speech: Adapter-based methods in ASR and speech processing deliver WER reductions (12.2% on average in challenging multilingual dictation while adding only 0.4% parameters per language (Bai et al., 2024)), outperform or match full fine-tuning across ASR, speaker/intent/emotion tasks, and enable rapid, modular adaptation (Hsieh et al., 2022, Inoue et al., 2024, Suresh et al., 2024).

Vision and Vision-Language: In visual tasks, advanced adapter designs such as Mona and Adapter-X match or exceed full fine-tuning in image classification, detection, and segmentation—sometimes at less than 2% of trainable parameters (Yin et al., 2023, Li et al., 2024). For VLMs and segmentation, VLSM-Adapter and R-Adapter enable robust, OOD-resistant finetuning with strong gains in both data-rich and few-shot/zero-shot settings (Dhakal et al., 2024, Kim et al., 2024).

Multimodal/Few-shot/Hierarchical: Adapter-based finetuning paired with attribute prompts and hierarchical regularization achieves state-of-the-art on few-shot VLM transfer and robust multimodal alignment (Zhao et al., 15 Aug 2025). Gate-controlled, structure-learning adapters yield superior accuracy and task-dependent efficiency (Gong et al., 3 Sep 2025).

Summary of typical quantitative results (task metric and fraction of trainable parameters):

Model/task                 | Adapter perf. | Full-tune perf. | Adapter param % | Source
RoBERTa-base, GLUE (avg)   | 85.6          | 86.4            | 8.9             | (Chen et al., 2024)
ELECTRA, SuperGLUE         | 0.782         | 0.750           | 2–5             | (Siddiqui et al., 14 Jan 2025)
WavLM ASR, WER (%)         | 9.39          | 9.41            | 10              | (Inoue et al., 2024)
Mona, COCO instance seg.   | AP = 53.4     | AP = 52.4       | 4.7             | (Yin et al., 2023)
Adapter-X, VTAB            | 76.2          | 68.9            | 0.2             | (Li et al., 2024)
CLIP, ImageNet OOD acc.    | 54.3          | 44.2            | 13              | (Kim et al., 2024)

6. Limitations and Trade-offs

Despite strong parameter efficiency, adapters can incur higher training compute and slightly increased inference latency versus full fine-tuning for moderate-size models (up to several hundred million parameters), mainly because the backward pass must still traverse the frozen backbone to reach each adapter and the serial adapter layers add computation at inference (Mundra et al., 2023). In these regimes, multi-task full fine-tuning may match or surpass adapters in total resource cost and maintainability. For extremely large models (LLMs, ViTs), adapter-based approaches remain among the most tractable options for scalable, modular, and continually adaptive finetuning.

Certain tasks and domains—especially extremely small data settings or those demanding architectural reconfiguration—may require refined adapter placement, hybrid PEFT, or adapters with dynamic insertion and activation (Gong et al., 3 Sep 2025, Li et al., 2024).

Implementation is supported by libraries such as AdapterHub and HuggingFace Transformers, as well as task-specific frameworks. Common practice is to freeze the backbone entirely, choose a bottleneck dimension suited to the task and data size, use a higher learning rate for adapter parameters than for full fine-tuning, and store per-task adapters as small, swappable checkpoints, as illustrated below.
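
As a concrete illustration of the modular workflow, the following sketch saves and swaps only the adapter (and head) weights; the name-based filtering is a simplification, and AdapterHub-style libraries ship dedicated utilities for this:

```python
import torch

def save_adapter_checkpoint(model, path):
    """Persist only adapter/head weights: a few megabytes rather than a full model copy."""
    adapter_state = {
        name: tensor
        for name, tensor in model.state_dict().items()
        if "adapter" in name or "head" in name
    }
    torch.save(adapter_state, path)

def load_adapter_checkpoint(model, path):
    """Swap a task-specific adapter into the shared frozen backbone."""
    adapter_state = torch.load(path, map_location="cpu")
    # strict=False leaves backbone entries untouched and replaces only the
    # adapter/head entries present in the checkpoint.
    model.load_state_dict(adapter_state, strict=False)
    return model
```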

Recent architectural and analytical advances include frequency-aware adapters (FAA) with dynamic channel modulation (Bae et al., 26 Dec 2025), hyperbolic attribute bridging for one-to-many VLM mapping (Zhao et al., 15 Aug 2025), and unified adapters for multi-task and continual learning (Inoue et al., 2024, Son et al., 2024).

References

  • "MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning" (Eichenberg et al., 2021)
  • "Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers" (Hsieh et al., 2022)
  • "Structure-Learnable Adapter Fine-Tuning for Parameter-Efficient LLMs" (Gong et al., 3 Sep 2025)
  • "ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks" (Inoue et al., 2024)
  • "Adapter is All You Need for Tuning Visual Tasks" (Yin et al., 2023)
  • "Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning" (Zhao et al., 15 Aug 2025)
  • "VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks" (Dhakal et al., 2024)
  • "Lightweight Adapter Tuning for Multilingual Speech Translation" (Le et al., 2021)
  • "Parameter-Efficient Fine-Tuning With Adapters" (Chen et al., 2024)
  • "Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision" (Li et al., 2024)
  • "AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks" (Liang et al., 2024)
  • "A Comprehensive Analysis of Adapter Efficiency" (Mundra et al., 2023)
  • "Towards Efficient Post-Training via Fourier-Driven Adapter Architectures" (Bae et al., 26 Dec 2025)
  • "Not All Adapters Matter: Selective Adapter Freezing for Memory-Efficient Fine-Tuning of LLMs" (Son et al., 2024)
  • "An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks" (Suresh et al., 2024)
  • "Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR" (Bai et al., 2024)
  • "Comparative Analysis of Efficient Adapter-Based Fine-Tuning of State-of-the-Art Transformer Models" (Siddiqui et al., 14 Jan 2025)
  • "On the Effectiveness of Adapter-based Tuning for Pretrained LLM Adaptation" (He et al., 2021)
  • "Efficient and Versatile Robust Fine-Tuning of Zero-shot Models" (Kim et al., 2024)