MDAPT with Prototype-Based HyperAdapters

Updated 11 December 2025
  • The paper introduces Prototype-Based HyperAdapters to generate adapters on-the-fly, drastically reducing per-task storage while achieving high performance.
  • It employs instance-dense retrieval and contrastive losses to cluster and refine domain prototypes for effective multi-domain adaptation.
  • Empirical results show that the method maintains robust performance with just 616K parameters versus 220M in full fine-tuning, excelling in low-data settings.

Adapter-Based Parameter-Efficient MDAPT—Prototype-Based HyperAdapters

Adapter-based parameter-efficient Multi-Domain AdapTation (MDAPT) using Prototype-Based HyperAdapters (PHA) is a method for efficiently adapting pre-trained language models (PLMs) to a growing set of distinct domains or tasks. Instead of fine-tuning the whole model, it trains a small set of parameters and generates adapters dynamically via a hypernetwork and prototype-driven retrieval, which yields strong generalization and sample efficiency, especially in low-data and multi-domain settings (Zhao et al., 2023).

1. Architectural Foundations of Prototype-Based HyperAdapters

PHA extends classical adapter-tuning by inserting lightweight bottleneck adapters into every transformer layer, but avoids per-task storage and retraining by generating these adapters on-the-fly through a shared prototypical hypernetwork. For PLMs like T5-Base ($\theta \in \mathbb{R}^{220M}$), the backbone (encoder, decoder) remains frozen, while each layer $m$ contains a two-layer adapter $A^m$ consisting of:

  • Down-projection $U^m \in \mathbb{R}^{b \times d}$
  • Nonlinearity (ReLU)
  • Up-projection $D^m \in \mathbb{R}^{d \times b}$

Here, $d$ is the hidden size (e.g., 768) and $b$ is the adapter bottleneck dimension ($b \ll d$). Adapter weights per layer and task/domain are not stored explicitly; instead, they are generated by a single hypernetwork $H_w$, which takes as input a learned task/domain prototype $k_i \in \mathbb{R}^{d'}$ and a layer embedding.

This design sharply reduces the parameter count required as the number of tasks/domains grows. The stored parameters comprise the prototypes (one $d'$-dimensional vector per task/domain), the layer embeddings (one per transformer layer), and the shared hypernetwork, whereas separate per-task/domain adapters require a full pair $(U^m, D^m)$ for every combination of task/domain and transformer layer.
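The scaling argument can be made concrete with a small counting sketch. All sizes below (bottleneck 64, 12 adapted layers, 64-dimensional prototypes and layer embeddings, a 500K-parameter hypernetwork) are hypothetical placeholders chosen for illustration, not values reported by the paper, and biases are ignored.

```python
def adapter_pair_params(d: int, b: int) -> int:
    """Weights in one bottleneck adapter: U^m (b x d) plus D^m (d x b)."""
    return 2 * b * d


def per_task_adapter_storage(num_tasks: int, num_layers: int, d: int, b: int) -> int:
    """Separate adapters: one (U^m, D^m) pair per task and per layer."""
    return num_tasks * num_layers * adapter_pair_params(d, b)


def shared_hypernet_storage(num_tasks: int, num_layers: int, proto_dim: int,
                            layer_emb_dim: int, hypernet_params: int) -> int:
    """Shared scheme: prototypes + layer embeddings + one hypernetwork."""
    return num_tasks * proto_dim + num_layers * layer_emb_dim + hypernet_params


# Hypothetical sizes (assumptions for illustration only).
D_MODEL, BOTTLENECK, NUM_LAYERS = 768, 64, 12
PROTO_DIM, LAYER_EMB_DIM, HYPERNET_PARAMS = 64, 64, 500_000


def marginal_cost_of_new_task(num_tasks: int) -> tuple[int, int]:
    """Extra parameters needed when going from num_tasks to num_tasks + 1."""
    separate = (per_task_adapter_storage(num_tasks + 1, NUM_LAYERS, D_MODEL, BOTTLENECK)
                - per_task_adapter_storage(num_tasks, NUM_LAYERS, D_MODEL, BOTTLENECK))
    shared = (shared_hypernet_storage(num_tasks + 1, NUM_LAYERS, PROTO_DIM,
                                      LAYER_EMB_DIM, HYPERNET_PARAMS)
              - shared_hypernet_storage(num_tasks, NUM_LAYERS, PROTO_DIM,
                                        LAYER_EMB_DIM, HYPERNET_PARAMS))
    return separate, shared


separate_cost, shared_cost = marginal_cost_of_new_task(12)
print(f"extra params for one more task: separate={separate_cost}, shared={shared_cost}")
```

Under these assumed sizes, each additional task costs a full stack of adapters in the separate scheme but only one prototype vector in the shared scheme.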

2. Instance-Dense Retrieval and Prototype Learning

Task/domain prototypes are discovered and refined via an instance-dense retriever:

  • Each instance is mapped to a latent embedding (its retrieval vector) by a dense MLP.
  • Retrieval vectors are supervised by an InfoNCE contrastive loss, which pulls instance vectors from the same task/domain together in latent space and pushes vectors from different tasks/domains apart.
  • Prototype vectors $k_i$ are further optimized via a separate prototypical contrastive loss to make them more representative and more discriminative.
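As a rough illustration of how a contrastive retriever objective of this kind behaves, the sketch below implements a generic supervised InfoNCE loss in NumPy. The function name, the cosine-similarity scoring, and the temperature of 0.1 are assumptions made for this example; the exact formulation used in the paper may differ.

```python
import numpy as np


def supervised_info_nce(z: np.ndarray, labels: np.ndarray, temperature: float = 0.1) -> float:
    """Generic supervised InfoNCE over a batch of retrieval vectors.

    z      : (n, dim) instance embeddings
    labels : (n,) task/domain id of each instance
    Instances sharing a label are positives; all other instances are negatives.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # cosine similarity
    sim = (z @ z.T) / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)               # never contrast with self

    # Row-wise log-softmax over all other instances (numerically stable).
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_norm

    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = positives.any(axis=1)                       # skip anchors with no positive
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / positives.sum(axis=1)[has_pos]
    return float(per_anchor.mean())


# Two tight clusters: the loss is small when labels match the clusters
# and large when labels cut across them.
z = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1]])
loss_aligned = supervised_info_nce(z, np.array([0, 0, 1, 1]))
loss_misaligned = supervised_info_nce(z, np.array([0, 1, 0, 1]))
print(loss_aligned, loss_misaligned)
```

Minimizing such a loss rewards embeddings in which same-task instances are each other's nearest neighbours, which is the clustering behaviour described above.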

At inference, a new instance is mapped into the same latent space, and the nearest prototype (by cosine similarity) or the top-$K$ prototypes are selected to condition the hypernetwork, allowing rapid domain selection even in the absence of explicit domain labels.
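The inference-time lookup can be sketched as a cosine-similarity search over the stored prototypes. The helper name `retrieve_prototypes` and the toy two-dimensional vectors are hypothetical and only illustrate the selection step.

```python
import numpy as np


def retrieve_prototypes(z: np.ndarray, prototypes: np.ndarray, k: int = 1):
    """Return indices and cosine similarities of the k prototypes closest to z.

    z          : (dim,) retrieval vector of a new instance
    prototypes : (num_domains, dim) stored prototype vectors
    """
    z_unit = z / np.linalg.norm(z)
    p_unit = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = p_unit @ z_unit                    # cosine similarity to every prototype
    top = np.argsort(-sims)[:k]               # highest similarity first
    return top, sims[top]


toy_prototypes = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
nearest, nearest_sims = retrieve_prototypes(np.array([0.9, 0.2]), toy_prototypes, k=2)
print(nearest, nearest_sims)
```

With `k=1` this picks a single domain; larger `k` returns the candidates for a prototype mixture.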

3. Adapter Parameter Generation via Prototypical Hypernetwork

  • Each stabilized prototype $k_i$ is concatenated with a layer embedding and projected before being passed to the hypernetwork $H_w$, which yields domain- and layer-specific adapter parameters $(U^m, D^m)$.
  • The layer output is modified as:

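$$A^m(h) = h + D^m\,\mathrm{ReLU}(U^m h)$$

where $h$ is the hidden state entering the adapter at layer $m$.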

  • Only the hypernetwork $H_w$, the set of prototypes $\{k_i\}$, and the layer embeddings are stored; the full adapter parameter tensors are generated at run-time.
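The generation step can be illustrated with a deliberately simple hypernetwork. In this sketch the hypernetwork is a single pair of linear maps, the intermediate projection mentioned above is omitted, and the adapter uses a residual connection; the class name `LinearHyperAdapter` and all dimensions are assumptions for illustration rather than the paper's architecture.

```python
import numpy as np


class LinearHyperAdapter:
    """Toy hypernetwork: (prototype, layer embedding) -> adapter weights (U, D)."""

    def __init__(self, d, b, proto_dim, layer_dim, num_layers, seed=0):
        rng = np.random.default_rng(seed)
        self.d, self.b = d, b
        # Stored parameters: layer embeddings plus the hypernetwork weights.
        self.layer_emb = rng.normal(0.0, 0.02, (num_layers, layer_dim))
        in_dim = proto_dim + layer_dim
        self.w_down = rng.normal(0.0, 0.02, (in_dim, b * d))
        self.w_up = rng.normal(0.0, 0.02, (in_dim, d * b))

    def generate(self, prototype, layer):
        """Produce the down-projection U (b x d) and up-projection D (d x b)."""
        cond = np.concatenate([prototype, self.layer_emb[layer]])
        U = (cond @ self.w_down).reshape(self.b, self.d)
        D = (cond @ self.w_up).reshape(self.d, self.b)
        return U, D


def adapter_forward(h, U, D):
    """Bottleneck adapter with an assumed residual: h + ReLU(h U^T) D^T."""
    return h + np.maximum(h @ U.T, 0.0) @ D.T


hyper = LinearHyperAdapter(d=16, b=4, proto_dim=8, layer_dim=8, num_layers=3)
prototype = np.ones(8)                       # stand-in for a learned prototype k_i
U0, D0 = hyper.generate(prototype, layer=0)
hidden = np.random.default_rng(1).normal(size=(5, 16))
adapted = adapter_forward(hidden, U0, D0)
print(U0.shape, D0.shape, adapted.shape)
```

Nothing task-specific is kept besides the prototype itself: swapping `prototype` for another vector yields a different adapter from the same stored weights.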

4. Parameter and Storage Efficiency

The crucial efficiency properties are as follows:

| Approach | Trainable Params | % of Full Fine-Tuning (T5-Base) | Avg. GLUE+SG Score |
|---|---|---|---|
| Full fine-tune | 220M | 100% | 84.9% |
| Standard adapters | 1.9M | 0.86% | 84.5% |
| Hyperformer++ | 638K | 0.29% | 84.7% |
| HyperDecoder | 1.8M | 0.82% | 83.7% |
| PHA (MDAPT, 12 tasks) | 616K | 0.28% | 85.5% |

At scale (the 12-task T5-Base setting in the table above), PHA requires only about 616K trainable parameters, or 0.28% of full fine-tuning.
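The percentages follow directly from the parameter counts. The short check below recomputes them from the rounded figures quoted in the table, taking 220M as the size of the full model.

```python
FULL_FINE_TUNE_PARAMS = 220_000_000

# Trainable-parameter counts as quoted in the table (rounded values).
trainable_params = {
    "Standard adapters": 1_900_000,
    "Hyperformer++": 638_000,
    "HyperDecoder": 1_800_000,
    "PHA": 616_000,
}

# Share of the full model that each approach trains, in percent.
percent_of_full = {
    name: round(100 * count / FULL_FINE_TUNE_PARAMS, 2)
    for name, count in trainable_params.items()
}

# How many times fewer parameters PHA trains than full fine-tuning.
pha_reduction_factor = FULL_FINE_TUNE_PARAMS / trainable_params["PHA"]

for name, pct in percent_of_full.items():
    print(f"{name}: {pct}% of full fine-tuning")
print(f"PHA trains roughly {pha_reduction_factor:.0f}x fewer parameters")
```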

5. Multi-Domain Adaptation Mechanism

Transitioning from multi-task to multi-domain settings, each domain is treated as a "task," and domain prototypes are learned from in-domain (either labeled or unlabeled) data with the same InfoNCE and prototypical contrastive losses:

  • Prototype formation: Domain prototypes $k_i$ capture the key characteristics of each domain.
  • Domain retrieval: For unlabeled or mixed-domain data, domain assignment at inference is achieved by mapping each instance into the prototype space for retrieval and selection.
  • Top-K prototype mixture: For instances spanning multiple domains, a weighted sum of the top-K prototypes can be fed to the hypernetwork.
  • Continuous/online domain shift: Prototypes can be updated online to reflect emerging domains via a maintained buffer and small learning rate.

Benefits include: (1) maximal reuse of cross-domain knowledge in the hypernetwork, (2) almost negligible per-domain storage (one vector per domain), and (3) immediate adaptation with no backbone fine-tuning.
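The mixture and online-update mechanisms listed above can be sketched as follows. The softmax weighting over cosine similarities, the temperature, and the moving-average update rule are assumptions chosen for this example; the source only states that a weighted sum of top-K prototypes is used and that prototypes are updated from a buffer with a small learning rate.

```python
import numpy as np


def mix_top_k(z, prototypes, k=2, temperature=0.1):
    """Blend the k nearest prototypes into one conditioning vector.

    Weights are a softmax over cosine similarities (an assumed weighting scheme).
    """
    z_unit = z / np.linalg.norm(z)
    p_unit = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = p_unit @ z_unit
    top = np.argsort(-sims)[:k]
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ prototypes[top], top, weights


def online_update(prototype, buffer, lr=0.05):
    """Nudge a prototype toward the mean of recently buffered instance embeddings."""
    return (1.0 - lr) * prototype + lr * np.mean(buffer, axis=0)


domain_prototypes = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
mixed, chosen, mix_weights = mix_top_k(np.array([0.8, 0.6]), domain_prototypes, k=2)
updated = online_update(domain_prototypes[0], np.array([[0.0, 1.0], [0.0, 1.0]]), lr=0.1)
print(chosen, mix_weights, mixed, updated)
```

The mixed vector would then be passed to the hypernetwork in place of a single prototype, so a hybrid-domain instance receives an adapter interpolated between its closest domains.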

6. Training Objectives and Sample Efficiency

The end-to-end training objective for PHA-based MDAPT is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,(\mathcal{L}_{\mathrm{IR}} + \mathcal{L}_{\mathrm{Pro}})$$

where $\mathcal{L}_{\mathrm{CE}}$ is the supervised cross-entropy loss, $\mathcal{L}_{\mathrm{IR}}$ is the retriever contrastive loss, $\mathcal{L}_{\mathrm{Pro}}$ is the prototypical embedding loss, and $\lambda$ (default 0.1) balances sample efficiency and representation quality.

Empirically, PHA yields state-of-the-art robustness and efficiency in low-data settings:

  • With 100 training samples per task, PHA attains 80% accuracy vs. 72% for classic adapters and 70% for Hyperformer.
  • For few-shot transfer, PHA achieves 68–88% accuracy (a 3–20% absolute improvement over baselines).
  • As the available data declines from full to 1%, PHA maintains a 5–10% absolute improvement in downstream metrics (Zhao et al., 2023).

7. Practical Considerations and Extensions

Key design decisions and observations for adapter-based parameter-efficient MDAPT via PHA include:

  • Storage and deployment: Models require only the frozen backbone, the hypernetwork, and a prototype vector per domain; domain adaptation is a matter of swapping in the appropriate prototype.
  • Mixed-domain and online settings: Top-K prototype mixtures enable adaptation to samples with ambiguous or hybrid domain membership; buffer-based prototype updating supports streaming or evolving domains.
  • Training and convergence: PHA achieves 10–15% faster convergence and 5–7% absolute accuracy improvements relative to adapters or domain-specific full fine-tuning in sentiment classification domains, while the only per-domain parameters are a single prototype vector.

PHA-based MDAPT delivers a system in which (a) adapters are not explicitly stored per domain—a fundamental scalability improvement, (b) hypernetwork-based parameterization allows fully on-the-fly instantiation of domain adapters, and (c) sample efficiency is enhanced by prototype-driven retrieval and loss design.

References

  • Prototype-based HyperAdapter for Sample-Efficient Multi-task Tuning (Zhao et al., 2023)