
MDAPT with Prototype-Based HyperAdapters

Updated 11 December 2025
  • The paper introduces Prototype-Based HyperAdapters to generate adapters on-the-fly, drastically reducing per-task storage while achieving high performance.
  • It employs instance-dense retrieval and contrastive losses to cluster and refine domain prototypes for effective multi-domain adaptation.
  • Empirical results show that the method maintains robust performance with just 616K parameters versus 220M in full fine-tuning, excelling in low-data settings.

Adapter-Based Parameter-Efficient MDAPT—Prototype-Based HyperAdapters

Adapter-based parameter-efficient Multi-Domain AdapTation (MDAPT) using Prototype-Based HyperAdapters (PHA) is a method for efficiently adapting pre-trained language models (PLMs) to a growing set of distinct domains or tasks. It replaces exhaustive model fine-tuning with a small, dynamically generated set of trainable parameters produced by a hypernetwork and selected via prototype-driven retrieval, yielding strong generalization and sample efficiency, especially in low-data and multi-domain settings (Zhao et al., 2023).

1. Architectural Foundations of Prototype-Based HyperAdapters

PHA extends classical adapter-tuning by inserting lightweight bottleneck adapters into every transformer layer, but avoids per-task storage and retraining by generating these adapters on-the-fly through a shared prototypical hypernetwork. For PLMs such as T5-Base ($\theta \in \mathbb{R}^{220\text{M}}$), the backbone (encoder and decoder) remains frozen, while each layer $m$ contains a two-layer adapter $A^m$ composed of:

  • Down-projection $U^m \in \mathbb{R}^{b \times d}$
  • Nonlinearity (ReLU)
  • Up-projection $D^m \in \mathbb{R}^{d \times b}$

Here, $d$ is the hidden size (e.g., 768) and $b$ is the adapter bottleneck width ($b \ll d$). Adapter weights per layer and task/domain are not stored explicitly; instead, they are generated by a single hypernetwork $H_w$ that takes as input a learned task/domain prototype $k_i \in \mathbb{R}^{d'}$ and a layer embedding $e_m$.

This design drastically reduces the parameter count required as the number of tasks/domains grows: total storage scales as $O(\tau d' + L d' + |H|)$ (prototypes, layer embeddings, hypernetwork), compared to $O(\tau L b d)$ for separate per-task/domain adapters, where $\tau$ is the number of domains/tasks and $L$ the number of transformer layers.
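The scaling claim can be checked with simple arithmetic. The sketch below uses illustrative sizes (roughly T5-Base-like $d = 768$, $b = 24$, $L = 12$, $d' = 128$; these are assumptions for illustration, not figures from the paper) to compare the marginal storage cost of adding one new task or domain under each scheme:

```python
# Marginal storage cost of adding one new task/domain.
# All sizes below are illustrative assumptions (roughly T5-Base-like).
d = 768        # hidden size
b = 24         # adapter bottleneck width
L = 12         # number of transformer layers
d_proto = 128  # prototype dimension d'

# Separate per-task adapters: a fresh down- (b x d) and up- (d x b)
# projection in every layer for each new task.
per_task_adapter_cost = L * 2 * b * d

# PHA: the hypernetwork and layer embeddings are shared across tasks,
# so a new task costs only one prototype vector.
pha_cost = d_proto

print(per_task_adapter_cost)  # 442368
print(pha_cost)               # 128
```

Each additional domain costs hundreds of thousands of parameters under separate adapters, but only a single $d'$-dimensional vector under PHA.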

2. Instance-Dense Retrieval and Prototype Learning

Task/domain prototypes are discovered and refined via an instance-dense retriever:

  • Each instance $x$ is mapped to a latent embedding $z = G(h)$, with $h = \mathrm{Encoder}_\theta(x)$ and $G$ a dense MLP.
  • Retrieval vectors are supervised by the InfoNCE contrastive loss $L_\mathrm{IR}$, which pulls together instance vectors from the same task/domain in latent space and repels vectors from different tasks/domains:

$$L_\mathrm{IR}^i = -\frac{1}{N_i - 1} \sum_{z_j \in \tilde{D}_i} \log \frac{\exp(f(z_i, z_j))}{\sum_{z_m \in S(i)} \exp(f(z_i, z_m))}$$

with $f(a, b) = \cos(a, b)$.

  • Prototype vectors $k_i$ are further optimized via a separate prototypical contrastive loss $L_\mathrm{Pro}$ to enhance their representativeness and discriminability.
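As a minimal sketch, the per-anchor InfoNCE term can be written in NumPy; the function names and the toy batch below are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def cos_sim(a, B):
    """Cosine similarity between vector a and each row of B."""
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B @ a

def info_nce(z_i, positives, candidates):
    """Per-anchor L_IR: `positives` are same-task/domain embeddings
    (the set D~_i); `candidates` is the full comparison set S(i)."""
    denom = np.exp(cos_sim(z_i, candidates)).sum()
    pos = np.exp(cos_sim(z_i, positives))
    return -np.mean(np.log(pos / denom))

rng = np.random.default_rng(0)
z = rng.normal(size=8)
same = z + 0.05 * rng.normal(size=(2, 8))   # near-duplicates: same domain
other = rng.normal(size=(5, 8))             # random: other domains
loss_aligned = info_nce(z, same, np.vstack([same, other]))
loss_random = info_nce(z, other[:2], np.vstack([same, other]))
```

Embeddings from the anchor's own domain produce a lower loss than unrelated ones, which is exactly the clustering pressure the retriever relies on.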

At inference, a new instance is mapped into the same latent space, and the nearest (by cosine similarity) or top-$K$ prototypes are selected to condition the hypernetwork, allowing rapid domain selection even in the absence of explicit domain labels.
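Prototype selection at inference thus reduces to a cosine nearest-neighbor lookup over the prototype table; a minimal sketch (function name and toy prototypes assumed):

```python
import numpy as np

def retrieve(z, prototypes, top_k=1):
    """Indices of the top-k prototypes by cosine similarity to z."""
    z = z / np.linalg.norm(z)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return np.argsort(-(P @ z))[:top_k]

# Three toy domain prototypes; the instance embedding is closest to the third.
protos = np.eye(3)
z = np.array([0.1, 0.2, 0.9])
print(retrieve(z, protos))  # [2]
```

No domain label is needed at test time: the instance's own embedding decides which prototype (and therefore which generated adapter) is used.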

3. Adapter Parameter Generation via Prototypical Hypernetwork

  • Each stabilized prototype $k_i$ is concatenated with a layer embedding $e_m$ and projected before being passed to the hypernetwork $H_w$ to yield domain- and layer-specific adapter parameters $(D_i^m, U_i^m)$.
  • The layer output is modified as:

$$y = \mathrm{FFN}(\mathrm{LN}(x)) + A_i^m(\mathrm{LN}(x)), \qquad A_i^m(x) = D_i^m\,\mathrm{ReLU}(U_i^m x) + x$$

  • Only $H_w$, the set of prototypes $\{k_i\}$, and the layer embeddings $\{e_m\}$ are stored; the full adapter parameter tensors are generated at run-time.
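A toy NumPy sketch of the generation step follows; the sizes and the single-linear-layer hypernetwork are simplifying assumptions (the paper's $H_w$ may be deeper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, b, d_proto = 16, 4, 8  # toy hidden, bottleneck, and prototype sizes

# Shared hypernetwork H_w: one linear map from [k_i ; e_m] to the
# flattened adapter weights (U_i^m, D_i^m). Only W_h is stored.
W_h = rng.normal(scale=0.02, size=(2 * b * d, 2 * d_proto))

def generate_adapter(k_i, e_m):
    """Generate layer-/domain-specific adapter weights from prototype k_i
    and layer embedding e_m; nothing per-domain is stored."""
    out = W_h @ np.concatenate([k_i, e_m])
    U = out[: b * d].reshape(b, d)    # down-projection U_i^m
    D = out[b * d :].reshape(d, b)    # up-projection D_i^m
    return U, D

def adapter_forward(x, U, D):
    """A_i^m(x) = D_i^m ReLU(U_i^m x) + x (residual bottleneck)."""
    return D @ np.maximum(U @ x, 0.0) + x

k_i = rng.normal(size=d_proto)  # domain prototype
e_m = rng.normal(size=d_proto)  # layer embedding
U, D = generate_adapter(k_i, e_m)
y = adapter_forward(rng.normal(size=d), U, D)
print(U.shape, D.shape, y.shape)  # (4, 16) (16, 4) (16,)
```

Swapping `k_i` for another domain's prototype regenerates a different adapter from the same stored weights `W_h`.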

4. Parameter and Storage Efficiency

The crucial efficiency properties are as follows:

| Approach | Trainable Params | % of Full Fine-Tune (T5-Base) | Avg. GLUE+SuperGLUE Score |
|---|---|---|---|
| Full fine-tune | 220M | 100% | 84.9% |
| Standard adapters | 1.9M | 0.86% | 84.5% |
| Hyperformer++ | 638K | 0.29% | 84.7% |
| HyperDecoder | 1.8M | 0.82% | 83.7% |
| PHA (MDAPT, 12 tasks) | 616K | 0.28% | 85.5% |

At scale (e.g., $\tau = 12$ tasks/domains, $L = 12$ layers, $d' = 128$), PHA requires only ~616K trainable parameters, a 0.28% fraction of full fine-tuning.

5. Multi-Domain Adaptation Mechanism

Transitioning from multi-task to multi-domain settings, each domain is treated as a "task," and domain prototypes are learned from in-domain (either labeled or unlabeled) data with the same InfoNCE and prototypical contrastive losses:

  • Prototype formation: Domain prototypes $p_d$ capture the key characteristics of each domain.
  • Domain retrieval: For unlabeled or mixed-domain data, domain assignment at inference is achieved by mapping each instance into the prototype space for retrieval and selection.
  • Top-$K$ prototype mixture: For instances spanning multiple domains, a weighted sum of the top-$K$ prototypes can be fed to the hypernetwork.
  • Continuous/online domain shift: Prototypes can be updated online to reflect emerging domains via a maintained buffer and small learning rate.
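The top-$K$ mixture step above can be sketched as a softmax-weighted combination of the nearest prototypes; the temperature and function name are assumptions for illustration:

```python
import numpy as np

def mixed_prototype(z, prototypes, top_k=2, temp=0.5):
    """Softmax-weighted mixture of the top-k prototypes nearest to z,
    for instances with ambiguous or hybrid domain membership."""
    z_n = z / np.linalg.norm(z)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = P @ z_n
    idx = np.argsort(-sims)[:top_k]          # top-k by cosine similarity
    w = np.exp(sims[idx] / temp)
    w = w / w.sum()                          # softmax over the top-k
    return w @ prototypes[idx]

# An instance equally close to domains 0 and 1 gets an even mixture.
protos = np.eye(3)
z = np.array([1.0, 1.0, 0.0])
mix = mixed_prototype(z, protos)  # even blend of prototypes 0 and 1
```

The blended vector is then fed to the hypernetwork exactly like a single prototype, so hybrid-domain inputs need no special-case machinery downstream.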

Benefits include: (1) maximal reuse of cross-domain knowledge in the hypernetwork, (2) almost negligible per-domain storage (one vector per domain), and (3) immediate adaptation with no backbone fine-tuning.

6. Training Objectives and Sample Efficiency

The end-to-end training objective for PHA-based MDAPT is:

$$L_\mathrm{Total} = L_\mathrm{PLM} + \lambda \left( L_\mathrm{IR} + L_\mathrm{Pro} \right)$$

where $L_\mathrm{PLM}$ is the supervised cross-entropy loss, $L_\mathrm{IR}$ is the retriever contrastive loss, $L_\mathrm{Pro}$ is the prototypical embedding loss, and $\lambda$ (default 0.1) balances task supervision against representation quality.
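Concretely, the objective composes as a weighted sum; the numeric values below are made up purely for illustration:

```python
def total_loss(l_plm, l_ir, l_pro, lam=0.1):
    """L_Total = L_PLM + lambda * (L_IR + L_Pro), lambda defaulting to 0.1."""
    return l_plm + lam * (l_ir + l_pro)

# With a task loss of 2.0 and contrastive losses of 0.5 and 0.3,
# the contrastive terms contribute only 0.08 to the total.
loss = total_loss(2.0, 0.5, 0.3)
```

The small default $\lambda$ keeps the contrastive terms as a regularizer rather than letting them dominate the task objective.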

Empirically, PHA yields state-of-the-art robustness and efficiency in low-data settings:

  • On 100 samples/task regimes, PHA attains 80% accuracy vs. 72% for classic adapters and 70% for Hyperformer.
  • For few-shot transfer ($k = 4, 16, 32$), PHA achieves 68–88% accuracy (+3–20% absolute improvement over baselines).
  • As the available data declines from full to 1%, PHA maintains a 5–10% absolute improvement in downstream metrics (Zhao et al., 2023).

7. Practical Considerations and Extensions

Key design decisions and observations for adapter-based parameter-efficient MDAPT via PHA include:

  • Storage and deployment: Models require only the frozen backbone, the hypernetwork, and a prototype vector per domain; domain adaptation is a matter of swapping in the appropriate prototype.
  • Mixed-domain and online settings: Top-K prototype mixtures enable adaptation to samples with ambiguous or hybrid domain membership; buffer-based prototype updating supports streaming or evolving domains.
  • Training and convergence: PHA achieves 10–15% faster convergence and 5–7% absolute accuracy improvements relative to adapters or domain-specific full fine-tuning on sentiment classification domains, while updating $<1\%$ of parameters per domain.

PHA-based MDAPT delivers a system in which (a) adapters are not explicitly stored per domain—a fundamental scalability improvement, (b) hypernetwork-based parameterization allows fully on-the-fly instantiation of domain adapters, and (c) sample efficiency is enhanced by prototype-driven retrieval and loss design.

References

  • Prototype-based HyperAdapter for Sample-Efficient Multi-task Tuning (Zhao et al., 2023)