MDAPT with Prototype-Based HyperAdapters
- The paper introduces Prototype-Based HyperAdapters to generate adapters on-the-fly, drastically reducing per-task storage while achieving high performance.
- It employs instance-dense retrieval and contrastive losses to cluster and refine domain prototypes for effective multi-domain adaptation.
- Empirical results show that the method maintains robust performance with just 616K parameters versus 220M in full fine-tuning, excelling in low-data settings.
Adapter-Based Parameter-Efficient MDAPT: Prototype-Based HyperAdapters
Adapter-based parameter-efficient Multi-Domain AdapTation (MDAPT) using Prototype-Based HyperAdapters (PHA) is a method for efficiently adapting pre-trained language models (PLMs) to a growing set of distinct domains or tasks. It replaces exhaustive model fine-tuning with a small, dynamically generated set of trainable parameters, produced by a hypernetwork conditioned on prototype-driven retrieval, yielding strong generalization and sample efficiency, especially in low-data and multi-domain settings (Zhao et al., 2023).
1. Architectural Foundations of Prototype-Based HyperAdapters
PHA extends classical adapter-tuning by inserting lightweight bottleneck adapters into every transformer layer, but avoids per-task storage and retraining by generating these adapters on-the-fly through a shared prototypical hypernetwork. For PLMs like T5-Base (220M parameters), the backbone (encoder, decoder) remains frozen, while each layer contains a two-layer adapter composed of:
- Down-projection $W_{\text{down}} \in \mathbb{R}^{b \times d}$
- Nonlinearity (ReLU)
- Up-projection $W_{\text{up}} \in \mathbb{R}^{d \times b}$
Here, $d$ is the hidden size (e.g., 768) and $b$ is the adapter bottleneck width ($b \ll d$). Adapter weights per layer and task/domain are not stored explicitly; instead, they are generated by a single hypernetwork $H$ which takes as input a learned task/domain prototype $p_t$ and a layer embedding $e_l$.
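A minimal NumPy sketch of one bottleneck adapter's forward pass (sizes and initialization are illustrative, and the bottleneck width is an assumed value; in PHA these weights would be emitted by the hypernetwork rather than stored):

```python
import numpy as np

d, b = 768, 24   # hidden size; assumed bottleneck width
rng = np.random.default_rng(0)

# Placeholder adapter weights (transposed shapes for the row-vector convention).
# In PHA they come from the hypernetwork instead of being stored.
W_down = rng.normal(0.0, 0.02, size=(d, b))   # down-projection
W_up = np.zeros((b, d))                       # zero-init up-projection: adapter starts as identity

def adapter(h):
    """Residual bottleneck adapter: h + ReLU(h @ W_down) @ W_up."""
    return h + np.maximum(h @ W_down, 0.0) @ W_up

h = rng.normal(size=(4, d))   # a batch of 4 hidden states
out = adapter(h)
print(out.shape)              # (4, 768)
```

With the up-projection zero-initialized, the adapter is an exact identity at the start of training, a common trick to keep the frozen backbone's behavior intact initially.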
This design drastically reduces the parameter count required as the number of tasks/domains grows: total storage scales as $O(T + L + |\theta_H|)$ (one prototype per task/domain, one embedding per layer, plus the shared hypernetwork parameters $\theta_H$), compared to $O(T \cdot L \cdot d \cdot b)$ for separate per-task/domain adapters, where $T$ is the number of domains/tasks and $L$ the number of transformer layers.
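To make the scaling concrete, a back-of-the-envelope count (the bottleneck width, prototype dimensionality, and layer count are assumed example values; only the T5-Base hidden size appears in the text):

```python
d, b = 768, 24      # T5-Base hidden size; assumed bottleneck width
T, L = 12, 24       # tasks/domains; T5-Base encoder + decoder layers

# Storing every adapter explicitly: two projection matrices per (task, layer) pair.
per_task_storage = T * L * 2 * d * b
print(f"{per_task_storage:,} weights")    # 10,616,832 weights

# PHA: adding one more domain costs only one new prototype vector,
# since the hypernetwork and layer embeddings are shared across all domains.
proto_dim = 64                             # assumed prototype dimensionality
print(f"{proto_dim} extra weights per new domain")
```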
2. Instance-Dense Retrieval and Prototype Learning
Task/domain prototypes are discovered and refined via an instance-dense retriever:
- Each instance $x_i$ is mapped to a latent embedding $z_i = f_\phi(\bar{h}_i)$, with $\bar{h}_i$ the (mean-pooled) frozen-encoder representation of $x_i$ and $f_\phi$ a dense MLP.
- Retrieval vectors are supervised by the InfoNCE contrastive loss, which pulls instance vectors from the same task/domain together in latent space while repelling vectors from different tasks/domains:

$$\mathcal{L}_{\text{InfoNCE}} = -\sum_{i} \log \frac{\exp(\mathrm{sim}(z_i, z_{i^+})/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

with $z_{i^+}$ an embedding from the same task/domain as $z_i$, $\mathrm{sim}(\cdot,\cdot)$ cosine similarity, and $\tau$ a temperature hyperparameter.
- Prototype vectors are further optimized via a separate prototypical contrastive loss to enhance their representativeness and discriminability.
At inference, a new instance is mapped into the same latent space, and the nearest prototype (by cosine similarity) or the top-$K$ prototypes are selected to condition the hypernetwork, enabling rapid domain selection even in the absence of explicit domain labels.
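The retrieval step above can be sketched as a cosine-similarity lookup against a small prototype table (all sizes and values here are synthetic):

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Assumed setup: 3 learned domain prototypes in a 64-dim latent space.
rng = np.random.default_rng(1)
prototypes = normalize(rng.normal(size=(3, 64)))

def retrieve(z, prototypes, k=1):
    """Return indices of the top-k prototypes by cosine similarity."""
    sims = normalize(z) @ prototypes.T        # cosine similarities, shape (3,)
    return np.argsort(-sims)[:k], sims

# An unlabeled instance whose embedding lies near prototype 2 (simulated).
z = prototypes[2] + 0.05 * rng.normal(size=64)
idx, sims = retrieve(z, prototypes, k=1)
print(idx[0])   # 2: the correct domain is selected without a domain label
```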
3. Adapter Parameter Generation via Prototypical Hypernetwork
- Each stabilized prototype $p_t$ is concatenated with a layer embedding $e_l$ and projected before being passed to the hypernetwork $H$ to yield domain- and layer-specific adapter parameters $(W_{\text{down}}^{(t,l)}, W_{\text{up}}^{(t,l)}) = H([p_t; e_l])$.
- The layer output is modified as:

$$h \leftarrow h + W_{\text{up}}^{(t,l)}\,\mathrm{ReLU}(W_{\text{down}}^{(t,l)}\, h)$$

- Only $H$, the set of prototypes $\{p_t\}$, and the layer embeddings $\{e_l\}$ are stored; the full adapter parameter tensors are generated at run-time.
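As a sketch of run-time generation, a single linear map can stand in for the paper's projection-plus-hypernetwork (all dimensions are assumed example values, and a real hypernetwork would be an MLP):

```python
import numpy as np

d, b, p_dim, e_dim = 768, 24, 64, 32   # hidden, bottleneck, prototype, layer-emb sizes (assumed)
rng = np.random.default_rng(0)

# Shared hypernetwork: maps [prototype; layer embedding] to flattened adapter weights.
H = rng.normal(0.0, 0.01, size=(p_dim + e_dim, 2 * d * b))

def generate_adapter(prototype, layer_emb):
    """Emit (W_down, W_up) for one (domain, layer) pair at run-time."""
    flat = np.concatenate([prototype, layer_emb]) @ H
    W_down = flat[: d * b].reshape(d, b)   # transposed shapes, row-vector convention
    W_up = flat[d * b :].reshape(b, d)
    return W_down, W_up

p_t = rng.normal(size=p_dim)           # a domain prototype
e_l = rng.normal(size=e_dim)           # a layer embedding
W_down, W_up = generate_adapter(p_t, e_l)
h = rng.normal(size=(d,))
h_adapted = h + np.maximum(h @ W_down, 0.0) @ W_up   # modified layer output
print(W_down.shape, W_up.shape)        # (768, 24) (24, 768)
```

Nothing per-domain is stored except `p_t`; the same `H` serves every domain and layer.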
4. Parameter and Storage Efficiency
The crucial efficiency properties are as follows:
| Approach | Trainable Params | % of Full Fine-Tuning (T5-Base) | Avg. GLUE + SuperGLUE Score |
|---|---|---|---|
| Full fine-tune | 220M | 100% | 84.9% |
| Standard adapters | 1.9M | 0.86% | 84.5% |
| Hyperformer++ | 638K | 0.29% | 84.7% |
| HyperDecoder | 1.8M | 0.82% | 83.7% |
| PHA (MDAPT, 12 tasks) | 616K | 0.28% | 85.5% |
At scale (e.g., $T = 12$ tasks/domains over the $L = 24$ transformer layers of a frozen T5-Base, hidden size $d = 768$), PHA requires only 616K trainable parameters, a 0.28% fraction of full fine-tuning.
5. Multi-Domain Adaptation Mechanism
Transitioning from multi-task to multi-domain settings, each domain is treated as a "task," and domain prototypes are learned from in-domain (either labeled or unlabeled) data with the same InfoNCE and prototypical contrastive losses:
- Prototype formation: Domain prototypes capture the key characteristics of each domain.
- Domain retrieval: For unlabeled or mixed-domain data, domain assignment at inference is achieved by mapping each instance into the prototype space for retrieval and selection.
- Top-K prototype mixture: For instances spanning multiple domains, a weighted sum of the top-K prototypes can be fed to the hypernetwork.
- Continuous/online domain shift: Prototypes can be updated online to reflect emerging domains via a maintained buffer and small learning rate.
Benefits include: (1) maximal reuse of cross-domain knowledge in the hypernetwork, (2) almost negligible per-domain storage (one vector per domain), and (3) immediate adaptation with no backbone fine-tuning.
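The top-K prototype mixture can be sketched as a similarity-weighted combination (K, dimensions, and values are assumed for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(2)
prototypes = rng.normal(size=(5, 64))   # 5 domain prototypes (synthetic)
z = rng.normal(size=64)                 # an ambiguous, possibly mixed-domain instance

# Cosine similarity to every prototype, keeping the K = 2 closest.
sims = unit(z) @ unit(prototypes).T
top_k = np.argsort(-sims)[:2]

weights = softmax(sims[top_k])          # similarity-weighted mixture coefficients
p_mix = weights @ prototypes[top_k]     # this blended prototype conditions the hypernetwork
print(p_mix.shape)                      # (64,)
```

For a streaming setting, the same table of `prototypes` could be updated in place from a buffer of recent embeddings with a small learning rate, matching the online-shift point above.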
6. Training Objectives and Sample Efficiency
The end-to-end training objective for PHA-based MDAPT is:

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda\,(\mathcal{L}_{\text{InfoNCE}} + \mathcal{L}_{\text{proto}})$$

where $\mathcal{L}_{\text{CE}}$ is the supervised cross-entropy loss, $\mathcal{L}_{\text{InfoNCE}}$ is the retriever contrastive loss, $\mathcal{L}_{\text{proto}}$ is the prototypical embedding loss, and $\lambda$ (default 0.1) balances sample efficiency and representation quality.
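A toy version of the retriever's InfoNCE term (the cross-entropy and prototype losses are stubbed as constants, and the batch is synthetic; this is a simplified in-batch formulation, not the paper's exact implementation):

```python
import numpy as np

def info_nce(z, labels, tau=0.1):
    """Simplified in-batch InfoNCE: same-label pairs are positives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sims = z @ z.T / tau
    np.fill_diagonal(sims, -np.inf)           # exclude self-pairs (exp(-inf) = 0)
    loss, n = 0.0, 0
    for i in range(len(z)):
        pos = (labels == labels[i]) & (np.arange(len(z)) != i)
        if not pos.any():
            continue
        log_denom = np.log(np.exp(sims[i]).sum())
        loss += -(sims[i][pos] - log_denom).mean()   # -log softmax over the batch
        n += 1
    return loss / n

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                  # toy instance embeddings
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # task/domain ids
ce, lam = 1.7, 0.1                            # placeholder task loss and weight
total = ce + lam * info_nce(z, labels)        # + lam * L_proto in the full objective
print(total)
```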
Empirically, PHA yields state-of-the-art robustness and efficiency in low-data settings:
- In 100-samples-per-task regimes, PHA attains 80% accuracy vs. 72% for classic adapters and 70% for Hyperformer.
- For few-shot transfer, PHA achieves 68–88% accuracy (a +3–20% absolute improvement over baselines).
- As the available data shrinks from the full training set to 1%, PHA maintains a 5–10% absolute improvement in downstream metrics (Zhao et al., 2023).
7. Practical Considerations and Extensions
Key design decisions and observations for adapter-based parameter-efficient MDAPT via PHA include:
- Storage and deployment: Models require only the frozen backbone, the hypernetwork, and a prototype vector per domain; domain adaptation is a matter of swapping in the appropriate prototype.
- Mixed-domain and online settings: Top-K prototype mixtures enable adaptation to samples with ambiguous or hybrid domain membership; buffer-based prototype updating supports streaming or evolving domains.
- Training and convergence: PHA achieves 10–15% faster convergence and 5–7% absolute accuracy improvements relative to adapters or domain-specific full fine-tuning in sentiment classification domains, while reducing the per-domain parameter footprint to a single prototype vector.
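Deployment then amounts to maintaining a prototype registry (domain names and the vector size here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# One 64-dim prototype per deployed domain; the frozen backbone and the
# hypernetwork are shared across all entries.
registry = {"reviews": rng.normal(size=64), "biomed": rng.normal(size=64)}

def adapt_to(domain):
    """'Swapping in' a domain = selecting its prototype; no backbone weights change."""
    return registry[domain]

# Onboarding a new domain stores exactly one vector
# (e.g., the mean of its instance embeddings).
registry["legal"] = rng.normal(size=(100, 64)).mean(axis=0)
print(len(registry), adapt_to("legal").shape)   # 3 (64,)
```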
PHA-based MDAPT delivers a system in which (a) adapters are not explicitly stored per domain—a fundamental scalability improvement, (b) hypernetwork-based parameterization allows fully on-the-fly instantiation of domain adapters, and (c) sample efficiency is enhanced by prototype-driven retrieval and loss design.
References
- Prototype-based HyperAdapter for Sample-Efficient Multi-task Tuning (Zhao et al., 2023)