Mixture-of-Prompts Distillation (MoPD)
- The paper introduces MoPD, a novel framework that transfers class-discriminative knowledge via gating-based distillation to mitigate seen-class overfitting in soft prompt learning.
- MoPD integrates a composite loss (cross-entropy, MPD, and MPS) and dynamic hard prompt selection to enhance model generalization on unseen classes.
- Experimental results across 11 datasets show that MoPD outperforms baseline methods like CoOp and CoCoOp in base-to-new, few-shot, and domain generalization tasks.
Mixture-of-Prompts Distillation (MoPD) introduces a knowledge transfer framework for soft prompt learning in vision-language models (VLMs). Targeting the persistent issue of seen-class overfitting in soft prompt adaptation, MoPD utilizes a gating-based distillation procedure to transfer knowledge from a curated mixture of hard, hand-crafted prompts (“teacher prompts”) to a learnable soft prompt (“student prompt”). This mechanism demonstrably improves generalization to unseen classes and outperforms state-of-the-art baseline methods in standard and challenging vision-language benchmarks (Chen et al., 2024).
1. Motivation and Problem Statement
Vision-language models like CLIP exhibit notable zero-shot classification performance when prompted by carefully designed hard prompts (e.g., “a photo of a [CLASS]”). Soft prompt learning methods, such as CoOp and CoCoOp, instead employ trainable prompt vectors but are prone to overfitting to base (“seen”) classes, resulting in limited transferability to novel (“unseen”) classes. This overfitting is attributed to inherent bias in few-shot training data distributions, which predominantly feature base classes. MoPD addresses this limitation by enabling effective transfer of class-discriminative knowledge from a diverse set of hand-crafted prompts to the learnable soft prompt via a controlled, sample-specific distillation process, thereby enhancing generalization performance (Chen et al., 2024).
2. Architecture and Workflow
The MoPD architecture introduces four major components:
- Pre-trained CLIP backbone: The image and text encoders are kept frozen throughout adaptation.
- Hard prompt pool: A set of $N$ hard prompts, denoted $\mathcal{P} = \{p_1, \dots, p_N\}$ and typically obtained via template or synonym variation; each prompt $p_i$ induces a class embedding $w_i^c$ through the CLIP text encoder for each class $c$.
- Soft prompt: A learnable sequence of prompt vectors, prepended to each class name and encoded to produce the class embedding $w_s^c$.
- Gating network $G$: A single fully-connected layer with weights $W_g$, taking an image embedding $x$ as input and selecting a sample-specific mixture of hard prompts using a sparse softmax over the top-$k$ gating scores.
The overall forward pass for an image $I$ with label $y$ proceeds as follows:
- Extract the image embedding $x = f(I)$ with the frozen image encoder;
- Compute soft-prompt class probabilities $p_s(y=c \mid x)$ via cosine similarity between $x$ and the class embeddings $w_s^c$, with temperature $\tau$;
- For each teacher prompt $p_i$, compute $p_i(y=c \mid x)$ similarly;
- Compute gating weights $g_i(x)$ for the top-$k$ hard prompts;
- Evaluate the losses and backpropagate into the soft prompt and gating network only, with frozen encoders (Chen et al., 2024).
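The scoring steps above can be sketched in NumPy. The shapes, helper names (`class_probs`, `softmax`), and toy dimensions are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def class_probs(img_emb, class_embs, tau=0.01):
    """Class distribution from temperature-scaled cosine similarity (CLIP-style scoring).

    img_emb:    (d,)   image embedding
    class_embs: (C, d) one text embedding per class
    """
    img = img_emb / np.linalg.norm(img_emb)
    txt = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return softmax(txt @ img / tau)

# Toy shapes: C = 5 classes, d = 8 dims, N = 4 hard prompts.
rng = np.random.default_rng(0)
x = rng.normal(size=8)                    # frozen image-encoder output
w_soft = rng.normal(size=(5, 8))          # soft-prompt class embeddings
w_hard = rng.normal(size=(4, 5, 8))       # one (C, d) embedding table per hard prompt

p_soft = class_probs(x, w_soft)                          # student distribution
p_hard = np.stack([class_probs(x, w) for w in w_hard])   # teacher distributions
```

Both the student and each teacher produce a full distribution over classes, which is what the distillation and selection losses later consume.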
3. Mathematical Formulation
Let $p_s(y=c \mid x)$ and $p_i(y=c \mid x)$ denote the predicted distributions over class $c$ using the soft prompt and the $i$-th hard prompt, respectively:

$$p_s(y=c \mid x) = \frac{\exp(\cos(x, w_s^c)/\tau)}{\sum_{c'=1}^{C} \exp(\cos(x, w_s^{c'})/\tau)}, \qquad p_i(y=c \mid x) = \frac{\exp(\cos(x, w_i^c)/\tau)}{\sum_{c'=1}^{C} \exp(\cos(x, w_i^{c'})/\tau)}.$$
The composite MoPD loss combines three terms:
- Cross-entropy on base classes: $\mathcal{L}_{ce} = -\log p_s(y \mid x)$
- Mixture-of-prompts distillation (MPD) loss, a KL divergence from the gated teacher mixture to the student: $\mathcal{L}_{mpd} = \mathrm{KL}\!\left(\sum_{i} g_i(x)\, p_i(\cdot \mid x) \,\middle\|\, p_s(\cdot \mid x)\right)$
- Mixture-of-prompts selection (MPS) loss, a cross-entropy on the gated mixture that trains the gate to favor prompts that classify the sample correctly: $\mathcal{L}_{mps} = -\log \sum_{i} g_i(x)\, p_i(y \mid x)$
The full objective is:

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{mpd} + \lambda_2 \mathcal{L}_{mps},$$

where $\lambda_1, \lambda_2 > 0$ are empirically selected trade-off coefficients (Chen et al., 2024).
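The composite objective can be sketched per-sample in NumPy. This is a hedged sketch: it assumes the MPD term is a KL divergence from the gated teacher mixture to the student and the MPS term is a cross-entropy on that mixture; the helper name `mopd_loss` is illustrative:

```python
import numpy as np

def mopd_loss(p_soft, p_hard, gate, y, lam1=1.0, lam2=1.0, eps=1e-12):
    """Composite MoPD objective on one sample (hypothetical helper).

    p_soft: (C,)   student distribution from the soft prompt
    p_hard: (N, C) teacher distributions, one per hard prompt
    gate:   (N,)   sparse mixture weights (nonzero only for top-k, sum to 1)
    y:      int    ground-truth class index
    """
    p_mix = gate @ p_hard                            # gated teacher mixture
    l_ce = -np.log(p_soft[y] + eps)                  # cross-entropy on base classes
    l_mpd = np.sum(p_mix * np.log((p_mix + eps) / (p_soft + eps)))  # KL(mix || student)
    l_mps = -np.log(p_mix[y] + eps)                  # selection loss on the mixture
    return l_ce + lam1 * l_mpd + lam2 * l_mps

# Toy example: 3 classes, 2 active teacher prompts, true label 0.
p_s = np.array([0.7, 0.2, 0.1])
p_t = np.array([[0.6, 0.3, 0.1],
                [0.5, 0.3, 0.2]])
g = np.array([0.5, 0.5])
loss = mopd_loss(p_s, p_t, g, y=0)
```

Because the encoders are frozen, only the soft-prompt vectors (which produce `p_soft`) and the gating weights (which produce `gate`) would receive gradients from this loss.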
Mixture weights are generated by applying a softmax to only the top-$k$ entries of the gating network’s outputs:
- $s(x) = W_g\, x$,
- $g_i(x) = \dfrac{\exp(s_i(x))}{\sum_{j \in \mathrm{Top}k(s(x))} \exp(s_j(x))}$ for $i$ among the top-$k$ entries, zero otherwise.
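This sparse top-$k$ softmax can be sketched directly (toy sizes; `sparse_gate` is a hypothetical helper name):

```python
import numpy as np

def sparse_gate(scores, k=2):
    """Softmax over only the top-k gating scores; all other weights are exactly zero."""
    g = np.zeros_like(scores, dtype=float)
    top = np.argsort(scores)[-k:]                # indices of the k largest scores
    e = np.exp(scores[top] - scores[top].max())  # stable softmax over the top-k only
    g[top] = e / e.sum()
    return g

# s(x) = W_g x for a toy gating layer over N = 5 hard prompts, d = 8.
rng = np.random.default_rng(1)
W_g = rng.normal(size=(5, 8))
x = rng.normal(size=8)
g = sparse_gate(W_g @ x, k=2)
```

The hard zeroing means only the selected $k$ teachers contribute to the mixture, so each image distills from a small, sample-specific subset of the prompt pool.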
4. Training Procedure and Implementation Details
- Backbones: ViT-B/16 image encoder and transformer text encoder (frozen).
- Soft prompt: initialized to the embedding of “a photo of a” (four context tokens).
- Hard prompt pool: $N$ hard prompts, with the pool size dataset-dependent.
- Gating: single FC layer; top-$k$ prompt selection.
- Training: trade-off coefficients $\lambda_1$ and $\lambda_2$ are dataset-dependent (e.g., $0.5$ on some datasets such as HICO, or $1.0$ for certain tasks).
- Optimization: AdamW, 50 epochs for base-to-new, following the CoOp protocol for few-shot settings.
- Evaluation:
- Base-to-new: train on base (16 shots/class), test on base and unseen; report $\text{acc}_{base}$, $\text{acc}_{new}$, and their harmonic mean $H$.
- Few-shot: end-to-end training/testing on same classes for 1/2/4/8/16 shots.
- Domain generalization: ImageNet-16-shot training, tested on ImageNet {V2, Sketch, A, R} (Chen et al., 2024).
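The harmonic mean used in base-to-new evaluation is simple to compute; the check below applies it to CoOp's reported base/new accuracies:

```python
def harmonic_mean(acc_base, acc_new):
    """Harmonic mean H, balancing base- and new-class accuracy in base-to-new evaluation."""
    return 2 * acc_base * acc_new / (acc_base + acc_new)

# CoOp's reported 82.64 (base) and 68.00 (new) recover its harmonic mean of ~74.61.
h = harmonic_mean(82.64, 68.00)
```

Because $H$ is dominated by the weaker of the two accuracies, it rewards methods that trade a small base-class drop for a larger unseen-class gain, which is exactly MoPD's trade-off.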
5. Experimental Results
Base-to-New Generalization
Across 11 datasets with 16-shot base training, MoPD achieves:
| Method | Base | New | H |
|---|---|---|---|
| CoOp | 82.64 | 68.00 | 74.61 |
| CoCoOp | – | – | 75.83 |
| ProGrad | – | – | 76.15 |
| KgCoOp | – | – | 77.00 |
| MoPD | 81.40 | 74.69 | 77.90 |
MoPD attains the highest average $\text{acc}_{new}$ and harmonic mean $H$, outperforming baselines on unseen classes by 1.1–2.5 points and achieving the highest harmonic mean on 10 of 11 datasets.
Few-Shot and Domain Generalization
MoPD consistently achieves the highest average accuracy in 1/2/4/8/16-shot classification across 11 datasets, demonstrating robustness in extremely limited data regimes.
For ImageNet-to-variant transfer, MoPD yields an average target accuracy of 60.29, outperforming KgCoOp (60.11), with small gains on ImageNet-Sketch (+0.13) and ImageNet-A (+0.08).
Ablation and Robustness
- Single-prompt distillation (SiPD) vs. CoOp: $H$ improves from 74.61 to 77.04 ($\text{acc}_{base}$ drops marginally).
- MoPD (multi-prompt) vs. SiPD: $H$ increases by 0.86 and $\text{acc}_{new}$ by 1.34.
- Omitting the MPS loss: $H$ decreases by 0.21.
- Random prompt selection (MoPD-R) vs. learned gating: $H$ decreases by 0.64 and $\text{acc}_{new}$ by 1.17.
- MoPD remains robust under noise: with up to 24 noisy templates in a pool of 36, the gating network keeps $H$ within 0.1–0.3 points of clean-pool performance, whereas random selection degrades by more than 2 points.
6. Impact and Comparison to Prior Work
MoPD achieves substantial improvements over existing soft prompt learning methods in terms of generalization to unseen classes. Prior art such as CoOp, CoCoOp, ProGrad, and KgCoOp suffer from seen-class bias under few-shot training due to insufficient knowledge transfer from hand-crafted prompt variations. By distilling knowledge from a dynamically selected, data-dependent mixture of hard prompts via a dedicated gating mechanism, MoPD demonstrates that soft prompt generalization can be systematically improved via mixture-based prompt distillation, rather than mere joint training or random hard prompt selection.
A plausible implication is that adaptive prompt selection—rather than static ensembling or random selection—is essential for robust transfer to out-of-distribution and unseen scenarios (Chen et al., 2024).
7. Conclusion and Future Directions
By enabling image-specific distillation from a select subset of hard prompts, Mixture-of-Prompts Distillation addresses core limitations of soft prompt adaptation in VLMs, particularly overfitting to seen classes and reduced generalization to novel categories. The competitive results across base-to-new, few-shot, and domain-generalization tasks establish MoPD as the new state-of-the-art in soft prompt learning under data scarcity and domain shift (Chen et al., 2024). This suggests future research directions in more advanced gating mechanisms, scalable prompt pools, and extension to non-vision-language architectures.