Mixture-of-Prompts Distillation (MoPD)

Updated 16 January 2026

The paper introduces MoPD, a novel framework that transfers class-discriminative knowledge via gating-based distillation to mitigate seen-class overfitting in soft prompt learning.
MoPD integrates a composite loss (cross-entropy, MPD, and MPS) and dynamic hard prompt selection to enhance model generalization on unseen classes.
Experimental results across 11 datasets show that MoPD outperforms baseline methods like CoOp and CoCoOp in base-to-new, few-shot, and domain generalization tasks.

Mixture-of-Prompts Distillation (MoPD) introduces a knowledge transfer framework for soft prompt learning in vision-LLMs (VLMs). Targeting the persistent issue of seen-class overfitting in soft prompt adaptation, MoPD utilizes a gating-based distillation procedure to transfer knowledge from a curated mixture of hard, hand-crafted prompts (“teacher prompts”) to a learnable soft prompt (“student prompt”). This mechanism demonstrably improves generalization to unseen classes and outperforms state-of-the-art baseline methods in standard and challenging vision-language benchmarks (Chen et al., 2024).

1. Motivation and Problem Statement

Vision-LLMs like CLIP exhibit notable zero-shot classification performance when prompted by carefully designed hard prompts (e.g., “a photo of a [CLASS]”). Soft prompt learning methods, such as CoOp and CoCoOp, instead employ trainable prompt vectors but are prone to overfitting to base (“seen”) classes, resulting in limited transferability to novel (“unseen”) classes. This overfitting is attributed to inherent bias in few-shot training data distributions, which predominantly feature base classes. MoPD addresses this limitation by enabling effective transfer of class-discriminative knowledge from a diverse set of hand-crafted prompts to the learnable soft prompt via a controlled, sample-specific distillation process, thereby enhancing generalization performance (Chen et al., 2024).

2. Architecture and Workflow

The MoPD architecture introduces four major components:

Pre-trained CLIP backbone: The image and text encoders are kept frozen throughout adaptation.
Hard prompt pool: A set of $H$ hard prompts, denoted $\{\text{Prompt}_{\text{hard},t}\}_{t=1}^H$ , typically obtained via template or synonym variation, each inducing class embeddings $\mathbf{t}_{\text{hard},t}^c$ through the CLIP text encoder for class $c$ .
Soft prompt: A learnable sequence $V = [v_1,\dots,v_M,\text{[CLASS]}]$ of $M$ prompt vectors, prepended to each class name and encoded to produce $\mathbf{t}_{\text{soft}}^c$ .
Gating network $G$ : A single fully-connected layer with weights $W_g \in \mathbb{R}^{d \times H}$ , taking an image embedding $f \in \mathbb{R}^d$ as input and selecting a sample-specific mixture of $T$ hard prompts using a sparse softmax over the top- $T$ gating scores.

The overall forward pass for an image $x$ with label $y$ proceeds as follows:

$f = \text{ImageEncoder}(x)$ ;
Compute soft-prompt class probabilities $p_{\text{soft}}(y|x)$ via cosine similarity against $\mathbf{t}_{\text{soft}}^c$ with temperature $\tau$ ;
For each teacher prompt $t$ , compute $p_{\text{hard},t}(y|x)$ similarly;
Compute gating weights $w_t$ for the top- $T$ hard prompts;
Evaluate and backpropagate losses into the soft prompt and gating network only, with frozen encoders (Chen et al., 2024).

3. Mathematical Formulation

Let $p_{\text{soft}}(c|x)$ and $p_{\text{hard},t}(c|x)$ denote the predicted distributions over class $c$ using the soft prompt and the $t$ -th hard prompt, respectively:

$p_{\text{soft}}(c|x) = \frac{\exp\left(\text{cos}(\mathbf{t}_{\text{soft}}^c, f)/\tau\right)}{\sum_{c'} \exp\left(\text{cos}(\mathbf{t}_{\text{soft}}^{c'}, f)/\tau\right)},$

$p_{\text{hard},t}(c|x) = \frac{\exp\left(\text{cos}(\mathbf{t}_{\text{hard},t}^c, f)/\tau\right)}{\sum_{c'} \exp\left(\text{cos}(\mathbf{t}_{\text{hard},t}^{c'}, f)/\tau\right)}.$

The composite MoPD loss combines three terms:

Cross-entropy on base classes:

$L_{\text{CE}} = -\sum_{(x,y)\in D}\log p_{\text{soft}}(y|x)$

Mixture-of-prompts distillation (MPD) loss:

$L_{\text{MPD}} = \sum_{x \in D} \sum_{t=1}^H w_t \, \text{KL}(p_{\text{soft}}(\cdot|x)\,\|\, p_{\text{hard},t}(\cdot|x))$

Mixture-of-prompts selection (MPS) loss:

$L_{\text{MPS}} = -\sum_{(x,y)\in D} \sum_{t=1}^H w_t \log p_{\text{hard},t}(y|x)$

The full objective is:

$L = \alpha \cdot L_{\text{CE}} + (1-\alpha) \cdot L_{\text{MPD}} + \beta \cdot L_{\text{MPS}}$

where $\alpha, \beta \in [0, 1]$ are empirically selected trade-off coefficients (Chen et al., 2024).

Mixture weights $w$ are generated by applying a softmax to only the top- $T$ entries of the gating network’s outputs:

$g = f \cdot W_g$ ,
$w_i = \exp(g_i) / \sum_j \exp(g_j)$ , for $i$ among the top- $T$ entries, zero otherwise.

4. Training Procedure and Implementation Details

Backbones: ViT-B/16 image encoder and transformer text encoder (frozen).
Soft prompt: Length $M=4$ , initialized to “a photo of a”.
Hard prompt pool: $H=12$ hard prompts (dataset-dependent).
Gating: Single FC layer; top- $T$ prompt selection, $T=2\text{ or }3$ .
Training: Parameters $\alpha=0.8$ (or $0.5$ depending on dataset, e.g., HICO), $\beta=5\times10^{-4}$ (or $1.0$ for certain tasks).
Optimization: AdamW, learning rate $\approx 0.002$ , 50 epochs for base-to-new, following the CoOp protocol for few-shot settings.
Evaluation:
- Base-to-new: train on base (16 shots/class), test on base and unseen; report $acc_{base}$ , $acc_{new}$ , $H = 2/(1/acc_{base} + 1/acc_{new})$ .
- Few-shot: end-to-end training/testing on same classes for 1/2/4/8/16 shots.
- Domain generalization: ImageNet-16-shot training, tested on ImageNet {V2, Sketch, A, R} (Chen et al., 2024).

5. Experimental Results

Base-to-New Generalization

Across 11 datasets with 16-shot base training, MoPD achieves:

Method	$acc_{base}$	$acc_{new}$	$H$
CoOp	82.64	68.00	74.61
CoCoOp	–	–	75.83
ProGrad	–	–	76.15
KgCoOp	–	–	77.00
MoPD	81.40	74.69	77.90

MoPD attains the highest average $H$ and $acc_{new}$ , outperforming baselines on unseen classes by 1.1–2.5 points and achieving the highest harmonic mean in 10/11 datasets.

Few-Shot and Domain Generalization

MoPD consistently achieves the highest average accuracy in 1/2/4/8/16-shot classification across 11 datasets, demonstrating robustness in extremely limited data regimes.

For ImageNet-to-variant transfer, MoPD yields an average target accuracy of 60.29, outperforming KgCoOp (60.11), and shows particular improvements on Sketch (+0.13) and A (+0.08).

Ablation and Robustness

Single-prompt distillation (SiPD) vs. CoOp: $H$ improves from 74.61 to 77.04 (acc_base drops marginally).
MoPD (multi-prompt) vs. SiPD: $H$ increases by 0.86, $acc_{new}$ by 1.34.
Omitting MPS loss: $H$ decreases by 0.21.
Random prompt selection (MoPD-R) vs. learned gating: $H$ decreases by 0.64, $acc_{new}$ by 1.17.
MoPD remains robust under noise: with up to 24 noisy templates in a pool of 36, the gating network maintains $H$ within 0.1–0.3 points of clean pool performance, whereas random selection degrades $H$ by more than 2 points.

6. Impact and Comparison to Prior Work

MoPD achieves substantial improvements over existing soft prompt learning methods in terms of generalization to unseen classes. Prior art such as CoOp, CoCoOp, ProGrad, and KgCoOp suffer from seen-class bias under few-shot training due to insufficient knowledge transfer from hand-crafted prompt variations. By distilling knowledge from a dynamically selected, data-dependent mixture of hard prompts via a dedicated gating mechanism, MoPD demonstrates that soft prompt generalization can be systematically improved via mixture-based prompt distillation, rather than mere joint training or random hard prompt selection.

A plausible implication is that adaptive prompt selection—rather than static ensembling or random selection—is essential for robust transfer to out-of-distribution and unseen scenarios (Chen et al., 2024).

7. Conclusion and Future Directions

By enabling image-specific distillation from a select subset of hard prompts, Mixture-of-Prompts Distillation addresses core limitations of soft prompt adaptation in VLMs, particularly overfitting to seen classes and reduced generalization to novel categories. The competitive results across base-to-new, few-shot, and domain-generalization tasks establish MoPD as the new state-of-the-art in soft prompt learning under data scarcity and domain shift (Chen et al., 2024). This suggests future research directions in more advanced gating mechanisms, scalable prompt pools, and extension to non-vision-language architectures.

Markdown Report Issue Upgrade to Chat

References (1)

MoPD: Mixture-of-Prompts Distillation for Vision-Language Models (2024)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Mixture-of-Prompts Distillation (MoPD).

Mixture-of-Prompts Distillation (MoPD)

1. Motivation and Problem Statement

2. Architecture and Workflow

3. Mathematical Formulation

4. Training Procedure and Implementation Details

5. Experimental Results

Base-to-New Generalization

Few-Shot and Domain Generalization

Ablation and Robustness

6. Impact and Comparison to Prior Work

7. Conclusion and Future Directions

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

Mixture-of-Prompts Distillation (MoPD)

1. Motivation and Problem Statement

2. Architecture and Workflow

3. Mathematical Formulation

4. Training Procedure and Implementation Details

5. Experimental Results

Base-to-New Generalization

Few-Shot and Domain Generalization

Ablation and Robustness

6. Impact and Comparison to Prior Work

7. Conclusion and Future Directions

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research