
Mixture-of-Prompts Distillation (MoPD)

Updated 16 January 2026
  • The paper introduces MoPD, a novel framework that transfers class-discriminative knowledge via gating-based distillation to mitigate seen-class overfitting in soft prompt learning.
  • MoPD integrates a composite loss (cross-entropy, MPD, and MPS) and dynamic hard prompt selection to enhance model generalization on unseen classes.
  • Experimental results across 11 datasets show that MoPD outperforms baseline methods like CoOp and CoCoOp in base-to-new, few-shot, and domain generalization tasks.

Mixture-of-Prompts Distillation (MoPD) introduces a knowledge transfer framework for soft prompt learning in vision-language models (VLMs). Targeting the persistent issue of seen-class overfitting in soft prompt adaptation, MoPD utilizes a gating-based distillation procedure to transfer knowledge from a curated mixture of hard, hand-crafted prompts (“teacher prompts”) to a learnable soft prompt (“student prompt”). This mechanism demonstrably improves generalization to unseen classes and outperforms state-of-the-art baseline methods in standard and challenging vision-language benchmarks (Chen et al., 2024).

1. Motivation and Problem Statement

Vision-language models like CLIP exhibit notable zero-shot classification performance when prompted by carefully designed hard prompts (e.g., “a photo of a [CLASS]”). Soft prompt learning methods, such as CoOp and CoCoOp, instead employ trainable prompt vectors but are prone to overfitting to base (“seen”) classes, resulting in limited transferability to novel (“unseen”) classes. This overfitting is attributed to inherent bias in few-shot training data distributions, which predominantly feature base classes. MoPD addresses this limitation by enabling effective transfer of class-discriminative knowledge from a diverse set of hand-crafted prompts to the learnable soft prompt via a controlled, sample-specific distillation process, thereby enhancing generalization performance (Chen et al., 2024).

2. Architecture and Workflow

The MoPD architecture introduces four major components:

  • Pre-trained CLIP backbone: The image and text encoders are kept frozen throughout adaptation.
  • Hard prompt pool: A set of $H$ hard prompts, denoted $\{\text{Prompt}_{\text{hard},t}\}_{t=1}^{H}$, typically obtained via template or synonym variation, each inducing class embeddings $\mathbf{t}_{\text{hard},t}^c$ through the CLIP text encoder for class $c$.
  • Soft prompt: A learnable sequence $V = [v_1, \dots, v_M, \text{[CLASS]}]$ of $M$ prompt vectors, prepended to each class name and encoded to produce $\mathbf{t}_{\text{soft}}^c$.
  • Gating network $G$: A single fully-connected layer with weights $W_g \in \mathbb{R}^{d \times H}$, taking an image embedding $f \in \mathbb{R}^d$ as input and selecting a sample-specific mixture of $T$ hard prompts using a sparse softmax over the top-$T$ gating scores.

The overall forward pass for an image $x$ with label $y$ proceeds as follows:

  1. $f = \text{ImageEncoder}(x)$;
  2. Compute soft-prompt class probabilities $p_{\text{soft}}(y|x)$ via cosine similarity against $\mathbf{t}_{\text{soft}}^c$ with temperature $\tau$;
  3. For each teacher prompt $t$, compute $p_{\text{hard},t}(y|x)$ similarly;
  4. Compute gating weights $w_t$ for the top-$T$ hard prompts;
  5. Evaluate and backpropagate losses into the soft prompt and gating network only, with frozen encoders (Chen et al., 2024).

3. Mathematical Formulation

Let $p_{\text{soft}}(c|x)$ and $p_{\text{hard},t}(c|x)$ denote the predicted distributions over class $c$ using the soft prompt and the $t$-th hard prompt, respectively:

$$p_{\text{soft}}(c|x) = \frac{\exp\big(\cos(\mathbf{t}_{\text{soft}}^c, f)/\tau\big)}{\sum_{c'} \exp\big(\cos(\mathbf{t}_{\text{soft}}^{c'}, f)/\tau\big)},$$

$$p_{\text{hard},t}(c|x) = \frac{\exp\big(\cos(\mathbf{t}_{\text{hard},t}^c, f)/\tau\big)}{\sum_{c'} \exp\big(\cos(\mathbf{t}_{\text{hard},t}^{c'}, f)/\tau\big)}.$$
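In code, both distributions amount to a temperature-scaled softmax over cosine similarities between the image embedding and the per-class text embeddings. A minimal NumPy sketch (function and variable names are ours, not from the paper's implementation):

```python
import numpy as np

def prompt_probs(f, class_embs, tau=0.01):
    """Class distribution p(c|x) from cosine similarities between an image
    embedding f of shape (d,) and per-class text embeddings of shape (C, d),
    softmaxed with temperature tau."""
    f = f / np.linalg.norm(f)
    t = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    logits = t @ f / tau            # cos(t_c, f) / tau, shape (C,)
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

The same function serves for the soft prompt (pass $\mathbf{t}_{\text{soft}}^c$ embeddings) and each hard prompt (pass $\mathbf{t}_{\text{hard},t}^c$ embeddings).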

The composite MoPD loss combines three terms:

  • Cross-entropy on base classes:

$$L_{\text{CE}} = -\sum_{(x,y)\in D}\log p_{\text{soft}}(y|x)$$

  • Mixture-of-prompts distillation (MPD) loss:

$$L_{\text{MPD}} = \sum_{x \in D} \sum_{t=1}^{H} w_t \,\mathrm{KL}\!\left(p_{\text{soft}}(\cdot|x)\,\|\,p_{\text{hard},t}(\cdot|x)\right)$$

  • Mixture-of-prompts selection (MPS) loss:

$$L_{\text{MPS}} = -\sum_{(x,y)\in D} \sum_{t=1}^{H} w_t \log p_{\text{hard},t}(y|x)$$

The full objective is:

$$L = \alpha L_{\text{CE}} + (1-\alpha) L_{\text{MPD}} + \beta L_{\text{MPS}}$$

where $\alpha, \beta \in [0, 1]$ are empirically selected trade-off coefficients (Chen et al., 2024).
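A per-sample sketch of this composite objective in NumPy (illustrative only; the function name, argument layout, and per-sample batching convention are our assumptions, not the authors' code):

```python
import numpy as np

def mopd_loss(p_soft, p_hard, w, y, alpha=0.8, beta=5e-4):
    """Composite MoPD loss for one sample.

    p_soft: (C,) soft-prompt class distribution
    p_hard: (H, C) per-teacher class distributions
    w:      (H,) gating weights (zero outside the top-T entries)
    y:      integer ground-truth label
    """
    eps = 1e-12
    # Cross-entropy on the soft prompt's prediction
    l_ce = -np.log(p_soft[y] + eps)
    # KL(p_soft || p_hard_t) per teacher, weighted by the gating weights
    kl = (p_soft * (np.log(p_soft + eps) - np.log(p_hard + eps))).sum(axis=1)
    l_mpd = (w * kl).sum()
    # Selection loss: gated teachers should themselves classify x correctly
    l_mps = -(w * np.log(p_hard[:, y] + eps)).sum()
    return alpha * l_ce + (1 - alpha) * l_mpd + beta * l_mps
```

Note that when every selected teacher agrees with the student (KL term near zero), the objective reduces to the usual cross-entropy plus the small MPS regularizer.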

Mixture weights $w$ are generated by applying a softmax to only the top-$T$ entries of the gating network’s outputs:

  • $g = f \cdot W_g$,
  • $w_i = \exp(g_i) / \sum_{j \in \text{top-}T} \exp(g_j)$ for $i$ among the top-$T$ entries, and $w_i = 0$ otherwise.
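This sparse top-$T$ gating can be sketched as follows (illustrative NumPy; the softmax is normalized over the selected top-$T$ scores so the nonzero weights form a distribution):

```python
import numpy as np

def gating_weights(f, W_g, T=2):
    """Sparse mixture weights from gating scores g = f W_g.

    f:   (d,) image embedding
    W_g: (d, H) gating layer weights
    Returns a (H,) vector with exactly T nonzero entries summing to 1.
    """
    g = f @ W_g                      # (H,) gating scores
    top = np.argsort(g)[-T:]         # indices of the T largest scores
    w = np.zeros_like(g)
    e = np.exp(g[top] - g[top].max())  # softmax over top-T only
    w[top] = e / e.sum()
    return w
```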

4. Training Procedure and Implementation Details

  • Backbones: ViT-B/16 image encoder and transformer text encoder (frozen).
  • Soft prompt: Length $M=4$, initialized to “a photo of a”.
  • Hard prompt pool: $H=12$ hard prompts (dataset-dependent).
  • Gating: Single FC layer; top-$T$ prompt selection, $T=2$ or $3$.
  • Training: Parameters $\alpha=0.8$ (or $0.5$ depending on dataset, e.g., HICO), $\beta=5\times10^{-4}$ (or $1.0$ for certain tasks).
  • Optimization: AdamW, learning rate $\approx 0.002$, 50 epochs for base-to-new, following the CoOp protocol for few-shot settings.
  • Evaluation:
    • Base-to-new: train on base (16 shots/class), test on base and unseen; report $acc_{\text{base}}$, $acc_{\text{new}}$, and $H = 2/(1/acc_{\text{base}} + 1/acc_{\text{new}})$.
    • Few-shot: end-to-end training/testing on same classes for 1/2/4/8/16 shots.
    • Domain generalization: ImageNet-16-shot training, tested on ImageNet {V2, Sketch, A, R} (Chen et al., 2024).
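The harmonic-mean summary metric $H$ is easy to check directly; this small helper (pure Python, function name ours) reproduces the published CoOp average from its base and new accuracies:

```python
def harmonic_mean(acc_base: float, acc_new: float) -> float:
    """Base-to-new summary metric: H = 2 / (1/acc_base + 1/acc_new).

    Penalizes trading unseen-class accuracy for seen-class accuracy
    more strongly than an arithmetic mean would.
    """
    return 2.0 / (1.0 / acc_base + 1.0 / acc_new)

# CoOp row: base 82.64, new 68.00
print(round(harmonic_mean(82.64, 68.00), 2))  # 74.61
```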

5. Experimental Results

Base-to-New Generalization

Across 11 datasets with 16-shot base training, MoPD achieves:

| Method  | $acc_{\text{base}}$ | $acc_{\text{new}}$ | $H$   |
|---------|---------------------|--------------------|-------|
| CoOp    | 82.64               | 68.00              | 74.61 |
| CoCoOp  | –                   | –                  | 75.83 |
| ProGrad | –                   | –                  | 76.15 |
| KgCoOp  | –                   | –                  | 77.00 |
| MoPD    | 81.40               | 74.69              | 77.90 |

MoPD attains the highest average $H$ and $acc_{\text{new}}$, outperforming baselines on unseen classes by 1.1–2.5 points and achieving the highest harmonic mean in 10/11 datasets.

Few-Shot and Domain Generalization

MoPD consistently achieves the highest average accuracy in 1/2/4/8/16-shot classification across 11 datasets, demonstrating robustness in extremely limited data regimes.

For ImageNet-to-variant transfer, MoPD yields an average target accuracy of 60.29, outperforming KgCoOp (60.11), and shows particular improvements on Sketch (+0.13) and A (+0.08).

Ablation and Robustness

  • Single-prompt distillation (SiPD) vs. CoOp: $H$ improves from 74.61 to 77.04 ($acc_{\text{base}}$ drops marginally).
  • MoPD (multi-prompt) vs. SiPD: $H$ increases by 0.86, $acc_{\text{new}}$ by 1.34.
  • Omitting MPS loss: $H$ decreases by 0.21.
  • Random prompt selection (MoPD-R) vs. learned gating: $H$ decreases by 0.64, $acc_{\text{new}}$ by 1.17.
  • MoPD remains robust under noise: with up to 24 noisy templates in a pool of 36, the gating network maintains $H$ within 0.1–0.3 points of clean-pool performance, whereas random selection degrades $H$ by more than 2 points.

6. Impact and Comparison to Prior Work

MoPD achieves substantial improvements over existing soft prompt learning methods in terms of generalization to unseen classes. Prior art such as CoOp, CoCoOp, ProGrad, and KgCoOp suffer from seen-class bias under few-shot training due to insufficient knowledge transfer from hand-crafted prompt variations. By distilling knowledge from a dynamically selected, data-dependent mixture of hard prompts via a dedicated gating mechanism, MoPD demonstrates that soft prompt generalization can be systematically improved via mixture-based prompt distillation, rather than mere joint training or random hard prompt selection.

A plausible implication is that adaptive prompt selection—rather than static ensembling or random selection—is essential for robust transfer to out-of-distribution and unseen scenarios (Chen et al., 2024).

7. Conclusion and Future Directions

By enabling image-specific distillation from a select subset of hard prompts, Mixture-of-Prompts Distillation addresses core limitations of soft prompt adaptation in VLMs, particularly overfitting to seen classes and reduced generalization to novel categories. The competitive results across base-to-new, few-shot, and domain-generalization tasks establish MoPD as the new state-of-the-art in soft prompt learning under data scarcity and domain shift (Chen et al., 2024). This suggests future research directions in more advanced gating mechanisms, scalable prompt pools, and extension to non-vision-language architectures.

