VaMP: Variational Multi-Modal Prompt Learning
- The paper introduces VaMP, a probabilistic prompt-tuning paradigm that adapts prompts using variational inference, capturing instance-specific uncertainties and class semantics.
- It leverages a novel latent prompt modeling framework with class-aware priors to enhance few-shot learning and domain generalization in vision-language models.
- Experimental results indicate that VaMP achieves state-of-the-art performance with only a marginal latency increase and extends to varied multimodal tasks such as open-vocabulary segmentation and action recognition.
Variational Multi-Modal Prompt Learning (VaMP) is a probabilistic prompt-tuning paradigm for vision-language models (VLMs) that enables instance-conditioned, uncertainty-aware adaptation of prompts to input content and class semantics. Addressing the limitations of deterministic prompt learning and building on recent advances in variational prompt modeling, VaMP unifies instance-specific prompt generation, semantic consistency through class-aware priors, and end-to-end training via variational inference. It achieves state-of-the-art results in few-shot and domain generalization settings, is compatible with diverse CLIP-style backbones, and extends to a range of multimodal tasks (Cheng et al., 27 Nov 2025).
1. Motivation and Limitations of Deterministic Prompt Learning
Existing multi-modal prompt learning for VLMs, such as CLIP and its soft prompt-tuning extensions, typically uses deterministic, globally shared prompt tokens. These approaches cannot adapt the prompt representation to instance-specific visual or semantic cues and do not quantify epistemic uncertainty, which is critical for robust generalization, particularly in few-shot or out-of-distribution regimes. Additionally, standard practice assumes isotropic Gaussian priors in the prompt latent space, disregarding inter-class semantic variability (Cheng et al., 27 Nov 2025). These design choices constrain model expressivity and impair robust transfer.
2. Probabilistic Formulation: Latent Prompt Modeling and Inference
VaMP recasts prompt tuning as variational inference over a learned latent prompt sequence $\mathbf{z} = \{z_\ell\}_{\ell=1}^{L}$, where each $z_\ell$ is injected as a prompt token at transformer layer $\ell$ of the frozen CLIP text encoder. For each input image $x$ and label $y$, the framework models an instance-conditioned posterior over $z_\ell$, paired with a class-aware prior over the same latents:

$$q_\phi(z_\ell \mid x) = \mathcal{N}\!\big(z_\ell;\ \mu_\ell^{q}(f(x)),\ \mathrm{diag}\,\sigma_\ell^{q}(f(x))^{2}\big),$$

where $f(x)$ denotes the frozen CLIP image embedding and the $\mu_\ell^{q}, \sigma_\ell^{q}$ are small MLPs. Uncertainty and sample-specificity are captured by the distributional posterior. VaMP employs a class-aware prior constructed from class prototypes $c_y$ computed with frozen CLIP features; each layer's prior is parameterized as

$$p(z_\ell \mid y) = \mathcal{N}\!\big(z_\ell;\ \mu_\ell^{p}(c_y),\ \mathrm{diag}\,\sigma_\ell^{p}(c_y)^{2}\big)$$

for MLPs $\mu_\ell^{p}, \sigma_\ell^{p}$. The prior thus encodes semantic structure beyond mere isotropy (Cheng et al., 27 Nov 2025).
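As a concrete illustration, the following PyTorch sketch shows how per-layer posterior and prior Gaussian parameters could be produced from the frozen CLIP image embedding $f(x)$ and the class prototype $c_y$. This is not the authors' code; module names, hidden sizes, and the softplus parameterization of the standard deviation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Small MLP mapping a conditioning vector to diagonal Gaussian parameters (mu, sigma)."""
    def __init__(self, cond_dim: int, prompt_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, prompt_dim)
        self.sigma_raw = nn.Linear(hidden, prompt_dim)

    def forward(self, cond: torch.Tensor):
        h = self.net(cond)
        mu = self.mu(h)
        sigma = F.softplus(self.sigma_raw(h)) + 1e-4  # keep the std strictly positive
        return mu, sigma

class LatentPromptDistributions(nn.Module):
    """Per-layer posterior q(z_l | x) and class-aware prior p(z_l | y)."""
    def __init__(self, embed_dim: int, prompt_dim: int, num_layers: int):
        super().__init__()
        self.posterior_heads = nn.ModuleList(
            GaussianHead(embed_dim, prompt_dim) for _ in range(num_layers))
        self.prior_heads = nn.ModuleList(
            GaussianHead(embed_dim, prompt_dim) for _ in range(num_layers))

    def posterior(self, img_embed: torch.Tensor):
        # img_embed: frozen CLIP image features f(x), shape (B, embed_dim)
        return [head(img_embed) for head in self.posterior_heads]

    def prior(self, class_proto: torch.Tensor):
        # class_proto: prototype c_y of the ground-truth class, shape (B, embed_dim)
        return [head(class_proto) for head in self.prior_heads]
```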
3. Evidence Lower Bound Objective and Optimization
Variational training seeks to maximize the marginal log-likelihood of target labels under the prompted model, realized via the evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(\mathbf{z} \mid x)}\!\big[\log p(y \mid x, \mathbf{z})\big] \;-\; \sum_{\ell=1}^{L} \mathrm{KL}\!\big(q_\phi(z_\ell \mid x)\,\|\,p(z_\ell \mid y)\big).$$

The first term encourages correct predictions conditioned on the sampled latent prompt sequence $\mathbf{z} = \{z_\ell\}_{\ell=1}^{L}$; the second term regularizes the instance posterior against the class-aware prior, maintaining semantic consistency. Latent sampling is performed per instance and per prompt layer via the reparameterization trick:

$$z_\ell = \mu_\ell^{q}(f(x)) + \sigma_\ell^{q}(f(x)) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
All MLPs and prompt generator components are trained end-to-end with the CLIP backbones remaining frozen. Monte Carlo estimation is used for the ELBO expectation in practice (Cheng et al., 27 Nov 2025).
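A minimal training-step sketch, continuing the module above and reusing its imports: `prompted_logits` is a hypothetical callable standing in for the frozen CLIP forward pass with the sampled prompt tokens injected into the text encoder (the source does not specify this interface), and a single Monte Carlo sample per instance is drawn.

```python
def diag_gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2) ), summed over dimensions."""
    var_q, var_p = sigma_q.pow(2), sigma_p.pow(2)
    kl = 0.5 * (var_q / var_p + (mu_p - mu_q).pow(2) / var_p
                - 1.0 + 2.0 * (sigma_p.log() - sigma_q.log()))
    return kl.sum(dim=-1)

def elbo_loss(dists, prompted_logits, img_embed, class_proto, labels):
    """Negative ELBO for one batch, using one Monte Carlo sample per instance."""
    q_params = dists.posterior(img_embed)      # list of (mu, sigma), one per prompt layer
    p_params = dists.prior(class_proto)

    # Reparameterization trick: z_l = mu_l + sigma_l * eps,  eps ~ N(0, I)
    z = [mu + sigma * torch.randn_like(sigma) for mu, sigma in q_params]

    # Classification term: frozen CLIP scored with prompts z injected into the text encoder.
    logits = prompted_logits(z)                # (B, num_classes); hypothetical callable
    nll = F.cross_entropy(logits, labels)

    # Layer-wise KL between the instance posterior and the class-aware prior.
    kl = sum(diag_gaussian_kl(mq, sq, mp, sp)
             for (mq, sq), (mp, sp) in zip(q_params, p_params)).mean()

    return nll + kl                            # minimizing this maximizes the ELBO
```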
4. Prompt Generation, Model Architecture, and Integration
The VaMP framework is structured as follows:
- Visual Encoder: A frozen CLIP ViT extracts the image embedding $f(x)$.
- Posterior MLPs: Produce per-layer posterior means $\mu_\ell^{q}(f(x))$ and standard deviations $\sigma_\ell^{q}(f(x))$ for each of the $L$ prompt-injected transformer layers.
- Prior MLPs: Generate class-anchored Gaussian priors $p(z_\ell \mid y)$ from the class prototype $c_y$.
- Prompt Injection: Layer-wise prompt tokens $z_\ell$ sampled from the posterior are concatenated to the text encoder inputs at layers $\ell = 1, \dots, L$.
- Visual-side Prompts: Optional deterministic visual prompts may be injected into the image encoder, but uncertainty modeling is reserved for the text side.
- Prediction: Prompted image and text features are projected and scored by cosine similarity, with softmax yielding class probabilities.
The architecture is designed to support efficient end-to-end training while retaining parameter efficiency (Cheng et al., 27 Nov 2025).
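The prediction step can be sketched as a standard CLIP-style cosine-similarity head; the logit-scale value and feature shapes below are illustrative assumptions, and the imports from the earlier sketch are reused.

```python
def classify(image_feat, text_feats, logit_scale: float = 100.0):
    """Cosine-similarity scoring of one batch of prompted image features against
    the prompted per-class text features, followed by a softmax over classes."""
    image_feat = F.normalize(image_feat, dim=-1)          # (B, D)
    text_feats = F.normalize(text_feats, dim=-1)          # (C, D), one row per class
    logits = logit_scale * image_feat @ text_feats.t()    # (B, C)
    return logits.softmax(dim=-1)
```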
5. Class-Aware Priors and Semantic Consistency
A distinct innovation in VaMP is the class-aware prior construction:

$$c_y = \frac{1}{|\mathcal{D}_y|} \sum_{x_i \in \mathcal{D}_y} f(x_i), \qquad p(z_\ell \mid y) = \mathcal{N}\!\big(z_\ell;\ \mu_\ell^{p}(c_y),\ \mathrm{diag}\,\sigma_\ell^{p}(c_y)^{2}\big).$$

By computing $c_y$ as the empirical mean of the frozen CLIP embeddings of all training images from class $y$ and mapping it through small prior MLPs, VaMP regularizes prompt representations so that samples from the same class are close in the latent space. This promotes intra-class coherence while supporting inter-class discrimination. Compared to a standard isotropic Gaussian prior, this yields a measurable gain in downstream generalization metrics (Cheng et al., 27 Nov 2025).
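A short sketch of prototype construction under these assumptions (the exact pooling and normalization choices are not specified in the source; the sketch continues the earlier imports):

```python
@torch.no_grad()
def class_prototypes(clip_image_encoder, images, labels, num_classes: int):
    """c_y = mean of frozen CLIP image embeddings over the training images of class y."""
    feats = clip_image_encoder(images)                      # (N, D), frozen encoder
    protos = torch.zeros(num_classes, feats.size(-1), device=feats.device)
    for y in range(num_classes):
        protos[y] = feats[labels == y].mean(dim=0)
    return protos
```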
6. Empirical Results: Generalization, Robustness, and Efficiency
Few-Shot and Domain Generalization
In evaluations across 11 vision-language datasets in the 16-shot setting, VaMP outperforms leading baselines such as MMRL, improving novel-class accuracy from 77.16% to 78.67% and raising the harmonic mean (HM) by ≈1.17 points (81.20 → 82.37). On ImageNet-variant domain generalization, average top-1 accuracy reaches 61.73% (+1.20 points over MMRL), with similar gains on cross-dataset transfer (10 unseen datasets: 67.74% vs. 67.25%).
Ablations and Uncertainty Analysis
Ablation studies demonstrate that sample-specific, probabilistic prompts (versus deterministic or task-specific ones) yield an HM gain of +1.04 to +1.28 points. Using class-aware priors rather than standard isotropic Gaussians contributes a further +0.55-point HM improvement (Cheng et al., 27 Nov 2025).
Latency and Scalability
With Monte Carlo sampling at inference, the added latency is marginal (≈0.8 ms/image on an NVIDIA V100), while the HM improvements are retained. The method is compatible with EVA-CLIP, SigLIP, SigLIP2, and other recent CLIP-style backbones.
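A plausible Monte Carlo inference loop is sketched below, reusing the earlier components; the number of samples and the hypothetical `prompted_text_feats` callable (text features with the sampled prompts injected) are illustrative assumptions rather than the paper's exact procedure.

```python
@torch.no_grad()
def predict_mc(dists, prompted_text_feats, image_feat, img_embed, num_samples: int = 4):
    """Average class probabilities over several sampled prompt sequences."""
    q_params = dists.posterior(img_embed)
    probs = 0.0
    for _ in range(num_samples):
        z = [mu + sigma * torch.randn_like(sigma) for mu, sigma in q_params]
        text_feats = prompted_text_feats(z)   # hypothetical: per-class text features given z
        probs = probs + classify(image_feat, text_feats)
    return probs / num_samples
```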
Task Extensions
VaMP demonstrates successful adaptation to open-vocabulary segmentation (CAT-Seg, +1.1 mIoU) and action recognition (FROSTER, +0.6 HM), indicating architectural generality (Cheng et al., 27 Nov 2025).
7. Relationship to Variational Prompt Modeling and Future Directions
Prior work on modeling prompt distributional robustness, such as MVP (Modeling Variants of Prompts) (Li et al., 11 Mar 2025), focuses on invariance to prompt template structure by decoupling templates and class names, then using a VAE to encode template structural variation. MVP's approach mitigates prompt sensitivity in natural language templates but does not address instance-specific, image-conditioned adaptation or uncertainty modeling. By contrast, VaMP broadens the variational principle: it enables prompt representations to be adapted per input instance, integrates a class-aware latent space, and unifies prompt tuning with semantic regularization.
Potential extensions for VaMP include more expressive priors (e.g., with normalizing flows), dynamic updating of class prototypes, and expanded application to multi-modal reasoning, compositional prompts, and adversarial robustness. This suggests that a unifying variational framework for both prompt robustness and input-aware adaptation is a promising trajectory for future vision-language research (Cheng et al., 27 Nov 2025, Li et al., 11 Mar 2025).