
Progressive Visual Prompt Learning (ProVP-Ref)

Updated 3 April 2026
  • The paper introduces a progressive visual prompt propagation strategy with contrastive feature re-formation to adapt frozen CLIP models for diverse, data-scarce tasks.
  • It leverages adaptive prompts across multiple transformer layers, enhancing instance-specific information flow and achieving state-of-the-art results on 11 benchmarks.
  • Experimental findings demonstrate improved accuracy, stability, and transferability over traditional prompt tuning methods in both few-shot and base-to-novel settings.

Progressive Visual Prompt Learning with Contrastive Feature Re-formation (ProVP-Ref) is a parametric adaptation technique designed to efficiently adapt frozen vision-language (V-L) models, particularly CLIP-based transformers, to diverse downstream tasks with limited labeled data. ProVP-Ref introduces structured progressive visual prompt propagation, where prompts are injected and adaptively refined across multiple transformer layers, and augments this architecture with contrastive feature re-formation to maintain the fidelity of model outputs relative to the pre-trained embedding space. The approach demonstrates state-of-the-art adaptation and generalization in both few-shot and “base-to-novel” classification scenarios, offering substantive stability and transfer advantages over prior prompt- and fine-tuning-based adaptation mechanisms (Xu et al., 2023).

1. Model Architecture and Progressive Prompt Design

ProVP-Ref operates atop a frozen CLIP ViT-B/16 backbone consisting of $N$ transformer layers. For each layer $l$, a learnable prompt matrix $P^l \in \mathbb{R}^{m \times d}$ ($m \ll$ number of patch tokens, $d$ = token dimension) is introduced. Prompt tokens at layer $l$ are progressively fused with the output prompt representation from the preceding layer, promoting both cross-layer information flow and task-specific instance adaptivity.

Formally, for each input image representation $X_0$ (with prepended [CLS] token),

  • At $l = 1$, the input to the first transformer layer is $[X_0; P^1]$.
  • At $l \geq 2$, the next prompt input is mixed by linear decay:

$$\tilde{P}^l = (1 - \alpha)\, P^l + \alpha\, O^{l-1}$$

where $O^{l-1}$ is the prompt output from layer $l-1$ and $\alpha$ is a fixed decay hyperparameter ($\alpha = 0.1$ yields best stability and performance).

Each transformer layer thus consumes $[X_{l-1}; \tilde{P}^l]$ and outputs the updated sequence $X_l$ and updated prompt embeddings $O^l$. The final image representation $z$ is derived by pooling the output $X_N$ after the last layer.

This mechanism enables instance-adaptive prompt evolution across layers, shown to outperform both earlier text- and visual-prompt tuning frameworks in stability and adaptation efficacy.
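The fusion step can be illustrated with a minimal NumPy sketch. The function name `fuse_prompt` and the exact form, a convex combination weighted by a decay `alpha`, are assumptions chosen to match the description above (linear decay, with a larger decay meaning greater reliance on the previous layer's prompt output); this is not the authors' code.

```python
import numpy as np

def fuse_prompt(p_l, o_prev, alpha=0.1):
    """Mix the learnable prompt of layer l with the prompt output of layer l-1.

    Assumed form: (1 - alpha) * P^l + alpha * O^{l-1}.
    """
    p_l = np.asarray(p_l, dtype=float)
    o_prev = np.asarray(o_prev, dtype=float)
    if p_l.shape != o_prev.shape:
        raise ValueError("prompt and previous prompt output must share a shape")
    return (1.0 - alpha) * p_l + alpha * o_prev

# Toy sizes: m prompt tokens of dimension d (illustrative values only).
m, d = 4, 8
rng = np.random.default_rng(0)
p2 = rng.normal(size=(m, d))   # learnable prompt for layer 2
o1 = rng.normal(size=(m, d))   # prompt output of layer 1
fused = fuse_prompt(p2, o1, alpha=0.1)
```

Under this form, a decay of 0 ignores the previous layer's prompt output entirely, while a decay of 1 discards the layer's own learnable prompt.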

2. Mathematical Formulation

The prompt-injected transformer layer forward update for $l = 1, \dots, N$ is:

$$[X_l; O^l] = \mathcal{L}_l\big([X_{l-1}; \tilde{P}^l]\big)$$

with $\tilde{P}^1 = P^1$ at the initial layer. Here $\mathcal{L}_l$ denotes the $l$-th transformer block. The fusion strategy at each layer, mixing the learnable prompt with the previous layer's prompt output $O^{l-1}$ under decay $\alpha$, is:

$$\tilde{P}^l = (1 - \alpha)\, P^l + \alpha\, O^{l-1}, \quad l \geq 2$$

This progressive design enables prompt outputs to propagate into deeper layers, facilitating adaptation both to individual instances and across the distribution of downstream tasks.

The complete adaptation update generalizes as:

$$[X_l; O^l] = \mathcal{L}_l\big([X_{l-1};\ (1 - \alpha)\, P^l + \alpha\, O^{l-1}]\big), \quad l = 2, \dots, N$$

where $X_{l-1}$ encodes the intermediate non-prompted representations.
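The layer-by-layer update can be sketched as a short loop. Everything here is illustrative: `toy_block` is a hypothetical stand-in for a frozen transformer block (a fixed linear map with a crude mean-based token mixing, not real attention), and taking the [CLS] token as the final feature is an assumption about the pooling step.

```python
import numpy as np

def toy_block(seed, d):
    """Hypothetical stand-in for a frozen transformer block."""
    w = np.random.default_rng(seed).normal(scale=d ** -0.5, size=(d, d))

    def block(tokens):
        # Crude substitute for attention: every token sees the sequence mean.
        context = tokens.mean(axis=0, keepdims=True)
        return np.tanh((tokens + context) @ w)

    return block

def progressive_forward(x0, prompts, blocks, alpha=0.1):
    """Propagate image tokens and prompts through all blocks.

    x0:      (n, d) image tokens, [CLS] at index 0
    prompts: list of (m, d) learnable prompts, one per layer
    blocks:  list of callables mapping (n + m, d) -> (n + m, d)
    """
    n = x0.shape[0]
    x, o_prev = x0, None
    for p_l, block in zip(prompts, blocks):
        # Layer 1 uses its prompt as-is; deeper layers mix in the previous prompt output.
        p_in = p_l if o_prev is None else (1.0 - alpha) * p_l + alpha * o_prev
        out = block(np.concatenate([x, p_in], axis=0))
        x, o_prev = out[:n], out[n:]
    return x[0]  # assumed pooling: the [CLS] token

n, m, d, num_layers = 5, 3, 8, 4
rng = np.random.default_rng(0)
x0 = rng.normal(size=(n, d))
prompts = [rng.normal(size=(m, d)) for _ in range(num_layers)]
blocks = [toy_block(seed, d) for seed in range(num_layers)]
feature = progressive_forward(x0, prompts, blocks, alpha=0.1)
```

In a real setup the blocks would be the frozen CLIP layers and only the entries of `prompts` would receive gradients.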

3. Contrastive Feature Re-formation

To address potential deviation from the original CLIP embedding distribution and to promote generalization, ProVP-Ref employs a contrastive feature re-formation loss. For minibatch $\mathcal{B} = \{x_i\}_{i=1}^{B}$, let $f_i^{c}$ be the original frozen CLIP feature, and $f_i^{p}$ the adapted (prompted) feature. The reforming loss for a single sample is:

$$\mathcal{L}_{\mathrm{Ref}}^{(i)} = -\log \frac{\exp\big(\cos(f_i^{p}, f_i^{c})\big)}{\sum_{j=1}^{B} \exp\big(\cos(f_i^{p}, f_j^{c})\big)}$$

(optionally temperature-scaled). This term encourages the learned feature to remain close to the reference embedding for the same image, mitigating generalization collapse caused by unbounded prompt adaptation.
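A minimal NumPy sketch of such a loss follows. It assumes an InfoNCE-style form in which each adapted feature is contrasted against the frozen features of every image in the minibatch; the function name `reformation_loss` and the `temperature` argument are illustrative, not taken from the paper's code.

```python
import numpy as np

def reformation_loss(adapted, frozen, temperature=1.0):
    """Pull each adapted feature toward the frozen feature of the same image.

    adapted, frozen: (B, d) arrays; row i of both comes from image i.
    """
    a = adapted / np.linalg.norm(adapted, axis=1, keepdims=True)
    f = frozen / np.linalg.norm(frozen, axis=1, keepdims=True)
    logits = (a @ f.T) / temperature                      # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                    # positives sit on the diagonal

rng = np.random.default_rng(0)
frozen = rng.normal(size=(6, 16))
aligned = reformation_loss(frozen, frozen, temperature=0.1)
mismatched = reformation_loss(frozen[::-1], frozen, temperature=0.1)
```

Features that coincide with their frozen references give a lower loss than features paired with the wrong image, which is the behaviour the constraint is meant to enforce.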

The overall optimization objective combines the standard cross-entropy loss over cosine similarities with the contrastive re-formation loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathcal{L}_{\mathrm{Ref}}$$

where the weight $\lambda$ is set per dataset (one value for large datasets and another for the others).
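The combined objective might be assembled as in the sketch below. The names `total_loss`, `lam`, and `logit_scale` are illustrative; the value `lam = 0.5` is arbitrary and not taken from the paper, and `logit_scale = 100.0` is an assumption standing in for CLIP's scaling of cosine similarities.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def total_loss(adapted, frozen, text, labels, lam, logit_scale=100.0):
    """Cross-entropy over image-text cosine similarities plus a weighted re-formation term."""
    a, f, t = normalize(adapted), normalize(frozen), normalize(text)
    rows = np.arange(len(labels))
    # Classification: each image is scored against every class text embedding.
    ce = -log_softmax(logit_scale * (a @ t.T))[rows, labels].mean()
    # Re-formation: each adapted feature should match its own frozen feature.
    ref = -np.diag(log_softmax(a @ f.T)).mean()
    return ce + lam * ref, ce, ref

rng = np.random.default_rng(0)
frozen = rng.normal(size=(4, 16))                    # frozen CLIP image features
adapted = frozen + 0.1 * rng.normal(size=(4, 16))    # prompted image features
text = rng.normal(size=(3, 16))                      # one text embedding per class
labels = np.array([0, 2, 1, 0])
lam = 0.5  # arbitrary illustrative weight
total, ce, ref = total_loss(adapted, frozen, text, labels, lam)
```

Setting `lam` to zero recovers plain cross-entropy prompt tuning, which makes the role of the extra term easy to isolate in ablations.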

4. Training Schedule and Implementation

Experiments are conducted on 11 image benchmarks, including ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, and UCF101. The training regime varies across settings:

  • Few-shot: $K$ shots per class (from 1 to 16), with 50–200 epochs (fewer for lower shot counts).
  • Base-to-novel: 16-shot training on base classes; tested on both base and novel, using 100 epochs.
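For readers unfamiliar with the few-shot protocol, drawing a fixed number of labeled examples per class can be sketched as follows. The helper `sample_few_shot` is hypothetical and only illustrates the idea; the actual splits used in the experiments are not specified here.

```python
import random
from collections import defaultdict

def sample_few_shot(labels, k, seed=0):
    """Return sorted indices of up to k randomly chosen examples per class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    chosen = []
    for y in sorted(by_class):
        idxs = by_class[y]
        # Classes with fewer than k examples contribute everything they have.
        chosen.extend(rng.sample(idxs, min(k, len(idxs))))
    return sorted(chosen)

labels = [0, 0, 0, 1, 1, 2]
picked = sample_few_shot(labels, k=2, seed=0)
```

Fixing the seed keeps the sampled subset reproducible across runs, which matters when comparing methods at very low shot counts.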

Prompts are composed of 50 tokens per layer in few-shot and 16 per layer in base-to-novel settings. Optimization uses SGD (batch size 32, weight decay chosen from a range of values, learning rates up to 5.0 for select datasets). All adaptation occurs atop a frozen CLIP image encoder; the text encoder is fixed, with hand-crafted class descriptions.
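To give a sense of scale, the number of learnable prompt parameters can be estimated with a few lines of arithmetic. This assumes prompts are inserted in all 12 layers of ViT-B/16 with a token dimension of 768; the helper name `prompt_param_count` is illustrative.

```python
def prompt_param_count(num_layers, tokens_per_layer, dim):
    """Total learnable prompt parameters: one (tokens x dim) matrix per layer."""
    return num_layers * tokens_per_layer * dim

VIT_B16_LAYERS = 12   # depth of ViT-B/16
VIT_B16_WIDTH = 768   # token dimension of ViT-B/16

few_shot_params = prompt_param_count(VIT_B16_LAYERS, 50, VIT_B16_WIDTH)
base_to_novel_params = prompt_param_count(VIT_B16_LAYERS, 16, VIT_B16_WIDTH)
```

Under these assumptions the few-shot configuration trains 460,800 prompt parameters and the base-to-novel configuration 147,456, both small relative to the frozen backbone.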

5. Experimental Results and Comparative Performance

ProVP-Ref achieves state-of-the-art results on both few-shot and base-to-novel image classification tasks. For example, on the 11-dataset average (16-shot, few-shot setting):

| Method | Avg. Accuracy (%) |
|---|---|
| CLIP zero-shot | 75.46 |
| CoOp | 80.24 |
| ProGrad | 79.84 |
| VPT-Deep | 82.59 |
| ProVP | 82.96 |
| ProVP-Ref | 83.07 |

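On base-to-novel splits (base accuracy, novel accuracy, and their harmonic mean $H$, averaged over 11 datasets, 16-shot):

| Method | Base | Novel | $H$ |
|---|---|---|---|
| CLIP | 69.3 | 74.2 | 71.7 |
| CoOp | 82.7 | 63.2 | 71.7 |
| CoCoOp | 80.5 | 71.7 | 75.8 |
| ProGrad | 81.9 | 71.8 | 76.5 |
| ProVP | 85.1 | 69.6 | 76.6 |
| ProVP-Ref | 85.2 | 73.2 | 78.8 |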
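The last column of the base-to-novel table is the harmonic mean of the base and novel accuracies, which penalizes methods that trade one for the other. The short check below recomputes it from the rounded table entries; small discrepancies (within 0.1) are expected because the published values were presumably computed from unrounded accuracies.

```python
def harmonic_mean(base, novel):
    """Harmonic mean of base-class and novel-class accuracy."""
    return 2.0 * base * novel / (base + novel)

# (base, novel, reported harmonic mean) copied from the table above.
rows = {
    "CLIP": (69.3, 74.2, 71.7),
    "CoOp": (82.7, 63.2, 71.7),
    "CoCoOp": (80.5, 71.7, 75.8),
    "ProGrad": (81.9, 71.8, 76.5),
    "ProVP": (85.1, 69.6, 76.6),
    "ProVP-Ref": (85.2, 73.2, 78.8),
}

recomputed = {name: harmonic_mean(b, n) for name, (b, n, _) in rows.items()}
max_gap = max(abs(recomputed[name] - rows[name][2]) for name in rows)
```

CoOp illustrates the penalty: its base accuracy is far above zero-shot CLIP, yet the drop on novel classes leaves its harmonic mean no better than CLIP's.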

ProVP-Ref surpasses both text- and visual-prompt competitors (e.g., VPT-Deep, CoOp, ProGrad), with especially strong margins on datasets with substantial domain shift (e.g., EuroSAT, FGVCAircraft, UCF101, DTD).

6. Ablation Analysis and Key Insights

Ablation studies affirm that progressive visual prompt propagation yields significant benefits in accuracy and training stability relative to non-progressive visual prompt tuning (VPT). For example, in the 1-shot few-shot regime, VPT-Deep achieves 67.2% while ProVP attains 70.9% (+3.7 percentage points). Similarly, in the base-to-novel evaluation, ProVP-Ref delivers a 3.6-point accuracy gain over VPT-Deep on novel classes.

Prompt propagation into deeper layers consistently produces larger gains compared to shallow prompt insertion. The decay hyperparameter $\alpha$ is best set to 0.1; increasing $\alpha$ to 1 (greater reliance on previous prompt output) degrades performance. The contrastive feature re-formation loss $\mathcal{L}_{\mathrm{Ref}}$ monotonically improves novel-class accuracy as its weight $\lambda$ increases.

Parameter efficiency is observed: even controlling for total prompt parameters, ProVP-Ref outperforms text-only prompt baselines, indicating the effectiveness of learned visual prompt structures and cross-layer information flow.

7. Significance, Limitations, and Future Research Directions

ProVP-Ref establishes visual prompt tuning as an effective means of adapting vision transformers to both data-scarce and domain-shifted tasks. Its progressive and instance-adaptive prompt structure fosters both robustness and flexibility. The contrastive re-formation constraint is crucial for maintaining distributional proximity to pre-trained embeddings, enhancing generalization to unseen classes.

Notable limitations include performance bottlenecks imposed by a fixed text encoder in settings with high class counts. Substituting hand-crafted class prompts with learned textual prompts partially alleviates such bottlenecks.

Areas for further research include joint, interactive training of visual and textual prompts, and development of alternative embedding-space constraints for enhanced stability and transferability (Xu et al., 2023).
