
Iterative Label Mapping for Visual Prompting

Updated 27 April 2026
  • The paper introduces ILM-VP, which iteratively optimizes visual prompts and label mapping, significantly improving transfer accuracy over static approaches.
  • It employs a bi-level optimization strategy that alternates between updating input-space perturbations and dynamically reassigning source-target class pairs.
  • Empirical results demonstrate 3–8 percentage point gains on diverse datasets while maintaining a parameter-efficient adaptation framework.

Iterative Label Mapping-based Visual Prompting (ILM-VP) is a framework designed to enhance the effectiveness of visual prompting (VP) for transfer learning in vision tasks. Visual prompting reprograms a fixed, pre-trained source model to solve new downstream tasks by optimizing universal input-space perturbations, termed “visual prompts”, and, crucially, by specifying a label mapping between source and target class sets. ILM-VP introduces an iterative, bi-level optimization strategy that alternates prompt refinement with dynamic label remapping, improving transfer accuracy over static or random mapping approaches and extending to vision-language models such as CLIP (Chen et al., 2022).

1. Problem Setting and Formalization

Given a fixed pre-trained “source” classifier $f: \mathbb{R}^d \to \{1,\dots,|S|\}$ (e.g., ResNet-18 trained on ImageNet) and a “target” dataset $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^N$ with $x_i \in \mathbb{R}^{d'}$ and labels $y_i \in \{1,\dots,|T|\}$, the goal is to adapt $f$ to new tasks without fine-tuning. VP seeks a universal perturbation $\delta \in \mathbb{R}^d$ applied to all target images, embedding $x$ into the source input domain as $x'(\delta) = h(x, \delta) \in \mathbb{R}^d$, combined with a label mapping $\pi: S \to T$ that pairs each target class with a distinct source class. For a given $(x_i, y_i)$, the overall prediction is $\hat{y}_i = \pi(f(x_i'(\delta)))$ where $x_i'(\delta) = h(x_i, \delta)$. Training optimizes

$$\min_{\delta} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(p(\cdot \mid x_i'(\delta)),\, \pi^{-1}(y_i)\big)$$

where $\ell$ is typically cross-entropy loss, $p(\cdot \mid x)$ is the source model's softmax output, and $\pi^{-1}(y_i)$ is the source class paired with $y_i$.
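To make the embedding function $h(x, \delta)$ concrete, here is a minimal sketch of one common choice: place the target image in the center of a larger canvas and let the perturbation fill the border. The function name `apply_prompt`, the `pad` parameter, and the grayscale list-of-rows image format are assumptions for illustration; the text above does not fix a specific form for $h$.

```python
def apply_prompt(x, delta, pad):
    """Embed a small target image into the source input size and add the prompt.

    x     : target image as a list of rows (grayscale floats), size h x w
    delta : universal perturbation, size (h + 2*pad) x (w + 2*pad)
    pad   : border width in pixels (hypothetical parameter)
    """
    h, w = len(x), len(x[0])
    H, W = h + 2 * pad, w + 2 * pad
    assert len(delta) == H and len(delta[0]) == W, "delta must match source size"
    out = [[0.0] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            inside = pad <= r < pad + h and pad <= c < pad + w
            # Interior pixels come from the image; the border carries the prompt.
            out[r][c] = x[r - pad][c - pad] if inside else delta[r][c]
    return out


# Usage: a 2x2 image embedded in a 4x4 source input with a constant border prompt.
image = [[1.0, 2.0], [3.0, 4.0]]
prompt = [[0.5] * 4 for _ in range(4)]
prompted = apply_prompt(image, prompt, pad=1)
```

In this variant only the border entries of `delta` affect the output, so the same perturbation can be shared by every target image.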

2. Label Mapping: Definitions and Metrics

A label mapping $\pi$ is a one-to-one function from source to target classes. Two key metrics assess mapping quality:

  • Mapping Precision: The fraction of target classes for which the “correct” source partner is mapped back, given ground-truth alignment $\pi^*$.

$$\mathrm{Precision}(\pi) = \frac{1}{|T|} \sum_{t=1}^{|T|} \mathbb{1}\big[\pi^{-1}(t) = (\pi^*)^{-1}(t)\big]$$

High precision implies few mismatches.

  • Mapping Explanation: Average log-probability assigned by the source model to the mapped source label, over all prompted images of each target class.

$$\mathrm{Explanation}(t) = \frac{1}{|\{i : y_i = t\}|} \sum_{i:\, y_i = t} \log p\big(s_t \mid x_i'(\delta)\big)$$

where $p(\cdot \mid x)$ is the softmax output, and $s_t$ is the unique source class with $\pi(s_t) = t$.

Both metrics empirically correlate strongly with VP target accuracy. Poor mappings—e.g., aligning “daisy” in Flowers102 to a visually dissimilar or semantically unrelated source class—degrade VP effectiveness.
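The two metrics can be computed directly from a mapping and the source model's outputs. The sketch below is illustrative only: the function names are hypothetical, mappings are represented as dictionaries from target class to source class, and the softmax vectors are assumed to be given.

```python
import math


def mapping_precision(mapping, reference):
    """Fraction of target classes whose mapped source class matches the reference.

    mapping, reference : dicts of the form target_class -> source_class
    """
    hits = sum(1 for t in reference if mapping.get(t) == reference[t])
    return hits / len(reference)


def mapping_explanation(probs, labels, mapping):
    """Per-class average log-probability of the mapped source class.

    probs  : list of softmax vectors, one per prompted image
    labels : target label of each image
    """
    sums, counts = {}, {}
    for p, t in zip(probs, labels):
        sums[t] = sums.get(t, 0.0) + math.log(p[mapping[t]])
        counts[t] = counts.get(t, 0) + 1
    return {t: sums[t] / counts[t] for t in sums}


# Usage with two target classes and three source classes.
precision = mapping_precision({0: 2, 1: 0}, {0: 2, 1: 1})
explanation = mapping_explanation(
    [[0.1, 0.1, 0.8], [0.5, 0.25, 0.25]], [0, 1], {0: 2, 1: 0}
)
```

Values of the explanation score closer to zero indicate that the source model assigns high probability to the mapped class.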

3. The ILM-VP Framework and Bi-Level Optimization

ILM-VP abandons pre-fixed mappings in favor of an alternating optimization paradigm. It alternately updates the visual prompt $\delta$ and re-aligns the mapping $\pi$ over training epochs.

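Bi-level formulation:

  • Upper-level (prompt update):

$$\min_{\delta} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(p(\cdot \mid x_i'(\delta)),\, \pi_\delta^{-1}(y_i)\big)$$

  • Lower-level (mapping update):

$$\pi_\delta \in \arg\min_{\pi\ \text{one-to-one}} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(p(\cdot \mid x_i'(\delta)),\, \pi^{-1}(y_i)\big)$$

Here $\pi_\delta$ denotes the mapping induced by the current prompt $\delta$.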

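Practically, after $K$ epochs of prompt SGD, the mapping is recomputed, per target class, by frequency or average likelihood from prompted images. This iterative process is described in the following pseudocode:

    Initialize the prompt δ and the mapping π
    repeat until training ends:
        for K epochs:
            update δ by SGD on the loss of the prompted images under the current π
        for each target class t:
            re-assign the source class paired with t, by frequency or average
            likelihood over the prompted images of class t
    return δ, π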

This synergy between mapping refinement and prompt optimization yields more effective and interpretable mappings, as the two components reinforce each other: improved $\pi$ enhances the prompt learning signal, while a stronger $\delta$ refines class alignment.
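The frequency-based mapping update can be sketched as a greedy assignment over prediction counts. This is one plausible reading of "by frequency", not a confirmed reproduction of the method: the function name, the greedy order, and the tie-breaking rule are assumptions, and the number of source classes is assumed to be at least the number of target classes.

```python
def frequency_label_mapping(pred_source, labels, num_source, num_target):
    """Greedy one-to-one mapping from prediction frequencies.

    pred_source : predicted source class of each prompted target image
    labels      : target label of each image
    Returns a dict target_class -> source_class.
    """
    # counts[t][s] = how often images of target class t are predicted as source s
    counts = [[0] * num_source for _ in range(num_target)]
    for s, t in zip(pred_source, labels):
        counts[t][s] += 1
    # Visit (count, target, source) triples from most to least frequent.
    triples = sorted(
        ((counts[t][s], t, s) for t in range(num_target) for s in range(num_source)),
        key=lambda item: -item[0],
    )
    mapping, used = {}, set()
    for _, t, s in triples:
        # Each target class gets one source class; each source class is used once.
        if t not in mapping and s not in used:
            mapping[t] = s
            used.add(s)
    return mapping


# Usage: both target classes are most often predicted as source class 3,
# so the second one falls back to its next most frequent free source class.
mapping = frequency_label_mapping(
    pred_source=[3, 3, 1, 3, 3, 0], labels=[0, 0, 0, 1, 1, 1],
    num_source=4, num_target=2,
)
```

A globally optimal assignment (for example via the Hungarian algorithm) would be an alternative design; the greedy version is shown because it is short and easy to follow.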

4. Extension to CLIP and Vision-Language Models

For CLIP, which pairs image and text encoders with a contrastive objective, the iterative mapping procedure incorporates text prompt selection per class. With $m$ candidate text templates $\{c_j\}_{j=1}^{m}$ and $|T|$ classes, each label–template pair $(t, c_j)$ forms a “virtual” source token in a set $S$ of size $m \cdot |T|$. The approach proceeds as follows:

  1. Prompt Update: Optimize $\delta$ via SGD to minimize the cross-entropy between the prediction for each prompted image $x_i'(\delta)$ and the top-scoring text token, where scores are cosine similarities of image and text features.
  2. Text-Prompt Mapping: For each target class $t$, select the text template $c_{j^*(t)}$ maximizing the average similarity on all samples with $y_i = t$.
  3. Iteration: Repeat the prompt update and mapping steps.

This iterative text prompt plus label mapping (TP+LM) strategy achieves substantial improvements over fixed template prompting, as evidenced by reported experimental gains.
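The per-class template selection in step 2 can be sketched as follows. The similarity tensor is assumed to be precomputed from image and text features (no CLIP API is called here), each class is assumed to have at least one sample, and the name `select_templates` is hypothetical.

```python
def select_templates(sim, labels, num_classes, num_templates):
    """Pick, for each target class, the template with the highest mean similarity.

    sim[i][t][j] : cosine similarity between prompted image i and the text
                   embedding of class t rendered with template j
    labels[i]    : target label of image i
    Returns a list best[t] = index of the chosen template for class t.
    """
    best = []
    for t in range(num_classes):
        members = [i for i, y in enumerate(labels) if y == t]
        # Average similarity of each template over the samples of class t.
        means = [
            sum(sim[i][t][j] for i in members) / len(members)
            for j in range(num_templates)
        ]
        best.append(max(range(num_templates), key=lambda j: means[j]))
    return best


# Usage: two images, two classes, two candidate templates.
sim = [
    [[0.2, 0.6], [0.1, 0.1]],  # image 0 (class 0)
    [[0.3, 0.3], [0.7, 0.4]],  # image 1 (class 1)
]
chosen = select_templates(sim, labels=[0, 1], num_classes=2, num_templates=2)
```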

5. Experimental Results and Comparative Analysis

Empirical validation spans 13 diverse target datasets (including Flowers102, CIFAR-10/100, DTD, Food101, GTSRB, and ABIDE) and multiple source models (ResNet-18, ResNet-50, ResNeXt-101). Evaluation baselines include:

  • RLM-VP: random one-to-one mapping
  • FLM-VP: fixed frequency mapping before prompting
  • LP: linear probe on source features
  • FF: full fine-tuning
  • VP+TP: CLIP with fixed text prompt

Selected results (ResNet-18 → Flowers102):

Method    Accuracy (%)
RLM-VP    ~11.0
FLM-VP    ~20.0
ILM-VP    ~27.9
LP        ~88.0
FF        ~97.1

For CLIP-based VP on Flowers102:

  • VP+TP (single prompt): 70.0%
  • VP+TP+LM (iterative text + label map): 83.7%

ILM-VP achieves consistent improvements (3–8 percentage point average gain) over RLM-VP and FLM-VP across all source/target pairs while preserving a parameter-efficient footprint: only the prompt $\delta$ is learned, with no update to backbone network parameters. On CLIP, iterative mapping similarly produces gains of 5–15 percentage points over fixed strategies.

6. Practical Considerations and Future Directions

Key insights from the analysis and experimentation include:

  • High-quality label mapping is critical for effective VP; poor mapping can significantly impede transfer performance.
  • Static or pre-prompt mappings (even frequency-based) are consistently inferior to iterative, dynamic alignment.
  • Joint, bi-level optimization of prompt perturbations and label mapping leads to both greater accuracy and more semantically interpretable correspondences between domains.
  • The ILM-VP paradigm is naturally extensible to vision-language architectures; treating each (template, label) tuple as a virtual source enables per-class template selection and further boosts transfer.
  • Practical recommendations include initializing the mapping randomly or by frequency, alternating between prompt and mapping updates (10–50 prompt SGD steps per mapping re-estimation), and monitoring not only the primary loss but also the mapping precision and explanation metrics.
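The alternating schedule from the last recommendation can be illustrated with a toy example. Everything here is an assumption made for illustration: the "source model" is a small linear softmax classifier with weights `W`, the prompt is additive ($x' = x + \delta$), updates use full-batch gradient steps rather than SGD, and the mapping is initialized and re-estimated by greedy prediction frequency.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def probs(W, x, delta):
    # Additive prompt x' = x + delta (an assumption made for this toy example).
    xp = [a + b for a, b in zip(x, delta)]
    return softmax([sum(w * v for w, v in zip(row, xp)) for row in W])


def remap(W, X, Y, delta, num_target):
    # Frequency-based greedy one-to-one re-assignment of source classes.
    num_source = len(W)
    counts = [[0] * num_source for _ in range(num_target)]
    for x, y in zip(X, Y):
        p = probs(W, x, delta)
        counts[y][p.index(max(p))] += 1
    pairs = sorted(
        ((t, s) for t in range(num_target) for s in range(num_source)),
        key=lambda ts: -counts[ts[0]][ts[1]],
    )
    mapping, used = {}, set()
    for t, s in pairs:
        if t not in mapping and s not in used:
            mapping[t] = s
            used.add(s)
    return mapping


def ilm_vp_toy(W, X, Y, num_target, rounds=3, steps=20, lr=0.1):
    d = len(X[0])
    delta = [0.0] * d
    mapping = remap(W, X, Y, delta, num_target)      # frequency initialization
    for _ in range(rounds):
        for _ in range(steps):                       # prompt update
            grad = [0.0] * d
            for x, y in zip(X, Y):
                p = probs(W, x, delta)
                for k, row in enumerate(W):
                    # d(cross-entropy)/d(logit k) = p_k - 1[k == mapped class]
                    coeff = p[k] - (1.0 if k == mapping[y] else 0.0)
                    for j in range(d):
                        grad[j] += coeff * row[j] / len(X)
            delta = [v - lr * g for v, g in zip(delta, grad)]
        mapping = remap(W, X, Y, delta, num_target)  # mapping update
    return delta, mapping


# Usage: 3 source classes, 2 input features, 2 target classes.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
X = [[2.0, 0.1], [1.5, -0.2], [0.0, 1.8], [0.3, 2.2]]
Y = [0, 0, 1, 1]
delta, mapping = ilm_vp_toy(W, X, Y, num_target=2)
```

In a real setting the inner loop would be minibatch SGD through a frozen network, and the mapping precision and explanation metrics could be logged after each `remap` call.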

Potential extensions noted include partial reprogramming of intermediate model layers (“multilayer prompts”), graph-based or semantically informed matching criteria, and adversarially robust mapping tailored to noisy target domains (Chen et al., 2022).

7. Context and Significance

ILM-VP addresses a previously neglected consideration in visual prompting: the dynamic relationship between label mapping and prompt effectiveness. By formally quantifying mapping quality, introducing the explanation metric, and demonstrating empirical gains across models and tasks, ILM-VP establishes label mapping as a central component of visual reprogramming. The framework narrows the performance gap to full fine-tuning at orders-of-magnitude lower adaptation cost, although the gap remains large in some settings (~27.9% versus ~97.1% for ResNet-18 → Flowers102), and it reveals promising directions for efficient transfer learning, especially in settings requiring parameter-efficient adaptation of large vision models (Chen et al., 2022).

References (1)
