Iterative Label Mapping for Visual Prompting
- The paper introduces ILM-VP, which iteratively optimizes visual prompts and label mapping, significantly improving transfer accuracy over static approaches.
- It employs a bi-level optimization strategy that alternates between updating input-space perturbations and dynamically reassigning source-target class pairs.
- Empirical results demonstrate 3–8 percentage point gains on diverse datasets while maintaining a parameter-efficient adaptation framework.
Iterative Label Mapping-based Visual Prompting (ILM-VP) is a framework designed to enhance the effectiveness of visual prompting (VP) for transfer learning in vision tasks. Visual prompting reprograms a fixed, pre-trained source model to solve new downstream tasks by optimizing universal input-space perturbations, termed “visual prompts”, and, crucially, by specifying a label mapping between the source and target class sets. ILM-VP introduces an iterative, bi-level optimization strategy that alternates prompt refinement with dynamic label remapping. This improves transfer accuracy over static or random mapping approaches and extends to vision-language models such as CLIP (Chen et al., 2022).
1. Problem Setting and Formalization
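Given a fixed pre-trained “source” classifier $f_\theta$ (e.g., ResNet-18 trained on ImageNet) with label set $\mathcal{Y}_s$, and a “target” dataset $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{N}$ with images $x_i$ and labels $y_i \in \mathcal{Y}_t$, the goal is to adapt $f_\theta$ to new tasks without fine-tuning. VP seeks a universal perturbation $\delta$ applied to all target images, embedding each $x$ into the source input domain as $x'(\delta) = h(x, \delta)$, combined with a label mapping $\pi$ pairing source and target classes injectively. For a given $(\delta, \pi)$, the overall prediction is $\hat{y}(x) = \pi(\hat{s}(x))$, where $\hat{s}(x) = \arg\max_{s \in \mathrm{dom}(\pi)} [f_\theta(x'(\delta))]_s$. Training optimizes
$$\min_{\delta} \sum_{(x, y) \in \mathcal{D}_t} \ell\big(f_\theta(x'(\delta)),\ \pi^{-1}(y)\big)$$
where $\ell$ is typically cross-entropy loss.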
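The two ingredients above can be sketched as follows. This is a minimal illustration, assuming the prompt is a learnable border padded around a centred target image (one common choice; the exact embedding function is not specified here). The names `apply_prompt`, `predict_target`, and `source_to_target` are hypothetical, and images are plain nested lists so the example has no dependencies.

```python
from typing import Dict, List

Image = List[List[float]]  # single-channel image as rows of pixel values


def apply_prompt(x: Image, delta: Image, pad: int) -> Image:
    """Place x at the centre of a larger canvas and fill the border with delta.

    delta has the size of the full canvas; only its border region is used,
    so the target image itself stays unperturbed (an assumption of this sketch).
    """
    h, w = len(x), len(x[0])
    H, W = h + 2 * pad, w + 2 * pad
    assert len(delta) == H and len(delta[0]) == W
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            inside = pad <= i < pad + h and pad <= j < pad + w
            out[i][j] = x[i - pad][j - pad] if inside else delta[i][j]
    return out


def predict_target(source_logits: List[float], source_to_target: Dict[int, int]) -> int:
    """Return the target class whose mapped source class has the highest logit.

    source_to_target is a one-to-one dict: source class index -> target class index.
    Source classes that are not mapped are ignored.
    """
    best_source = max(source_to_target, key=lambda s: source_logits[s])
    return source_to_target[best_source]


# Toy usage: a 2x2 image padded to 4x4, and a 4-class source model
# whose classes 1 and 3 are mapped to target classes 0 and 1.
x = [[1.0, 2.0], [3.0, 4.0]]
delta = [[0.5] * 4 for _ in range(4)]
prompted = apply_prompt(x, delta, pad=1)
label = predict_target([0.1, 2.0, 0.3, 1.5], {1: 0, 3: 1})
```

In practice the source logits would come from the frozen source model evaluated on `prompted`, and only `delta` would receive gradient updates.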
2. Label Mapping: Definitions and Metrics
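A label mapping $\pi$ is a one-to-one function from source to target classes. Two key metrics assess mapping quality:
- Mapping Precision: The fraction of target classes whose mapped source class matches the “correct” source partner under a ground-truth alignment $\pi^\star$.
$$\mathrm{Prec}(\pi) = \frac{1}{|\mathcal{Y}_t|} \sum_{t \in \mathcal{Y}_t} \mathbb{1}\big[\pi^{-1}(t) = (\pi^\star)^{-1}(t)\big]$$
High precision implies few mismatches.
- Mapping Explanation: Average log-probability assigned by the source model to the mapped source label, over all prompted images of each target class.
$$\mathrm{Expl}(\pi, \delta) = \frac{1}{|\mathcal{Y}_t|} \sum_{t \in \mathcal{Y}_t} \frac{1}{|\mathcal{X}_t|} \sum_{x \in \mathcal{X}_t} \log p_\theta\big(\pi^{-1}(t) \mid x'(\delta)\big)$$
where $\mathcal{Y}_t$ is the set of target classes, $\mathcal{X}_t$ is the set of target images with label $t$, $x'(\delta)$ is the prompted image, $p_\theta$ is the softmax output of the source model, and $\pi^{-1}(t)$ is the unique source class $s$ with $\pi(s) = t$.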
Both metrics empirically correlate strongly with VP target accuracy. Poor mappings—e.g., aligning “daisy” in Flowers102 to a visually dissimilar or semantically unrelated source class—degrade VP effectiveness.
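The two metrics can be computed along the following lines. This is a sketch under stated assumptions: mappings are stored in the inverse direction (target class to source class) for convenience, a reference alignment is available for the precision metric, and the function and argument names are illustrative rather than taken from the paper's code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mapping_precision(target_to_source: Dict[int, int], reference: Dict[int, int]) -> float:
    """Fraction of target classes whose mapped source class matches the reference.

    Both dicts send target class -> source class.
    """
    hits = sum(1 for t, s in reference.items() if target_to_source.get(t) == s)
    return hits / len(reference)


def mapping_explanation(
    samples: Sequence[Tuple[List[float], int]], target_to_source: Dict[int, int]
) -> float:
    """Mean over target classes of the average log-probability of the mapped source class.

    samples holds (softmax output over source classes, target label) pairs
    computed on prompted images.
    """
    per_class: Dict[int, List[float]] = {}
    for probs, t in samples:
        per_class.setdefault(t, []).append(math.log(probs[target_to_source[t]]))
    return sum(sum(v) / len(v) for v in per_class.values()) / len(per_class)


# Toy usage with two target classes and a two-class source model.
precision = mapping_precision({0: 5, 1: 7}, {0: 5, 1: 8})
explanation = mapping_explanation(
    [([0.5, 0.5], 0), ([0.25, 0.75], 1)], {0: 0, 1: 0}
)
```

Values of the explanation score closer to zero indicate that the source model assigns higher probability to the mapped source classes.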
3. The ILM-VP Framework and Bi-Level Optimization
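ILM-VP abandons pre-fixed mappings in favor of an alternating optimization paradigm. It alternately updates the visual prompt $\delta$ and re-aligns the mapping $\pi$ over training epochs.
Bi-level formulation (notation as in Sections 1 and 2):
- Upper-level (prompt learning):
$$\min_{\delta} \sum_{(x, y) \in \mathcal{D}_t} \ell\big(f_\theta(x'(\delta)),\ \pi^{-1}(y)\big)$$
- Lower-level (mapping update):
$$\pi^{-1}(t) = \arg\max_{s \in \mathcal{Y}_s} \frac{1}{|\mathcal{X}_t|} \sum_{x \in \mathcal{X}_t} p_\theta\big(s \mid x'(\delta)\big) \quad \text{for each } t \in \mathcal{Y}_t, \text{ subject to } \pi \text{ being one-to-one}$$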
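Practically, after $K$ epochs of prompt SGD, the mapping is recomputed per target class, by frequency or average likelihood of the source predictions on prompted images. The iterative process can be summarized as:
- Initialize the prompt $\delta$ and the mapping $\pi$ (randomly or by frequency).
- Repeat until the training budget is exhausted: run $K$ epochs of SGD on $\delta$ with $\pi$ fixed; then, with $\delta$ fixed, recompute $\pi$ from the source model's predictions on the prompted target images.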
This synergy between mapping refinement and prompt optimization yields more effective and interpretable mappings, as the two components reinforce each other: an improved mapping $\pi$ enhances the prompt learning signal, while a stronger prompt $\delta$ refines class alignment.
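The alternating procedure can be sketched as below. This is an illustration under assumptions, not the authors' implementation: conflicts between target classes are resolved greedily by co-occurrence count, and `train_prompt` and `predict_sources` are hypothetical callables standing in for prompt SGD and for running the frozen source model on prompted images.

```python
from typing import Callable, Dict, Sequence, Tuple


def frequency_label_mapping(predictions: Sequence[Tuple[int, int]]) -> Dict[int, int]:
    """Greedy one-to-one mapping: target class -> source class.

    predictions holds (predicted source class, target label) pairs for prompted
    images. Pairs are assigned in order of decreasing co-occurrence count.
    A target class stays unmapped if all source classes it co-occurs with are taken.
    """
    counts: Dict[Tuple[int, int], int] = {}
    for s, t in predictions:
        counts[(s, t)] = counts.get((s, t), 0) + 1
    mapping: Dict[int, int] = {}
    used_sources = set()
    # Highest count first; ties broken by (source, target) index for determinism.
    for (s, t), _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if t not in mapping and s not in used_sources:
            mapping[t] = s
            used_sources.add(s)
    return mapping


def ilm_vp(
    train_prompt: Callable[[Dict[int, int]], None],
    predict_sources: Callable[[], Sequence[Tuple[int, int]]],
    init_mapping: Dict[int, int],
    rounds: int,
) -> Dict[int, int]:
    """Alternate prompt updates (upper level) and mapping updates (lower level)."""
    mapping = dict(init_mapping)
    for _ in range(rounds):
        train_prompt(mapping)  # hypothetical: a few epochs of SGD on the prompt
        mapping = frequency_label_mapping(predict_sources())
    return mapping


# Toy usage with stubbed callables: source class 5 dominates target class 1,
# source class 3 dominates target class 0.
preds = [(3, 0), (3, 0), (3, 1), (5, 1), (5, 1), (5, 1)]
seen = []
final = ilm_vp(lambda m: seen.append(dict(m)), lambda: preds, {0: 9, 1: 8}, rounds=2)
```

The greedy assignment is one simple way to enforce the one-to-one constraint; an optimal assignment solver could be substituted without changing the outer loop.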
4. Extension to CLIP and Vision-Language Models
For CLIP, which pairs image and text encoders with a contrastive objective, the iterative mapping procedure incorporates text prompt selection per class. With $M$ candidate text templates $\{c_m\}_{m=1}^{M}$ and $C$ classes, each label–template pair $(y, c_m)$ forms a “virtual” source token in a source label set $\mathcal{Y}_s$ of size $M \times C$. The approach proceeds as follows:
- Prompt Update: Optimize the visual prompt $\delta$ via SGD to minimize cross-entropy between the prediction for the prompted image $x'(\delta)$ and the top-scoring text token according to cosine similarity of image and text features.
- Text-Prompt Mapping: For each target class $t$, select the text template $c_m$ maximizing the average similarity on all samples with label $y = t$.
- Iteration: Repeat the prompt update and text-prompt mapping steps.
This iterative text prompt plus label mapping (TP+LM) strategy improves substantially over fixed-template prompting; on Flowers102, for example, accuracy rises from 70.0% to 83.7% (Section 5).
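The text-prompt mapping step can be sketched as follows. This is a minimal illustration assuming image and text features have already been extracted (for example by CLIP's encoders) and are given as plain lists; `select_templates` and its argument layout are hypothetical names chosen for this example.

```python
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def select_templates(
    image_feats: Sequence[Tuple[Sequence[float], int]],
    text_feats: Dict[Tuple[int, int], Sequence[float]],
) -> Dict[int, int]:
    """For each target class, pick the template index with the highest mean similarity.

    image_feats: (feature of a prompted image, target label) pairs.
    text_feats: maps (target class, template index) -> text feature.
    """
    sums: Dict[Tuple[int, int], float] = {}
    counts: Dict[int, int] = {}
    for feat, t in image_feats:
        counts[t] = counts.get(t, 0) + 1
        for (cls, m), text in text_feats.items():
            if cls == t:
                sums[(cls, m)] = sums.get((cls, m), 0.0) + cosine(feat, text)
    best: Dict[int, int] = {}
    for t, n in counts.items():
        candidates = {m: total / n for (cls, m), total in sums.items() if cls == t}
        best[t] = max(candidates, key=candidates.get)
    return best


# Toy usage: two classes, two templates each, 2-dimensional features.
image_feats = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1)]
text_feats = {
    (0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0], (1, 1): [0.0, 1.0],
}
chosen = select_templates(image_feats, text_feats)
```

After each selection round, the prompt would be trained further against the chosen templates before the selection is repeated.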
5. Experimental Results and Comparative Analysis
Empirical validation spans 13 diverse target datasets (including Flowers102, CIFAR-10/100, DTD, Food101, GTSRB, and ABIDE) and multiple source models (ResNet-18, ResNet-50, ResNeXt-101). Evaluation baselines include:
- RLM-VP: random one-to-one mapping
- FLM-VP: fixed frequency mapping before prompting
- LP: linear probe on source features
- FF: full fine-tuning
- VP+TP: CLIP with fixed text prompt
Selected results (ResNet-18 → Flowers102):
| Method | Accuracy (%) |
|---|---|
| RLM-VP | ~11.0 |
| FLM-VP | ~20.0 |
| ILM-VP | ~27.9 |
| LP | ~88.0 |
| FF | ~97.1 |
For CLIP-based VP on Flowers102:
- VP+TP (single prompt): 70.0%
- VP+TP+LM (iterative text + label map): 83.7%
ILM-VP achieves consistent improvements (3–8 percentage point average gain) over RLM-VP and FLM-VP across all source/target pairs while preserving a parameter-efficient footprint: only the prompt $\delta$ is learned, with no update to backbone network parameters. On CLIP, iterative mapping similarly produces gains of 5–15 percentage points over fixed strategies.
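As a quick check of how the Flowers102 figures quoted above relate to these ranges, the gains can be computed directly from the listed accuracies. The dictionaries below simply restate those numbers; the variable names are illustrative.

```python
# Accuracies (%) as listed above for Flowers102.
flowers102_resnet18 = {"RLM-VP": 11.0, "FLM-VP": 20.0, "ILM-VP": 27.9}
clip_flowers102 = {"VP+TP": 70.0, "VP+TP+LM": 83.7}

# Gains in percentage points, rounded to one decimal to avoid float noise.
gain_over_flm = round(flowers102_resnet18["ILM-VP"] - flowers102_resnet18["FLM-VP"], 1)
gain_over_rlm = round(flowers102_resnet18["ILM-VP"] - flowers102_resnet18["RLM-VP"], 1)
clip_gain = round(clip_flowers102["VP+TP+LM"] - clip_flowers102["VP+TP"], 1)

print(gain_over_flm, gain_over_rlm, clip_gain)  # 7.9 16.9 13.7
```

On this single dataset the gain over FLM-VP sits at the upper end of the reported 3–8 point average, and the CLIP gain falls within the reported 5–15 point range.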
6. Practical Considerations and Future Directions
Key insights from the analysis and experimentation include:
- High-quality label mapping is critical for effective VP; poor mapping can significantly impede transfer performance.
- Static or pre-prompt mappings (even frequency-based) are consistently inferior to iterative, dynamic alignment.
- Joint, bi-level optimization of prompt perturbations and label mapping leads to both greater accuracy and more semantically interpretable correspondences between domains.
- The ILM-VP paradigm is naturally extensible to vision-language architectures; treating each (template, label) tuple as a virtual source enables per-class template selection and further boosts transfer.
- Practical recommendations include initializing the mapping randomly or by frequency, alternating between prompt and mapping updates (10–50 prompt SGD steps per mapping re-estimation), and monitoring not only the primary loss but also the mapping precision and explanation metrics.
Potential extensions noted include partial reprogramming of intermediate model layers (“multilayer prompts”), graph-based or semantically informed matching criteria, and adversarially robust mapping tailored to noisy target domains (Chen et al., 2022).
7. Context and Significance
ILM-VP addresses a previously neglected consideration in visual prompting: the dynamic relationship between label mapping and prompt effectiveness. By formally quantifying mapping quality, introducing the explanation metric, and demonstrating empirical gains across models and tasks, ILM-VP establishes label mapping as a central component of visual reprogramming. The framework narrows the gap to full fine-tuning at orders-of-magnitude lower adaptation cost, although the gap can remain large (27.9% versus 97.1% on Flowers102 with ResNet-18), and it points to promising directions for efficient transfer learning, especially in settings requiring parameter-efficient adaptation of large vision models (Chen et al., 2022).