An Evaluation of Visual Prompts for Adaptation of Large-Scale Vision Models
This paper explores the concept of visual prompting as a potential method for adapting large-scale models, particularly in the domain of vision, analogous to how prompting has been employed in NLP. In the NLP setting, large pre-trained language models are adapted to new tasks by converting downstream datasets into the format used during pre-training, thereby modifying the data space without altering the model parameters. This research investigates whether analogous prompts can be applied in the visual domain, steering pre-trained vision models to perform new visual tasks.
The authors build on techniques such as prompt tuning and adversarial reprogramming, in which input perturbations (here called "visual prompts") are learned so that a frozen model, when prompted, performs a new task. The visual prompting strategy learns a single, task-specific image perturbation that is applied to every input while the model parameters remain unchanged, adapting the model to new tasks without fine-tuning hidden layers or adjusting network weights.
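The core mechanism lends itself to a short sketch. Below is a minimal, illustrative PyTorch implementation of a padding-style visual prompt, assuming 224x224 RGB inputs, a 30-pixel pad width, and a generic frozen classifier; names such as `PaddingPrompt` and `train_step` are placeholders, not the authors' code.

```python
# Minimal sketch of padding-style visual prompting (assumed details:
# 224x224 images, 30-pixel pad, generic frozen classifier).
import torch
import torch.nn as nn

class PaddingPrompt(nn.Module):
    """Learnable frame of pixels added around every input image."""
    def __init__(self, image_size=224, pad=30):
        super().__init__()
        self.pad = pad
        # Four learnable strips: top, bottom, left, right.
        self.top = nn.Parameter(torch.zeros(3, pad, image_size))
        self.bottom = nn.Parameter(torch.zeros(3, pad, image_size))
        self.left = nn.Parameter(torch.zeros(3, image_size - 2 * pad, pad))
        self.right = nn.Parameter(torch.zeros(3, image_size - 2 * pad, pad))

    def forward(self, x):
        # Assemble the full-size prompt from the four strips, zeros in the centre.
        middle = torch.cat([
            self.left,
            torch.zeros(3, x.size(2) - 2 * self.pad, x.size(3) - 2 * self.pad,
                        device=x.device),
            self.right,
        ], dim=2)
        prompt = torch.cat([self.top, middle, self.bottom], dim=1)
        return x + prompt  # additive perturbation in the input space

def train_step(model, prompt, images, labels, optimizer):
    # The optimizer is assumed to hold only prompt.parameters(), so the
    # backbone stays frozen even though gradients flow back through it.
    model.eval()
    logits = model(prompt(images))
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the same learned prompt is simply added to every test image before it is passed to the unchanged model.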
A significant portion of the paper focuses on CLIP (Contrastive Language-Image Pretraining), a vision-language model from OpenAI, and demonstrates that visual prompting achieves accuracy competitive with standard adaptation methods such as the linear probe across several datasets. Notably, the approach was robust to distribution shift, showing promise where training and testing data come from different sources, a common real-world scenario that poses challenges for conventional fine-tuning approaches.
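To make the CLIP setup concrete, the following is a rough inference sketch under assumed details: the open-source `clip` package, the ViT-B/32 checkpoint, placeholder class names and text template, and the `PaddingPrompt` module sketched earlier. The exact backbone and templates used in the paper may differ.

```python
# Rough sketch of CLIP inference on prompted images.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 so the prompt's dtype matches

class_names = ["dog", "cat", "car"]  # placeholder downstream labels
text_tokens = clip.tokenize(
    [f"This is a photo of a {c}" for c in class_names]
).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

def classify(images, prompt):
    # images: preprocessed batch (B, 3, 224, 224); prompt: a trained PaddingPrompt
    # already moved to the same device as the model.
    with torch.no_grad():
        image_features = model.encode_image(prompt(images.to(device)))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = 100.0 * image_features @ text_features.T
    return logits.argmax(dim=-1)  # predicted class indices
```

Training the prompt for CLIP follows the same pattern: the image-text similarity scores serve as logits for a cross-entropy loss, and only the prompt parameters are updated.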
Key findings include that CLIP adapted to visual prompts far better than the other vision models tested, such as ResNeXt, Big Transfer (BiT), and traditional ResNet architectures. Furthermore, visual prompts help bridge the distribution gap between pre-training datasets like ImageNet and more diverse downstream datasets by modifying the input space. Comprehensive experiments showed mixed results across datasets, offering insight into which dataset properties affect the performance of visual prompting.
The authors highlight that while this paper focuses on universal, input-agnostic prompts, future work could explore input-conditional prompting to further improve performance, particularly for datasets with high perceptual diversity. Design choices in the prompt, such as its size, location, and whether it is a patch or padding, strongly affect performance, with certain configurations showing decreased accuracy as the number of prompt parameters increased. For CLIP, even the most minimal design, a single-pixel prompt, improved classification accuracy over using no prompt, revealing interesting dynamics between prompt dimensionality and task adaptation.
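To illustrate the dimensionality comparison, here is a back-of-the-envelope calculation of prompt parameter counts, assuming 224x224 RGB inputs; the specific sizes are illustrative rather than the paper's reported configurations.

```python
# Parameter counts for the prompt designs discussed above (illustrative).
IMAGE_SIZE, CHANNELS = 224, 3

def padding_params(p):
    # Learnable frame of width p: the whole image minus the untouched centre.
    return CHANNELS * (IMAGE_SIZE ** 2 - (IMAGE_SIZE - 2 * p) ** 2)

def patch_params(s):
    # Square patch of side s (fixed or random location).
    return CHANNELS * s ** 2

print(padding_params(30))  # 69840 parameters for a 30-pixel pad
print(patch_params(30))    # 2700 parameters for a 30x30 patch
print(patch_params(1))     # 3 parameters for a single-pixel prompt
```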
A principal conclusion is that visual prompting serves as a resource-efficient and practical adaptation strategy, particularly where task-specific customization is needed but access to model internals such as weights is restricted. From a theoretical standpoint, the work presents visual prompting as a compelling perspective on large-scale model adaptation in computer vision, expanding the flexibility of deployment across diverse applications without incurring the high computational overhead typically associated with fine-tuning.
Based on these promising results, the authors suggest future directions for visual prompting beyond image classification, potentially steering pre-trained models toward other vision tasks through input-space modifications alone. Such developments could offer enhanced control and adaptability to end users without requiring access to the architectural internals of the deployed model, aligning with trends toward model efficacy and accessibility.