
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models (2209.07511v1)

Published 15 Sep 2022 in cs.CV

Abstract: Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot generalization in many downstream tasks with properly designed text prompts. Instead of relying on hand-engineered prompts, recent works learn prompts using the training data from downstream tasks. While effective, training on domain-specific data reduces a model's generalization capability to unseen new domains. In this work, we propose test-time prompt tuning (TPT), a method that can learn adaptive prompts on the fly with a single test sample. For image classification, TPT optimizes the prompt by minimizing the entropy with confidence selection so that the model has consistent predictions across different augmented views of each test sample. In evaluating generalization to natural distribution shifts, TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average, surpassing previous prompt tuning approaches that require additional task-specific training data. In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data. Project page: https://azshue.github.io/TPT.

Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models

The proliferation of pre-trained vision-language models, such as CLIP and ALIGN, has highlighted the potential for deploying robust models that handle a variety of computer vision tasks in a zero-shot manner. These models leverage vast amounts of image-text pairs during pre-training, enabling them to recognize, classify, and draw inferences without additional labeled data in the target domain. A critical component of their success is the design of effective text prompts, typically hand-crafted, which link the textual and visual inputs. However, reliance on domain-specific heuristics for prompt design can undermine generalization to unseen domains, a key limitation of these models.
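
The zero-shot pipeline these models rely on is straightforward: class names are inserted into a prompt template, encoded by the text encoder, and compared against the image embedding. The sketch below illustrates this with OpenAI's open-source `clip` package; the class names, prompt template, and image path are illustrative placeholders, not the paper's evaluation setup.

```python
# Minimal sketch of zero-shot CLIP classification with a hand-crafted prompt.
# Assumes OpenAI's open-source `clip` package (github.com/openai/CLIP);
# classes, template, and image path are illustrative only.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["dog", "cat", "car"]                      # placeholder classes
prompts = [f"a photo of a {c}." for c in class_names]    # hand-crafted template
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text_tokens)
    # Cosine similarity between the image and each class prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.T
    pred = logits.softmax(dim=-1).argmax(dim=-1)

print(class_names[pred.item()])
```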

In this context, the paper introduces Test-Time Prompt Tuning (TPT), which addresses the challenge of adapting prompts dynamically and efficiently during inference. Unlike prior works that train prompts on domain-specific data, which may inadvertently compromise a model's ability to generalize, TPT tunes prompts on the fly using only a single test sample, with no additional task-specific data or annotations.

Methodology and Technical Contribution

The proposed TPT method operates under the premise that a robust vision-language model should produce consistent predictions across varied augmented views of the same test image. To this end, TPT minimizes the marginal entropy of the prediction averaged over these augmented views, refining the prompt at test time. A critical enhancement is confidence selection, which filters out augmented views deemed unreliable (those with high prediction entropy, i.e., low confidence) so that only stable predictions influence the tuning. Because TPT optimizes only the prompt rather than the model parameters, the pre-trained representations are left untouched and the model's native zero-shot capabilities are preserved.
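
The schematic sketch below shows what one such test-time update might look like. It assumes a CLIP wrapper `clip_with_prompt(images, ctx)` that maps a batch of augmented views and learnable prompt context embeddings to class logits; the function name, keep ratio, and optimizer settings are illustrative assumptions rather than the authors' exact implementation.

```python
# Schematic sketch of TPT's test-time objective: average the predictive
# distributions of the most confident augmented views and minimize the
# entropy of that marginal, updating only the prompt context vectors.
# `clip_with_prompt` is an assumed wrapper: (images, ctx) -> class logits.
import torch


def tpt_step(clip_with_prompt, ctx, aug_views, keep_ratio=0.1, lr=5e-3):
    """One prompt-tuning step on a single test sample.

    clip_with_prompt: callable (images, ctx) -> class logits
    ctx:              learnable prompt context embeddings (requires_grad=True)
    aug_views:        N augmented views of one test image, shape [N, C, H, W]
    """
    optimizer = torch.optim.AdamW([ctx], lr=lr)

    logits = clip_with_prompt(aug_views, ctx)            # [N, num_classes]
    probs = logits.softmax(dim=-1)

    # Confidence selection: keep the views with the lowest prediction entropy.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # [N]
    n_keep = max(1, int(keep_ratio * probs.size(0)))
    keep_idx = entropy.topk(n_keep, largest=False).indices

    # Marginal entropy of the averaged distribution over the confident views.
    avg_probs = probs[keep_idx].mean(dim=0)
    loss = -(avg_probs * avg_probs.clamp_min(1e-12).log()).sum()

    optimizer.zero_grad()
    loss.backward()      # gradients flow only into the prompt context `ctx`
    optimizer.step()
    return loss.item()
```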

Two distinct applications serve to evaluate the effectiveness of TPT: image classification under natural distribution shifts and context-dependent visual reasoning on the Bongard-HOI benchmark. On the natural-distribution-shift benchmarks, TPT improves zero-shot top-1 accuracy by 3.6% on average over the baseline CLIP configuration. Notably, TPT holds its ground against state-of-the-art approaches that require additional training data, indicating that test-time prompt adaptation can rival more data-intensive methods.

Implications and Future Directions

The implications of TPT are significant: it extends the usability of vision-language models across diverse domains without the burden of retraining on extensive annotated datasets. This adaptability is crucial, as deployed models routinely encounter unforeseen data distributions. Looking ahead, extending TPT to other multi-modal foundation models, or incorporating more sophisticated augmentation and selection mechanisms, could unlock new frontiers in zero-shot learning.

Furthermore, TPT's paradigm raises interesting research questions about the limits of prompt-based tuning for generalization across a spectrum of complex tasks, including those spanning modalities beyond text and vision. As AI systems increasingly pursue generalist capabilities, methodologies like TPT that offer streamlined, data-efficient adaptation are likely to become central to how such systems are developed and deployed.

Overall, this work presents an elegant solution to a pressing problem within the domain of pre-trained vision-language models, setting the stage for subsequent innovation in the synthesis and application of adaptive learning systems.

Authors (7)
  1. Manli Shu (23 papers)
  2. Weili Nie (41 papers)
  3. De-An Huang (45 papers)
  4. Zhiding Yu (94 papers)
  5. Tom Goldstein (226 papers)
  6. Anima Anandkumar (236 papers)
  7. Chaowei Xiao (110 papers)
Citations (224)