
PLOT: Prompt Learning with Optimal Transport for Vision-Language Models (2210.01253v2)

Published 3 Oct 2022 in cs.CV, cs.CL, and cs.LG

Abstract: With the increasing attention to large vision-language models such as CLIP, there has been a significant amount of effort dedicated to building efficient prompts. Unlike conventional methods of only learning one single prompt, we propose to learn multiple comprehensive prompts to describe diverse characteristics of categories such as intrinsic attributes or extrinsic contexts. However, directly matching each prompt to the same visual feature is problematic, as it pushes the prompts to converge to one point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model images and the categories with visual and textual feature sets. Then, we apply a two-stage optimization strategy to learn the prompts. In the inner loop, we optimize the optimal transport distance to align visual features and prompts by the Sinkhorn algorithm, while in the outer loop, we learn the prompts by this distance from the supervised data. Extensive experiments are conducted on the few-shot recognition task and the improvement demonstrates the superiority of our method. The code is available at https://github.com/CHENGY12/PLOT.

Citations (54)

Summary

  • The paper introduces PLOT, a method that uses optimal transport to learn multiple distinct prompts for detailed vision-language alignment.
  • It employs a two-stage optimization with the Sinkhorn algorithm to minimize cross-modal discrepancies and enhance few-shot classification.
  • Experimental results across 11 benchmarks show that PLOT significantly improves recognition performance and generalizes well to domain shifts.

An Analytical Overview of PLOT: Prompt Learning with Optimal Transport for Vision-Language Models

The paper under discussion introduces PLOT, a novel approach to prompt learning that adapts pre-trained large-scale vision-language models such as CLIP to few-shot learning scenarios. Rather than learning a single prompt per category, the authors learn multiple prompts and employ optimal transport (OT) to align the vision and text modalities efficiently.

The primary innovation of PLOT lies in its use of optimal transport to learn a diverse set of prompts that capture the various intrinsic attributes or contexts associated with a class. Conventional methods match every prompt to the same visual feature, which pushes the prompts to converge toward a single, redundant representation. PLOT instead establishes a fine-grained alignment between a set of local visual features and the multiple textual prompts, so that each prompt can specialize in a different aspect of the class.
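Concretely, the matching can be written as a discrete optimal transport problem. The notation below is chosen here for exposition: $\{f_m\}_{m=1}^{M}$ are an image's local visual features, $\{g_n\}_{n=1}^{N}$ are the features of the $N$ learned prompts for one class, and the cost of pairing them is their cosine distance, $C_{mn} = 1 - f_m^{\top} g_n$. The OT distance between the two feature sets is

$$
d_{\mathrm{OT}} \;=\; \min_{T \in \mathbb{R}_{+}^{M \times N}} \;\langle T, C \rangle
\quad \text{s.t.} \quad T\,\mathbf{1}_N = u, \;\; T^{\top}\mathbf{1}_M = v,
$$

where $u = \mathbf{1}_M / M$ and $v = \mathbf{1}_N / N$ are uniform marginals over the visual features and prompts. Adding an entropic regularizer $-\varepsilon H(T)$ makes the problem strictly convex and solvable by fast Sinkhorn iterations, which is what the paper's inner loop relies on.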

Methodological Framework

PLOT applies optimal transport (OT) to bridge the visual and textual feature spaces. The method involves:

  1. Visual and Textual Feature Representation: The model represents both images and classes as feature sets. For a given image, local visual features are taken from the feature map of CLIP's visual encoder rather than relying solely on the global feature. For each class, multiple prompts are generated, each intended to capture a different aspect or characteristic of the class.
  2. Two-stage Optimization Strategy: In the inner loop, the Sinkhorn algorithm solves the entropic OT problem to obtain the transport plan and the transport distance (a minimal Sinkhorn sketch follows this list). In the outer loop, this distance serves as the matching score that supervises prompt learning, encouraging the prompts to be distinct and comprehensive.
  3. Cross-modal Learning through Optimal Transport: By treating the visual and textual feature sets as discrete distributions, the OT distance is minimized to align the two modalities. Because the transport plan spreads mass across many feature-prompt pairs, the resulting alignment reflects multiple aspects of each class rather than a single dominant one.
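A minimal NumPy sketch of this inner loop is shown below. It is illustrative rather than the paper's implementation: the function names, the regularization strength `eps`, the iteration count, and the temperature `tau` are assumptions made here, and the actual method runs a fixed number of Sinkhorn iterations inside a differentiable training pipeline.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=100):
    """Solve entropic-regularized OT with uniform marginals.

    cost: (M, N) cost matrix between M local visual features
          and N prompt features.
    Returns the transport plan T of shape (M, N).
    """
    M, N = cost.shape
    u = np.ones(M) / M            # uniform marginal over visual features
    v = np.ones(N) / N            # uniform marginal over prompts
    K = np.exp(-cost / eps)       # Gibbs kernel
    a = np.ones(M)
    for _ in range(n_iters):
        b = v / (K.T @ a)         # rescale to satisfy the prompt marginal
        a = u / (K @ b)           # rescale to satisfy the visual marginal
    return a[:, None] * K * b[None, :]

def ot_distance(visual_feats, prompt_feats, eps=0.1):
    """OT distance between an image's feature set and one class's prompts.

    visual_feats: (M, D) L2-normalized local visual features.
    prompt_feats: (N, D) L2-normalized text features of N prompts.
    """
    cost = 1.0 - visual_feats @ prompt_feats.T   # cosine-distance cost
    T = sinkhorn(cost, eps)
    return np.sum(T * cost)

def class_probabilities(visual_feats, prompts_per_class, tau=0.01):
    """CLIP-style softmax over negative OT distances to each class."""
    d = np.array([ot_distance(visual_feats, p) for p in prompts_per_class])
    logits = -d / tau
    logits -= logits.max()        # numerical stability
    e = np.exp(logits)
    return e / e.sum()
```

In the outer loop, a cross-entropy loss between these class probabilities and the ground-truth labels is what updates the learnable prompt vectors; per the paper's two-stage scheme, the transport plan is held fixed during that gradient step.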

Experimental Results and Implications

The experiments, conducted over 11 benchmark datasets spanning generic objects, scenes, and actions, demonstrated significant improvements in few-shot learning performance. In particular, the results showed substantial gains on recognition tasks compared to CLIP adaptations that learn a single prompt. These results underscore PLOT's ability to exploit vision-language pretrained knowledge more effectively, notably in settings where little labeled data per class is available.

The evaluation also considered robustness to domain shifts using variants of the ImageNet dataset. Here, PLOT reportedly performed better than its baselines, suggesting enhanced generalization capabilities stemming from its multi-faceted prompt representations.

Implications and Future Directions

Practically, PLOT offers an effective method for applications that must adapt to new categories with minimal data. Theoretically, it opens avenues for applying optimal transport to cross-modal learning, with potential beyond vision-language alignment, for example in multi-modal fusion tasks.

Further, because the method adapts to very few training shots without structural changes to the model, it offers substantial computational and resource efficiency, aligning well with real-world operational constraints.

For future work, exploring an adaptive number and complexity of prompts could refine the trade-off between computational load and accuracy. Additionally, investigating the OT framework's applicability in zero-shot settings may offer further clarity on its broader utility.

In conclusion, this paper proposes a technically sound and efficiently designed system aimed at advancing the use of large-scale vision-language models in constrained learning environments, establishing a new benchmark for prompt learning methodologies.
