Tuning Multi-mode Token-level Prompt Alignment across Modalities (2309.13847v2)

Published 25 Sep 2023 in cs.CV

Abstract: Advancements in prompt tuning of vision-language models have underscored their potential for enhancing open-world visual concept comprehension. However, prior works primarily focus on single-mode (one prompt per modality) and holistic-level (image or sentence) semantic alignment, which fails to capture sample diversity and leads to sub-optimal prompt discovery. To address this limitation, we propose a multi-mode token-level tuning framework that leverages optimal transport to learn and align a set of prompt tokens across modalities. Specifically, we rely on two essential factors: 1) multi-mode prompt discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity. Consequently, the similarity can be calculated as a hierarchical transportation problem between the modality-specific sets. Extensive experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach. Qualitative analysis demonstrates that the learned prompt tokens can capture diverse visual concepts.
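The token-level alignment the abstract describes can be illustrated with a minimal sketch: entropic-regularized optimal transport (Sinkhorn iterations) between a set of prompt-token embeddings and a set of visual-patch embeddings, with the negative transport cost serving as a fine-grained similarity. This is a toy NumPy illustration of the general technique, not the authors' implementation; all function names, shapes, and hyperparameters (`eps`, `n_iters`) are assumptions.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=50):
    """Entropic-regularized OT plan for a cost matrix with uniform
    marginals, via Sinkhorn-Knopp iterations. Returns the plan T."""
    n, m = cost.shape
    K = np.exp(-cost / eps)        # Gibbs kernel
    a = np.ones(n) / n             # uniform source marginal (prompt tokens)
    b = np.ones(m) / m             # uniform target marginal (visual tokens)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

def ot_similarity(prompt_tokens, visual_tokens):
    """Token-level similarity between one textual prompt (n x d) and one
    image (m x d), rows L2-normalized: negative OT distance under a
    cosine cost."""
    cost = 1.0 - prompt_tokens @ visual_tokens.T   # cosine cost matrix
    plan = sinkhorn(cost)
    return -np.sum(plan * cost)

# Toy usage: 4 prompt tokens vs. 6 visual patches, 8-dim embeddings.
rng = np.random.default_rng(0)
P = rng.normal(size=(4, 8)); P /= np.linalg.norm(P, axis=1, keepdims=True)
V = rng.normal(size=(6, 8)); V /= np.linalg.norm(V, axis=1, keepdims=True)
print(ot_similarity(P, V))
```

The "hierarchical" aspect in the paper would then correspond to a second transport problem at the set level, matching multiple prompts against multiple images using these token-level distances as the outer cost; that outer level is omitted here for brevity.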

Authors (6)
  1. Dongsheng Wang (47 papers)
  2. Miaoge Li (7 papers)
  3. Xinyang Liu (43 papers)
  4. MingSheng Xu (5 papers)
  5. Bo Chen (309 papers)
  6. Hanwang Zhang (161 papers)
Citations (11)