RankCLIP: Ranking-Consistent Language-Image Pretraining (2404.09387v2)
Abstract: Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to a list-wise formulation, and leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the importance of this enhanced learning process.
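To make the list-wise idea concrete, below is a minimal PyTorch sketch of a ranking-consistent training objective: it keeps CLIP's standard pairwise contrastive term and adds a listwise term that encourages the cross-modal similarity rankings within a batch to agree with the in-modal (image-image and text-text) rankings. This is an illustration under stated assumptions, not the paper's implementation: the listwise term here is a generic ListMLE (Plackett-Luce) loss, and the temperature `tau` and the 0.1 weighting are illustrative placeholders rather than values from RankCLIP.

```python
# Hedged sketch of a ranking-consistency loss in the spirit of the abstract above.
# The listwise component (ListMLE) and the 0.1 weight are illustrative choices,
# not the exact RankCLIP objective.
import torch
import torch.nn.functional as F


def listmle(target_scores: torch.Tensor, pred_scores: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Negative Plackett-Luce log-likelihood of the ordering induced by
    `target_scores` under `pred_scores`. Both tensors have shape (N, N)."""
    order = target_scores.argsort(dim=1, descending=True)        # target ranking per row
    s = torch.gather(pred_scores / tau, 1, order)                 # predictions, reordered
    # logsumexp over each suffix s[:, k:], via flip + logcumsumexp + flip back.
    suffix_lse = torch.flip(torch.logcumsumexp(torch.flip(s, dims=[1]), dim=1), dims=[1])
    return (suffix_lse - s).sum(dim=1).mean()


def ranking_consistent_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (N, D) embeddings of N matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    sim_it = img_emb @ txt_emb.t()   # cross-modal similarities
    sim_ii = img_emb @ img_emb.t()   # in-modal: image-image
    sim_tt = txt_emb @ txt_emb.t()   # in-modal: text-text

    # Standard CLIP pairwise (one-to-one) contrastive term.
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    clip_loss = 0.5 * (F.cross_entropy(sim_it / tau, labels)
                       + F.cross_entropy(sim_it.t() / tau, labels))

    # Listwise consistency: each row of the cross-modal similarity matrix should
    # rank the batch the same way the corresponding in-modal similarities do.
    rank_loss = 0.5 * (listmle(sim_ii, sim_it) + listmle(sim_tt, sim_it.t()))

    return clip_loss + 0.1 * rank_loss  # illustrative weighting of the listwise term
```

In a typical pretraining loop, `ranking_consistent_loss` would be applied to each batch of paired image and text embeddings produced by the two encoders; the pairwise term preserves the usual one-to-one matching signal, while the listwise term injects the many-to-many ranking structure the abstract describes.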
Authors: Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun