
Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP (2310.00927v2)

Published 2 Oct 2023 in cs.LG, cs.AI, and stat.ML

Abstract: Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve model performance. Recently, CLIP has emerged as an effective approach that employs vision-language contrastive pretraining to learn joint image and text representations, and it exhibits remarkable performance in zero-shot learning and text-guided natural image generation. Despite the huge practical success of CLIP, its theoretical understanding remains elusive. In this paper, we formally study the transferable representation learning underlying CLIP and demonstrate how features from different modalities become aligned. We also analyze its zero-shot transfer performance on downstream tasks. Inspired by our analysis, we propose a new CLIP-type approach that achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
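As a rough illustration of the vision-language contrastive pretraining the abstract refers to (a minimal sketch, not the paper's analysis or proposed method), CLIP-style training minimizes a symmetric cross-entropy over cosine similarities between paired image and text embeddings. The temperature value below is an assumed typical choice:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    Matched pairs sit on the diagonal of the similarity matrix and serve as
    the positive class for both the image-to-text and text-to-image directions.
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity logits, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    idx = np.arange(n)

    # Cross-entropy in both directions; targets are the diagonal entries
    log_softmax_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_softmax_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    loss_i2t = -log_softmax_i2t[idx, idx].mean()
    loss_t2i = -log_softmax_t2i[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2
```

With perfectly aligned modalities (identical embeddings per pair) the diagonal logits dominate and the loss approaches zero, which is the feature-alignment behavior the paper studies formally.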

Authors (4)
  1. Zixiang Chen (28 papers)
  2. Yihe Deng (16 papers)
  3. Yuanzhi Li (119 papers)
  4. Quanquan Gu (198 papers)
Citations (9)