Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning (2207.07635v1)

Published 15 Jul 2022 in cs.CV, cs.LG, and stat.ML

Abstract: The development of CLIP [Radford et al., 2021] has sparked a debate on whether language supervision can result in vision models with more transferable representations than traditional image-only methods. Our work studies this question through a carefully controlled comparison of two approaches in terms of their ability to learn representations that generalize to downstream classification tasks. We find that when the pre-training dataset meets certain criteria -- it is sufficiently large and contains descriptive captions with low variability -- image-only methods do not match CLIP's transfer performance, even when they are trained with more image data. However, contrary to what one might expect, there are practical settings in which these criteria are not met, wherein added supervision through captions is actually detrimental. Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.

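For context, the "language supervision" the abstract contrasts with image-only pre-training is CLIP's contrastive objective, which matches each image in a batch to its own caption. Below is a minimal sketch of that symmetric InfoNCE loss, assuming PyTorch; the function name, the temperature value, and the assumption of pre-normalized embeddings are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over paired image/caption embeddings.

    image_features, text_features: (batch, dim) tensors, assumed L2-normalized.
    """
    # Cosine-similarity logits between every image and every caption in the batch.
    logits = image_features @ text_features.t() / temperature  # (batch, batch)
    # The matching pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # classify each image's caption
    loss_t2i = F.cross_entropy(logits.t(), targets)  # classify each caption's image
    return (loss_i2t + loss_t2i) / 2
```

The paper's controlled comparison holds data and architecture fixed and varies only whether this caption-matching signal is used, which is what isolates the effect of language supervision on transfer performance.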
Authors (5)
  1. Shibani Santurkar (26 papers)
  2. Yann Dubois (16 papers)
  3. Rohan Taori (14 papers)
  4. Percy Liang (239 papers)
  5. Tatsunori Hashimoto (80 papers)
Citations (41)