Parrot Captions Teach CLIP to Spot Text (2312.14232v3)

Published 21 Dec 2023 in cs.CV and cs.AI

Abstract: Despite CLIP being the foundation model in numerous vision-language applications, CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to "parrot" the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content, and around 30% of caption words appear in this embedded visual content. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that visual text is the dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.

Understanding CLIP's Visual Text Bias

Text Bias in Vision-Language Models

CLIP (Contrastive Language–Image Pretraining) is a foundational model for numerous vision-language tasks, and its ability to link visual content with relevant text descriptions is pivotal. However, CLIP models exhibit a strong bias toward identifying and fixating on text embedded within images. This tendency to "parrot", mimicking the text rendered in a picture rather than interpreting its genuine visual content, raises serious concerns about how well these models comprehend visual semantics.

To explore this, a comprehensive study was conducted on LAION-2B, the most widely used image-text dataset, revealing that roughly half of its images contain embedded visual text. Moreover, a large share of the captions directly repeat words found in that embedded text, underscoring how heavily CLIP models weight the text within images, at times to the detriment of the visual context.
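The notion of a "parrot caption" can be made concrete by measuring how many caption words also appear as rendered text inside the image. The sketch below illustrates one way to estimate this for a single image-caption pair; it assumes `pytesseract` (Tesseract OCR) as the text spotter and a hypothetical `sample.jpg`, whereas the paper relies on a dedicated text-spotting model applied to LAION-2B, so this is illustrative only.

```python
# Minimal sketch: estimate how much of a caption is "parroted" from the image.
# Assumes pytesseract as the OCR backend; the paper uses a dedicated text spotter.
import re

from PIL import Image
import pytesseract


def parrot_fraction(image_path: str, caption: str) -> float:
    """Fraction of caption words that also appear in the OCR-detected image text."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path)).lower()
    ocr_words = set(re.findall(r"[a-z0-9]+", ocr_text))
    caption_words = re.findall(r"[a-z0-9]+", caption.lower())
    if not caption_words:
        return 0.0
    hits = sum(w in ocr_words for w in caption_words)
    return hits / len(caption_words)


# A product photo whose caption repeats the printed brand name would score
# close to 1.0, while a purely descriptive caption would score near 0.
score = parrot_fraction("sample.jpg", "ACME super glue 20g tube")
print(f"caption words spotted in image: {score:.0%}")
```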

The Role of Dataset Curation

The analysis further showed that the prevailing data curation practice, which relies heavily on CLIP-derived similarity scores, unintentionally reinforces this visual text bias. Inspecting various released CLIP models made clear that they are indeed text-centric when scoring image-text pairs. Training new models on data subsets selected for embedded text markedly strengthened their text spotting capabilities and, as an unintended consequence, impaired their understanding of visual context.
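For context, LAION-style curation keeps an image-text pair only if its CLIP similarity clears a threshold. The sketch below shows one way such a score could be computed with the Hugging Face `transformers` CLIP implementation; the checkpoint name and the roughly 0.28 cutoff are assumptions standing in for the original pipeline's settings, not an exact reproduction of it.

```python
# Sketch of a LAION-style CLIP score and threshold filter.
# Checkpoint and threshold are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())


# Keep the pair only if it clears the similarity threshold. The paper argues this
# criterion is easily satisfied by captions that merely spell out text rendered
# in the image, which is how parrot captions survive curation.
keep = clip_score("sample.jpg", "ACME super glue 20g tube") >= 0.28
```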

Consequences for Representation Learning

To probe the influence of this bias more deeply, a series of CLIP models were trained on carefully chosen subsets of LAION-2B. Models trained on data dominated by parrot captions, though adept at spotting text, showed a marked drop in generalization on downstream image-text tasks. This indicates that the bias critically undermines the models' ability to learn vision-language semantics effectively.

Looking Towards a Bias-Free Future

In response, a less biased variant of LAION-2B was created by excluding images with detected embedded text. Re-training CLIP on this revised dataset demonstrated that a balance can be struck: the model retains strong performance without inheriting the unwanted text-spotting bias. The paper calls for revisiting both data curation pipelines and model design to mitigate the influence of parrot captions and ensure models learn truly integrative vision-language representations.
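As a rough illustration of that curation step, the sketch below keeps only samples whose images contain no detectable visual text. It again assumes `pytesseract` as the detector and an in-memory list of (image_path, caption) pairs standing in for LAION-2B metadata; the paper's pipeline uses a stronger text-spotting model at web scale.

```python
# Sketch of building a "text-free" subset: drop samples whose images contain
# detectable visual text. OCR backend and data layout are illustrative assumptions.
from PIL import Image
import pytesseract


def has_visual_text(image_path: str) -> bool:
    """True if OCR finds any alphanumeric content in the image."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    return any(ch.isalnum() for ch in ocr_text)


def text_free_subset(samples):
    """Keep only (image_path, caption) pairs whose image embeds no visual text."""
    return [(path, cap) for path, cap in samples if not has_visual_text(path)]
```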

In summary, while CLIP has advanced vision-language tasks, this investigation prompts a critical reassessment of the biases entrenched in these models. Moving forward, it is imperative to refine data curation and training protocols to overcome the text-spotting bias and strengthen models' genuine visual-language understanding.

Authors (6)
  1. Yiqi Lin (14 papers)
  2. Conghui He (114 papers)
  3. Alex Jinpeng Wang (20 papers)
  4. Bin Wang (750 papers)
  5. Weijia Li (39 papers)
  6. Mike Zheng Shou (165 papers)
Citations (4)