Parrot Captions Teach CLIP to Spot Text (2312.14232v3)

Published 21 Dec 2023 in cs.CV and cs.AI

Abstract: Despite CLIP being the foundation model in numerous vision-language applications, CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to "parrot" the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content, and around 30% of caption words appear in this embedded visual content. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that visual text is the dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.

Understanding CLIP's Visual Text Bias

Text Bias in Vision-Language Models

CLIP (Contrastive Language–Image Pretraining) is a foundational model for numerous vision-language tasks, and its ability to link visual content with relevant text descriptions is pivotal. However, CLIP models exhibit a strong bias toward identifying and fixating on text embedded within images. This tendency to "parrot", mimicking the text rendered in a picture rather than interpreting its genuine visual content, raises serious concerns about how well these models comprehend visual semantics.

To explore this, a comprehensive study was conducted on LAION-2B, the most widely used image-text dataset, revealing that roughly half of its images contain embedded visual text. Moreover, a large share of the captions directly repeat words found in that embedded text, underscoring how heavily CLIP models weight the text within images, at times to the detriment of the visual context.
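The notion of a "parrot caption" can be made concrete by measuring how many caption words also appear as rendered text inside the image. The sketch below illustrates one way to estimate this for a single image-caption pair; it assumes `pytesseract` (Tesseract OCR) as the text spotter and a hypothetical `sample.jpg`, whereas the paper relies on a dedicated text-spotting model applied to LAION-2B, so this is illustrative only.

```python
# Minimal sketch: estimate how much of a caption is "parroted" from the image.
# Assumes pytesseract as the OCR backend; the paper uses a dedicated text spotter.
import re

from PIL import Image
import pytesseract


def parrot_fraction(image_path: str, caption: str) -> float:
    """Fraction of caption words that also appear in the OCR-detected image text."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path)).lower()
    ocr_words = set(re.findall(r"[a-z0-9]+", ocr_text))
    caption_words = re.findall(r"[a-z0-9]+", caption.lower())
    if not caption_words:
        return 0.0
    hits = sum(w in ocr_words for w in caption_words)
    return hits / len(caption_words)


# A product photo whose caption repeats the printed brand name would score
# close to 1.0, while a purely descriptive caption would score near 0.
score = parrot_fraction("sample.jpg", "ACME super glue 20g tube")
print(f"caption words spotted in image: {score:.0%}")
```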

The Role of Dataset Curation

The analysis further showed that the prevailing data curation practice, which relies heavily on CLIP-derived similarity scores, unintentionally reinforces this visual text bias. Inspecting various released CLIP models made clear that they are indeed text-centric when scoring image-text pairs. Training new models on data subsets selected for embedded text markedly strengthened their text spotting capabilities and, as an unintended consequence, impaired their understanding of visual context.
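For context, LAION-style curation keeps an image-text pair only if its CLIP similarity clears a threshold. The sketch below shows one way such a score could be computed with the Hugging Face `transformers` CLIP implementation; the checkpoint name and the roughly 0.28 cutoff are assumptions standing in for the original pipeline's settings, not an exact reproduction of it.

```python
# Sketch of a LAION-style CLIP score and threshold filter.
# Checkpoint and threshold are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())


# Keep the pair only if it clears the similarity threshold. The paper argues this
# criterion is easily satisfied by captions that merely spell out text rendered
# in the image, which is how parrot captions survive curation.
keep = clip_score("sample.jpg", "ACME super glue 20g tube") >= 0.28
```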

Consequences for Representation Learning

To probe the influence of this bias more deeply, a series of CLIP models were trained on carefully chosen subsets of LAION-2B. Models trained on data dominated by parrot captions, though adept at spotting text, showed a marked drop in generalization on downstream image-text tasks. This indicates that the bias critically undermines the models' ability to learn vision-language semantics effectively.

Looking Towards a Bias-Free Future

In response, a less biased variant of LAION-2B was created by excluding images with detected embedded text. Re-training CLIP on this revised dataset demonstrated that a balance can be struck: the model retains strong performance without inheriting the unwanted text-spotting bias. The paper calls for revisiting both data curation pipelines and model design to mitigate the influence of parrot captions and ensure models learn truly integrative vision-language representations.
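As a rough illustration of that curation step, the sketch below keeps only samples whose images contain no detectable visual text. It again assumes `pytesseract` as the detector and an in-memory list of (image_path, caption) pairs standing in for LAION-2B metadata; the paper's pipeline uses a stronger text-spotting model at web scale.

```python
# Sketch of building a "text-free" subset: drop samples whose images contain
# detectable visual text. OCR backend and data layout are illustrative assumptions.
from PIL import Image
import pytesseract


def has_visual_text(image_path: str) -> bool:
    """True if OCR finds any alphanumeric content in the image."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    return any(ch.isalnum() for ch in ocr_text)


def text_free_subset(samples):
    """Keep only (image_path, caption) pairs whose image embeds no visual text."""
    return [(path, cap) for path, cap in samples if not has_visual_text(path)]
```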

In summary, while CLIP has advanced vision-language tasks, this investigation prompts a critical reassessment of the biases entrenched in these models. Moving forward, it is imperative to refine data curation and training protocols to overcome the text-spotting bias and strengthen models' genuine visual-language understanding.

Authors (6)
  1. Yiqi Lin (14 papers)
  2. Conghui He (114 papers)
  3. Alex Jinpeng Wang (20 papers)
  4. Bin Wang (750 papers)
  5. Weijia Li (39 papers)
  6. Mike Zheng Shou (165 papers)
Citations (4)