VeCLIP: Enhancing CLIP Performance via Visual-enriched Captions
The paper "VeCLIP: Improving CLIP Training via Visual-enriched Captions" presents a method for improving CLIP (Contrastive Language-Image Pre-training) training by generating and leveraging visually enriched captions that strengthen image-text alignment. By addressing the noise inherent in web-crawled AltText, the work improves the quality, diversity, and scalability of CLIP's training data.
Core Contributions
The authors introduce a scalable pipeline for rewriting noisy web captions, termed Visual-enriched Captions (VeCap), which incorporates visual concepts extracted directly from the images. Unlike conventional LLM-based caption rewriting, which has mainly been demonstrated on curated datasets, this approach remains effective and scalable on large-scale, web-crawled data.
- Visual-enriched Captions (VeCap): VeCap is generated in two steps (a pipeline sketch follows this list). First, LLaVA, a Large Language and Vision Assistant, extracts visual concepts from the image independently of its AltText. Second, an LLM fuses the extracted visual description with the original AltText into one concise caption while preserving the AltText's core semantics.
- Mixed Training Scheme: To counter the loss of data diversity caused by the uniform style of LLM rewriting, VeCLIP employs a mixed training strategy that alternates between original AltTexts and the enhanced VeCap captions (a sampling sketch also follows the list), maintaining information richness and preventing overfitting to a single caption style.
- Data Efficiency and Scalability: VeCLIP achieves substantial improvements with a much smaller dataset than established models such as CLIP trained on 400 million pairs. Under the 3M setting it yields an R@1 improvement of over 16% on retrieval tasks compared to standard CLIP, while using less than 14% of the data required by the original models.
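The two-step rewriting pipeline can be summarized in a short sketch. The code below is a minimal illustration under stated assumptions, not the authors' released implementation: `llava_describe` and `llm_rewrite` are hypothetical stand-ins for calls to LLaVA and a rewriting LLM, and the prompt wording is invented for illustration.

```python
from typing import Callable


def generate_vecap(
    image_path: str,
    alt_text: str,
    llava_describe: Callable[[str, str], str],
    llm_rewrite: Callable[[str], str],
) -> str:
    """Sketch of the two-step VeCap generation (assumed interfaces)."""
    # Step 1: extract visual concepts directly from the image,
    # without looking at the (possibly noisy) AltText.
    visual_caption = llava_describe(
        image_path, "Describe the salient objects, attributes, and scene."
    )

    # Step 2: ask an LLM to fuse the AltText with the visual description
    # into one concise caption, keeping the AltText's core semantics.
    fusion_prompt = (
        "Merge the two descriptions below into a single fluent caption. "
        "Keep the factual content of the AltText and add visual details "
        "from the image description. Do not invent new facts.\n"
        f"AltText: {alt_text}\n"
        f"Image description: {visual_caption}"
    )
    return llm_rewrite(fusion_prompt)
```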
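The mixed training scheme amounts to choosing, per sample, whether the text paired with an image comes from the original AltText or from VeCap. The dataset sketch below assumes a simple per-sample random choice and a hypothetical record layout; the actual sampling ratio and data format are not specified here.

```python
import random

from torch.utils.data import Dataset


class MixedCaptionDataset(Dataset):
    """Pairs each image with either its AltText or its VeCap caption.

    Sketch only: the sampling probability `p_vecap` and the record fields
    ('image', 'alt_text', 'vecap') are assumptions for illustration.
    """

    def __init__(self, records: list, p_vecap: float = 0.5):
        self.records = records
        self.p_vecap = p_vecap

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        rec = self.records[idx]
        # Randomly alternate caption sources so the model sees both the
        # diverse (but noisy) AltText and the enriched VeCap text.
        use_vecap = random.random() < self.p_vecap
        caption = rec["vecap"] if use_vecap else rec["alt_text"]
        return rec["image"], caption
```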
Numerical Results and Performance
VeCLIP's performance is evaluated on image-to-text and text-to-image retrieval using COCO and Flickr30k, showing consistent gains in Recall@1, Recall@5, and Recall@10. For instance, VeCLIP achieves up to a +25.2% improvement on COCO retrieval under the 12M dataset setting. Gains in zero-shot ImageNet classification further confirm the method's robustness and practicality.
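As a reference for how these metrics are computed, image-to-text Recall@K counts how often an image's ground-truth caption appears among its top-K texts ranked by embedding similarity. The snippet below is a generic sketch using precomputed, L2-normalized embeddings and assumes one ground-truth caption per image; it is not the paper's evaluation code.

```python
import numpy as np


def recall_at_k(image_embs: np.ndarray,
                text_embs: np.ndarray,
                gt_text_idx: np.ndarray,
                k: int = 1) -> float:
    """Image-to-text Recall@K with L2-normalized embeddings.

    image_embs: (N, D) image embeddings; text_embs: (M, D) text embeddings;
    gt_text_idx: (N,) index of each image's ground-truth caption.
    """
    sims = image_embs @ text_embs.T                # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]        # top-K text indices per image
    hits = (topk == gt_text_idx[:, None]).any(axis=1)
    return float(hits.mean())
```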
Implications and Future Directions
The implications of this research are evident in both theory and practice. Theoretically, it shows that grounding LLM-based caption rewriting in visual context is a scalable way to handle noisy, real-world datasets. Practically, the approach reduces the need for large, manually annotated datasets, enabling more efficient use of resources in model training.
Future work could examine integrating VeCap with other dataset curation methods, or investigate its impact on multimodal models beyond CLIP. Additionally, potential ethical concerns and factual inaccuracies or hallucinations introduced by the LLMs merit further scrutiny to improve the overall quality of the generated captions.
In conclusion, the work presents a compelling advancement in CLIP training under noisy conditions, equipping models to better leverage large-scale, web-derived data while achieving superior benchmark performance. The balance between caption quality and caption diversity appears crucial, suggesting that similar methodologies could be extended to other multimodal systems, further broadening the scope of scalable AI training paradigms.