What If We Recaption Billions of Web Images with LLaMA-3? (2406.08478v2)

Published 12 Jun 2024 in cs.CV and cs.CL

Abstract: Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/

Recaptioning Billions of Web Images Using LLaMA-3

The research paper "What If We Recaption Billions of Web Images with LLaMA-3?" addresses the prevailing issue of noisy and misaligned image-text datasets obtained through web crawling. The authors fine-tune a LLaMA-3-8B powered LLaVA-1.5 model to recaption 1.3 billion images from the DataComp-1B dataset, producing a new dataset named Recap-DataComp-1B. Their empirical results demonstrate significant gains in training vision-language models, including improvements in cross-modal retrieval and text-to-image generation.

Introduction and Motivation

The acquisition of massive datasets through web crawling has driven significant progress in deep learning over the past decade, as evidenced by datasets like LAION-400M and LAION-5B. However, these datasets suffer from quality issues such as misalignment and a lack of descriptive detail. Prior work has shown that enhancing the textual descriptions of image-text pairs can improve the training of vision-language models. Existing methods to achieve this, such as human-in-the-loop systems or other automated pipelines, often remain closed-source.

The paper aims to bridge this gap by leveraging the open-sourced LLaMA-3 model, which exhibits capabilities comparable to GPT-4. The authors fine-tune a LLaMA-3-8B model within the LLaVA framework and use it to recaption the entire DataComp-1B dataset. The resulting dataset, Recap-DataComp-1B, is intended to enhance the training of advanced vision-language models.

Related Works

The field of vision-language foundation models has been significantly advanced by models such as CLIP, which link images and text by training on large-scale datasets. However, one enduring challenge is the quality of web-crawled image-text data, which often suffers from poor alignment and overly brief textual descriptions. Solutions include data filtering and recaptioning, with the latter gaining traction due to its ability to enrich textual quality. Notable efforts in this direction include recaptioning with multimodal models such as BLIP2 and frameworks such as LLaVA.

Recaptioning Pipeline

The authors use an enhanced LLaMA-3-powered LLaVA model for their recaptioning pipeline. The model comprises a vision encoder and a language decoder, with an MLP layer projecting visual features into the language embedding space. The pipeline involves two training stages: the first trains only the projection MLP, while the second fine-tunes both the MLP and the language decoder. Quality enhancements are validated through multi-modal benchmarks like MMMU and MM-Vet, where the recaptioning model demonstrates superior visual understanding and reasoning.
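The following is a minimal PyTorch sketch of this two-stage setup, assuming placeholder modules for the pretrained vision encoder and the LLaMA-3-8B decoder; the dimensions, module choices, and the `set_trainable` helper are illustrative rather than the authors' exact configuration.

```python
# Minimal sketch of a LLaVA-style recaptioner: vision encoder -> MLP projector -> language decoder.
# All modules below are placeholders standing in for the pretrained components.
import torch
import torch.nn as nn

class RecaptionerSketch(nn.Module):
    def __init__(self, vision_dim=1024, hidden_dim=4096):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stand-in for a pretrained ViT
        self.projector = nn.Sequential(                          # MLP projecting visual features
            nn.Linear(vision_dim, hidden_dim), nn.GELU(),        # into the language embedding space
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.language_decoder = nn.TransformerDecoderLayer(      # stand-in for LLaMA-3-8B
            d_model=hidden_dim, nhead=8, batch_first=True
        )

    def forward(self, image_feats, text_embeds):
        visual_tokens = self.projector(self.vision_encoder(image_feats))
        # The decoder attends to the projected visual tokens while generating the caption.
        return self.language_decoder(text_embeds, visual_tokens)

def set_trainable(model: RecaptionerSketch, stage: int):
    """Stage 1: train only the projection MLP. Stage 2: also fine-tune the language decoder."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.language_decoder.parameters():
            p.requires_grad = True
```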

Data Recaptioning

The authors apply their fine-tuned model to the DataComp-1B dataset, generating enhanced captions for 1.3 billion images. This effort yields Recap-DataComp-1B, which features more detailed and better-aligned textual descriptions compared to the original DataComp-1B dataset. A quantitative analysis reveals that the recaptioned data exhibits richer vocabulary and longer average sequence lengths, confirming its superior quality.
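As a rough illustration of the kind of caption statistics referenced here (average length and vocabulary richness), a minimal sketch follows; the toy captions are made up for illustration and are not drawn from the released dataset.

```python
# Toy comparison of caption statistics: average token length and distinct-token vocabulary size.
from collections import Counter

def caption_stats(captions):
    vocab = Counter()
    total_tokens = 0
    for cap in captions:
        tokens = cap.lower().split()
        vocab.update(tokens)
        total_tokens += len(tokens)
    return {
        "avg_length": total_tokens / max(len(captions), 1),
        "vocab_size": len(vocab),
    }

original = ["a dog", "shoes sale 50% off"]                        # toy web-crawled captions
recaptioned = ["a brown dog running across a grassy park",        # toy model-generated recaptions
               "a pair of white running shoes displayed on a wooden shelf"]
print(caption_stats(original), caption_stats(recaptioned))
```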

CLIP Performance

The enhanced dataset is used to train CLIP models at various scales, yielding substantial improvements in cross-modal retrieval tasks. The paper finds that training on a mixture of original and recaptioned captions works best, with a 20-50% share of recaptioned data being particularly effective. Enlarging the text encoder further boosts performance across all model scales.
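A minimal sketch of such a caption-mixing strategy is shown below, assuming each training example carries both the original web caption and the recaption; the field names `orig_caption` and `re_caption` are hypothetical.

```python
# Per-image caption sampling: use the recaption with probability `recap_ratio`,
# otherwise fall back to the original web caption (the paper reports 20-50% works well).
import random

def pick_caption(example, recap_ratio=0.3):
    """example: dict with 'orig_caption' and 're_caption' keys (illustrative names)."""
    if random.random() < recap_ratio:
        return example["re_caption"]
    return example["orig_caption"]

batch = [{"orig_caption": "shoes sale", "re_caption": "white running shoes on a shelf"}]
texts = [pick_caption(ex, recap_ratio=0.3) for ex in batch]  # fed to the CLIP text encoder
```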

Text-to-Image Generation

The authors also investigate text-to-image generative models trained on Recap-DataComp-1B. Their evaluations show that models trained on this dataset produce images that align more closely with user text instructions and have higher visual quality. A larger model (DiT-L/2) exhibits improved text understanding and image generation, highlighting the scalability of Recap-DataComp-1B for generative tasks.
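One common way to quantify prompt-image alignment is a CLIP-score style similarity between the prompt and the generated image. The sketch below uses the Hugging Face transformers CLIP API purely for illustration; it is not necessarily the paper's exact evaluation protocol.

```python
# Illustrative CLIP-score style check of how well a generated image matches its prompt.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Scaled cosine similarity between the image and text embeddings; higher means better alignment.
    return out.logits_per_image.item()
```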

Conclusion

The paper effectively demonstrates that fine-tuning a LLaMA-3-powered LLaVA model to recaption large-scale image datasets like DataComp-1B can significantly improve the quality of the textual descriptions. The resulting dataset, Recap-DataComp-1B, offers substantial benefits for training advanced vision-language models, enhancing both cross-modal retrieval and text-to-image generation capabilities. The release of Recap-DataComp-1B is expected to stimulate further research and development in the open-source community, advancing the capabilities of vision-language foundation models.

Authors (12)
  1. Xianhang Li (20 papers)
  2. Haoqin Tu (25 papers)
  3. Mude Hui (8 papers)
  4. Zeyu Wang (137 papers)
  5. Bingchen Zhao (46 papers)
  6. Junfei Xiao (17 papers)
  7. Sucheng Ren (33 papers)
  8. Jieru Mei (26 papers)
  9. Qing Liu (196 papers)
  10. Huangjie Zheng (33 papers)
  11. Yuyin Zhou (92 papers)
  12. Cihang Xie (91 papers)