Improving Multimodal Datasets with Image Captioning
The paper "Improving Multimodal Datasets with Image Captioning" presents an extensive investigation into enhancing the quality of noisy image-text datasets used for training vision-LLMs, specifically focusing on the potential of synthetic captions generated by image captioning models. The authors argue that while traditional filtering methods effectively reduce noise within raw datasets sourced from the web, they often compromise data diversity—a crucial aspect for robust multimodal model training.
Content Overview
The research centers on leveraging synthetic captions to recover the value of data that conventional filtering approaches would otherwise discard. Using the DataComp benchmark as a controlled environment, the authors conduct rigorous experiments with captioning models such as BLIP2 and OpenCLIP-CoCa, assessing how well their generated captions improve the utility of the training data. Intriguingly, the paper finds that fine-tuning captioning models to score better on established image captioning benchmarks (e.g., higher CIDEr) does not translate into more useful captions for multimodal training. Instead, reference-free metrics such as CLIP-S, which measure how well a caption aligns with the image content, are a better indicator of downstream utility.
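To make the CLIP-S idea concrete, the snippet below computes the cosine similarity between CLIP image and text embeddings for a single image-caption pair. This is a minimal sketch assuming the Hugging Face transformers CLIP API; the checkpoint name is illustrative and the paper's exact scoring and filtering setup is not reproduced here.

```python
# Minimal CLIP-based caption-alignment score (a CLIP-S-style measure).
# Assumes the Hugging Face `transformers` CLIP implementation; the
# checkpoint name is an illustrative choice, not the paper's exact model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def caption_alignment(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (higher = better aligned)."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # The returned embeddings are projected into the shared space; normalize to unit length.
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1).item()
```

A score like this can rank captions for the same image (e.g., raw alt-text versus a BLIP2 caption) or serve as a filtering signal, which is the role such alignment metrics play in the experiments described next.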
Experimental Methodology
Their methodology involves multiple data preprocessing strategies for integrating raw and synthetic captions. By exploring different combinations of these sources, such as mixing filtered raw captions with additional synthetic text, the researchers identify configurations that outperform existing filtering baselines. These experiments are conducted across several dataset scales (12.8M, 128M, and 1.28B image-text pairs), revealing nuanced insights into the interplay between caption quality and dataset size. Notably, synthetic captions yield significant gains in zero-shot retrieval performance, highlighting the benefit of improving caption fidelity.
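The sketch below illustrates one such mixing strategy at a high level: keep raw captions that pass an alignment filter and re-caption the remaining images with a captioning model. The field names, threshold, and the `clip_score` and `generate_caption` helpers are hypothetical placeholders, not the paper's exact pipeline.

```python
# A simplified raw/synthetic caption mixing strategy (illustrative, not the paper's code).
from typing import Callable, Dict, List

def mix_raw_and_synthetic(
    samples: List[Dict],                        # each: {"image": ..., "raw_caption": str}
    clip_score: Callable[[object, str], float], # e.g., the alignment score sketched above
    generate_caption: Callable[[object], str],  # e.g., a BLIP2 captioner
    threshold: float = 0.3,                     # hypothetical cutoff
) -> List[Dict]:
    """Keep well-aligned raw captions; fall back to synthetic captions otherwise."""
    mixed = []
    for sample in samples:
        if clip_score(sample["image"], sample["raw_caption"]) >= threshold:
            caption = sample["raw_caption"]            # raw text passes the filter
        else:
            caption = generate_caption(sample["image"])  # image is re-captioned synthetically
        mixed.append({"image": sample["image"], "caption": caption})
    return mixed
```

Variants of this idea (keeping both captions for some images, or captioning the entire unfiltered pool) correspond to the different configurations the authors compare across scales.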
Key Findings
Synthetic captions dramatically improve retrieval capabilities, a critical requirement for multimodal models. While individual synthetic captions are less noisy and more informative than their raw counterparts, they exhibit lower diversity across the dataset as a whole. Optimal performance is therefore achieved by combining both caption types. Additionally, as scale increases, image curation becomes more important, suggesting that text and image quality need to be refined in tandem to sustain improvements.
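The diversity gap can be made concrete with a simple lexical statistic, such as the fraction of unique n-grams in each caption pool. The tokenization and metric below are illustrative, not the paper's exact analysis.

```python
# Rough lexical-diversity check: unique n-grams over total n-grams in a caption pool.
# Lower values indicate more repetitive (less diverse) text, as reported for synthetic captions.
from typing import Iterable

def unique_ngram_ratio(captions: Iterable[str], n: int = 2) -> float:
    """Return unique n-grams divided by total n-grams across all captions."""
    total = 0
    unique = set()
    for caption in captions:
        tokens = caption.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i : i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Example usage with hypothetical caption pools:
# print(unique_ngram_ratio(raw_captions), unique_ngram_ratio(synthetic_captions))
```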
Implications and Future Directions
The research has significant implications for the design of future multimodal datasets, positioning synthetic captions as a viable way to scale training data beyond the limits of web-scraped alt-text. This approach may unlock previously unusable images and enable the creation of more comprehensive datasets without extensive human annotation. Future work could address the text diversity gap at larger scales and further improve image and caption quality, while accounting for biases inherent in synthetic data generation.
The findings from this paper contribute valuable knowledge to the field, providing pathways to efficiently harness synthetic data for robust, large-scale model training, and marking a step forward in improving the utility of models trained on multimodal datasets.