DataComp: In search of the next generation of multimodal datasets (2304.14108v5)

Published 27 Apr 2023 in cs.CV, cs.CL, and cs.LG

Abstract: Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.

DataComp: In Search of the Next Generation of Multimodal Datasets

The paper "DataComp: In Search of the Next Generation of Multimodal Datasets" addresses a critical yet often under-researched component of machine learning advancements: the development of large-scale multimodal datasets consisting of paired image-text samples. While significant progress in AI has been achieved through the design of sophisticated model architectures and training algorithms, the dataset component remains less explored. DataComp seeks to fill this gap by presenting a benchmark designed to evaluate dataset curation techniques, particularly focusing on multimodal datasets that feed into powerful models like CLIP, Stable Diffusion, and GPT-4.

The benchmark introduces a candidate pool of 12.8 billion image-text pairs retrieved from Common Crawl, creating a playground for dataset experimentation in which researchers test filtering techniques or curate new data sources. Participants design a dataset, train a model with the standardized CLIP training code, and then evaluate the resulting model on a diverse suite of 38 downstream tasks. Multiple compute scales, spanning four orders of magnitude, keep the benchmark accessible to researchers with varying resources and enable the study of scaling trends.
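As a rough illustration of this workflow, the sketch below outlines the three stages of the filtering track. The function names and the example keep-rule are hypothetical stand-ins, not DataComp's actual tooling; only the curation step is under the participant's control.

```python
# Hypothetical sketch of the DataComp filtering-track workflow.
# Function names and the example keep-rule are illustrative stand-ins,
# not the benchmark's actual tooling.

def curate_subset(candidate_pool, keep_fn):
    """Participant-controlled step: choose which candidate pairs to keep."""
    return [sample for sample in candidate_pool if keep_fn(sample)]

def train_standardized_clip(subset, scale):
    """Fixed step: CLIP is trained with standardized code at a fixed compute
    scale, so only the training data differs between submissions."""
    raise NotImplementedError("placeholder for the benchmark's training run")

def evaluate_downstream(model):
    """Fixed step: zero-shot evaluation on the 38 downstream test sets."""
    raise NotImplementedError("placeholder for the evaluation suite")

# Example (arbitrary) keep-rule: English captions of a minimum length.
def example_keep_fn(sample):
    return sample.get("lang") == "en" and len(sample.get("caption", "")) >= 20
```

The key design choice is that training and evaluation are frozen, which isolates the dataset as the only experimental variable.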

The methodology involves several key innovations:

  1. Filtering Techniques: The paper explores a range of filtering strategies for dataset construction, from simple heuristics such as language and size filtering to model-based filtering using CLIP similarity scores (see the sketch after this list), providing a quantitative assessment of how data curation choices affect downstream performance.
  2. Baseline Comparisons: Baselines provide a ranking of different filtering and curation methods. Importantly, the baseline experiments demonstrate that smaller but more stringently filtered datasets can outperform larger, less curated ones, challenging the assumption that more data is always better.
  3. BYOD (Bring Your Own Data) Track: This track complements the filtering track by encouraging the combination of CommonPool data with other publicly available datasets, facilitating investigation into how different data sources combine.
  4. Rigorous Evaluation Framework: DataComp uses a zero-shot evaluation scheme, testing models on tasks they were never trained on to measure inherent understanding and adaptability. This provides a direct window into a model's generalization, which is critical for real-world applications (a minimal zero-shot example appears after the results discussion below).
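To make the CLIP-score filtering mentioned in item 1 concrete, here is a minimal sketch using the open-source open_clip package: it embeds an image and its caption, computes their cosine similarity, and keeps the pair only if the score clears a threshold. The model choice and threshold are illustrative assumptions, not the paper's exact baseline configuration.

```python
import torch
import open_clip
from PIL import Image

# Illustrative choices: DataComp's baselines filter with CLIP similarity
# scores, but the specific model and threshold here are placeholders.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between the image and caption embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

THRESHOLD = 0.28  # illustrative cutoff, not a tuned value from the paper

def keep(image_path: str, caption: str) -> bool:
    """Keep a candidate pair only if its CLIP score clears the threshold."""
    return clip_score(image_path, caption) >= THRESHOLD
```

In practice, such thresholds are often set to retain a target fraction of the pool by score rather than fixed in absolute terms.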

The research demonstrates strong numerical results for data curation, evidenced by significant improvements in CLIP model performance driven by the data alone. The best curated dataset, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, a 3.7-percentage-point improvement over OpenAI's CLIP ViT-L/14 trained with the same procedure and compute, showing that a better-curated training set can deliver significant performance gains.
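For context on what "zero-shot accuracy" means in this evaluation (item 4 above): a zero-shot classifier is built by embedding one text prompt per class and assigning each image to the nearest prompt embedding, with no training on the target task. The sketch below, again with open_clip, is a minimal illustration under assumed class names and a single prompt template; DataComp's evaluation suite uses its own standardized prompts across the 38 test sets.

```python
import torch
import open_clip
from PIL import Image

# Minimal zero-shot classification sketch (placeholder class names and a
# single prompt template; not DataComp's standardized evaluation templates).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["dog", "cat", "airplane"]  # placeholder label set
prompts = tokenizer([f"a photo of a {c}" for c in class_names])

with torch.no_grad():
    text_feat = model.encode_text(prompts)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

def classify(image_path: str) -> str:
    """Return the class whose prompt embedding is closest to the image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        logits = img_feat @ text_feat.T
    return class_names[logits.argmax(dim=-1).item()]
```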

Implications and Future Directions

The implications of this work are broad, affecting both practical and theoretical dimensions of AI research. On a practical level, DataComp provides a structured benchmark that fosters replicable, rigorous experimentation in dataset design across a range of compute scales. This lowers the barrier to participation and should accelerate progress across the community.

Theoretically, the finding that smaller, high-fidelity datasets can outperform larger ones invites a reconsideration of commonly accepted data scaling assumptions. Furthermore, the focus on dataset curation highlights potential biases inherent in large-scale data collection, especially when datasets are scrutinized with the same rigor that model architectures have long received.

As for future directions, DataComp establishes a promising scaffold on which the community can build. By opening pathways to a better understanding of data curation, it lays groundwork that subsequent advances in model training are likely to build on. Moreover, the scalability of DataComp's approach allows broader modalities to be introduced, potentially expanding beyond image-text pairs to video, audio, and beyond.

DataComp's introduction positions it as a flagship effort in the data-centric AI movement, pushing for the next wave of AI improvements to emerge not simply from algorithmic novelty but from rigorous investigation of the often-overlooked scaffolding of AI: the dataset. In doing so, it marks a shift toward systematically data-driven machine learning.

Authors (34)
  1. Samir Yitzhak Gadre (12 papers)
  2. Gabriel Ilharco (26 papers)
  3. Alex Fang (13 papers)
  4. Jonathan Hayase (20 papers)
  5. Georgios Smyrnis (6 papers)
  6. Thao Nguyen (41 papers)
  7. Ryan Marten (5 papers)
  8. Mitchell Wortsman (29 papers)
  9. Dhruba Ghosh (7 papers)
  10. Jieyu Zhang (63 papers)
  11. Eyal Orgad (3 papers)
  12. Rahim Entezari (11 papers)
  13. Giannis Daras (23 papers)
  14. Sarah Pratt (8 papers)
  15. Vivek Ramanujan (17 papers)
  16. Yonatan Bitton (36 papers)
  17. Kalyani Marathe (4 papers)
  18. Stephen Mussmann (15 papers)
  19. Richard Vencu (3 papers)
  20. Mehdi Cherti (16 papers)
Citations (293)