This paper, "Demystifying CLIP Data" (Xu et al., 2023 ), investigates the crucial role of data curation in the success of Contrastive Language-Image Pre-training (CLIP) models. The authors argue that CLIP's performance stems primarily from its high-quality, albeit proprietary, WIT400M dataset, rather than the model architecture or training objective. The original CLIP paper provided limited details on data curation, leading subsequent works like LAION to rely on filtering data using pre-trained CLIP models, effectively distilling information rather than replicating the original curation process.
To address this lack of transparency, the paper introduces Metadata-Curated Language-Image Pre-training (MetaCLIP), aiming to replicate and open-source CLIP's likely curation strategy. The core idea is to filter and balance a raw data pool using metadata derived from CLIP's concepts.
MetaCLIP Curation Process:
- Metadata Construction: The authors first reconstruct the metadata (the "queries" or "entries" referred to by CLIP) used for curation. This metadata comprises approximately 500,000 entries drawn from four sources (a construction sketch follows this list):
- All WordNet synsets.
- Frequent unigrams (count >= 100) from English Wikipedia.
- High Pointwise Mutual Information (PMI >= 30) bigrams from English Wikipedia.
- High-traffic (view frequency >= 70) Wikipedia article titles.
- Sub-string Matching: A raw pool of image-text pairs (e.g., from CommonCrawl) is processed. Pairs are kept only if their text caption contains at least one of the metadata entries as a sub-string. This step acts as an initial quality filter, removing pairs with low-quality or irrelevant text (e.g., date strings, IDs) without explicit rules. Approximately 50% of English text pairs are retained.
- Balancing (The Key Step): The distribution of image-text pairs matched per metadata entry is highly long-tailed (a few entries match millions of pairs, many match very few). CLIP addressed this by limiting the number of pairs per entry to a maximum threshold, t (estimated to be 20,000 for WIT400M). MetaCLIP replicates this:
- Entries matching fewer than t pairs keep all associated pairs (tail entries).
- Entries matching more than t pairs are sub-sampled down to t pairs (head entries).
- This balancing significantly reduces noise from overly common terms (e.g., "photo", "image") and diversifies the dataset by boosting the relative representation of rarer concepts (tail entries). Pairs whose text matches multiple entries have a higher chance of being selected.
- Scalable Algorithm: The paper presents a practical algorithm (Algorithm 1) that implements this curation without needing to build a potentially massive inverted index (mapping entries to all their matching pairs).
- Part 1: Calculate the total match count for each metadata entry across the raw data pool (entry_count).
- Part 2: Calculate a sampling probability for each entry, entry_prob[entry_id] = t / max(entry_count[entry_id], t), which equals 1 for tail entries and t / count for head entries; with t = 20,000, for example, an entry matched by 2,000,000 pairs is sampled at probability 20,000 / 2,000,000 = 1%. Then iterate through each image-text pair in the raw pool and, for each pair, through its matched metadata entries: if a random draw falls below the sampling probability of any matched entry (random.random() < entry_prob[entry_id]), the pair is kept; otherwise it is discarded.
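To make the Metadata Construction step concrete, here is a minimal sketch of assembling the four entry sources. It is an illustration rather than the authors' released code: wiki_unigram_counts, wiki_bigram_pmi, and wiki_title_views are hypothetical dictionaries assumed to be precomputed from an English Wikipedia dump, and only the WordNet part relies on a real library (NLTK).

```python
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def build_metadata(wiki_unigram_counts, wiki_bigram_pmi, wiki_title_views):
    """Sketch of metadata construction; the three wiki_* dicts are assumed inputs."""
    metadata = set()

    # 1. All WordNet synsets, represented by their lemma names.
    for synset in wordnet.all_synsets():
        for lemma in synset.lemma_names():
            metadata.add(lemma.replace("_", " "))

    # 2. English Wikipedia unigrams occurring at least 100 times.
    metadata.update(w for w, count in wiki_unigram_counts.items() if count >= 100)

    # 3. Wikipedia bigrams with pointwise mutual information >= 30.
    metadata.update(b for b, pmi in wiki_bigram_pmi.items() if pmi >= 30)

    # 4. Wikipedia article titles with view frequency >= 70.
    metadata.update(t for t, views in wiki_title_views.items() if views >= 70)

    return sorted(metadata)  # roughly 500,000 entries in the paper's reconstruction
```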
Implementation and Experiments:
- The authors applied MetaCLIP to CommonCrawl data, creating datasets of 400M, 1B, and 2.5B pairs.
- They strictly followed the original CLIP training setup (ViT models, batch size, total seen pairs fixed at 12.8B) to isolate the effect of data.
- Evaluations were performed on zero-shot ImageNet classification and a suite of 26 other benchmarks (plus a 38-task benchmark from DataComp).
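For readers unfamiliar with the protocol, CLIP-style zero-shot classification builds a text classifier from class-name prompts and assigns each image to the class whose text embedding is most similar to its image embedding. The sketch below assumes a CLIP-like model exposing encode_image and encode_text plus a matching tokenizer (as in open-source CLIP implementations); the single prompt template is a placeholder for the full prompt ensemble used in the paper's evaluation.

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, tokenizer, images, class_names,
                       template="a photo of a {}."):
    # Build one text embedding per class from the prompt template.
    prompts = [template.format(name) for name in class_names]
    text_features = model.encode_text(tokenizer(prompts))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Embed and normalize the (already preprocessed) image batch.
    image_features = model.encode_image(images)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between images and class texts; argmax gives the prediction.
    logits = image_features @ text_features.T
    return logits.argmax(dim=-1)
```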
The curation procedure described above is summarized by Algorithm 1, reconstructed here as runnable Python (D is the raw image-text pool, M the metadata, and t the balancing threshold):

```python
import random

D_star = []
entry_count = substr_matching(D, M)  # Part 1: match count per metadata entry; needs an efficient implementation
entry_prob = {}
for entry_id, count in entry_count.items():
    # Treat entries with count < t as if count == t, so tail entries get probability 1.
    capped_count = max(count, t)
    entry_prob[entry_id] = t / capped_count  # t / count for head entries, 1 for tail entries

for image, text in D:
    # text.matched_entry_ids lists the metadata entries matched during sub-string matching.
    keep_pair = False
    for entry_id in text.matched_entry_ids:
        if entry_id in entry_prob and random.random() < entry_prob[entry_id]:
            keep_pair = True
            break  # keep the pair as soon as one matched entry passes its draw
    if keep_pair:
        D_star.append((image, text))
```
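The substr_matching call above is left abstract ("needs an efficient implementation"). One way to make it scale to ~500k entries is multi-pattern matching with an Aho-Corasick automaton, so each caption is scanned once regardless of how many entries it contains. The sketch below uses the pyahocorasick package and a hypothetical pair format (text.caption, text.matched_entry_ids); it is one possible implementation, not the authors' pipeline, and it ignores details such as lowercasing.

```python
import ahocorasick  # pip install pyahocorasick

def substr_matching(D, M):
    """Annotate each pair with matched entry ids and return per-entry match counts."""
    automaton = ahocorasick.Automaton()
    for entry_id, entry in enumerate(M):
        automaton.add_word(entry, entry_id)
    automaton.make_automaton()

    entry_count = {}
    for image, text in D:  # text.caption is assumed to hold the raw caption string
        matched = {entry_id for _, entry_id in automaton.iter(text.caption)}
        text.matched_entry_ids = list(matched)
        for entry_id in matched:
            entry_count[entry_id] = entry_count.get(entry_id, 0) + 1
    return entry_count
```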
Key Findings:
- MetaCLIP outperforms CLIP: MetaCLIP-400M data trained with ViT-B/16 achieves 70.8% zero-shot ImageNet accuracy, compared to 68.3% for CLIP's WIT400M. Similar gains are observed across model sizes and average benchmark performance. MetaCLIP also outperforms LAION-400M data.
- Balancing is crucial: Without balancing (t = ∞), performance drops significantly, even below raw-data baselines, despite using more matched data. Human studies in the paper's curation-analysis appendix show that balancing also improves data-quality metrics (image quality, text quality, and image-text alignment). The threshold t = 20k used by CLIP appears optimal at the 400M scale.
- Scaling works: Scaling MetaCLIP data to 1B and 2.5B (while keeping training compute constant) further improves performance, reaching 79.2% on ImageNet with ViT-L/14 and 82.1% with ViT-bigG/14 on the 2.5B dataset.
- Efficiency: The MetaCLIP algorithm allows efficient integration into data processing pipelines, enabling filtering and balancing before expensive steps like image downloading, drastically reducing storage and transfer needs. Online balancing during data loading is also effective.
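As a sketch of the online-balancing idea (again an illustration, not the released pipeline), the same entry_prob table can be applied lazily while streaming metadata-matched records, so images are downloaded and stored only for pairs that survive curation; the record format here is a hypothetical assumption.

```python
import random

def curated_stream(raw_records, entry_prob):
    """Yield only the records that survive balancing, e.g. inside a data loader.

    raw_records is assumed to yield (image_url, caption, matched_entry_ids)
    tuples produced by the sub-string matching stage; entry_prob is the
    per-entry sampling-probability table from Algorithm 1.
    """
    for image_url, caption, matched_entry_ids in raw_records:
        if any(random.random() < entry_prob[e] for e in matched_entry_ids):
            yield image_url, caption  # fetch the image only for kept records
```

Because the random draws happen before any image is fetched, the expensive download and storage steps touch only data that will actually be trained on.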
Conclusion:
The paper successfully demystifies CLIP's data curation, presenting a reproducible method (MetaCLIP) based on metadata matching and balancing. It demonstrates that this curation strategy, applied to publicly available CommonCrawl data, yields datasets that outperform CLIP's original proprietary data. The work highlights the critical importance of careful, distribution-aware data curation for training large vision-language models and provides the tools for the community to create such datasets openly. The code and data-distribution details are made available.