This paper, "Demystifying CLIP Data" (Xu et al., 2023 ), investigates the crucial role of data curation in the success of Contrastive Language-Image Pre-training (CLIP) models. The authors argue that CLIP's performance stems primarily from its high-quality, albeit proprietary, WIT400M dataset, rather than the model architecture or training objective. The original CLIP paper provided limited details on data curation, leading subsequent works like LAION to rely on filtering data using pre-trained CLIP models, effectively distilling information rather than replicating the original curation process.
To address this lack of transparency, the paper introduces Metadata-Curated Language-Image Pre-training (MetaCLIP), aiming to replicate and open-source CLIP's likely curation strategy. The core idea is to filter and balance a raw data pool using metadata derived from CLIP's concepts.
MetaCLIP Curation Process:
- Metadata Construction: The authors first reconstruct the metadata (the "queries" or "entries" referred to by CLIP) used for curation. This metadata comprises approximately 500,000 entries drawn from four sources (a construction sketch follows this list):
- All WordNet synsets.
- Frequent unigrams (count >= 100) from English Wikipedia.
- High Pointwise Mutual Information (PMI >= 30) bigrams from English Wikipedia.
- High-traffic (view frequency >= 70) Wikipedia article titles.
- Sub-string Matching: A raw pool of image-text pairs (e.g., from CommonCrawl) is processed. Pairs are kept only if their text caption contains at least one of the metadata entries as a sub-string. This step acts as an initial quality filter, removing pairs with low-quality or irrelevant text (e.g., date strings, IDs) without explicit rules. Approximately 50% of English text pairs are retained.
- Balancing (The Key Step): The distribution of image-text pairs matched per metadata entry is highly long-tailed (a few entries match millions of pairs, many match very few). CLIP addressed this by limiting the number of pairs per entry to a maximum threshold, t (estimated to be 20,000 for WIT400M). MetaCLIP replicates this:
- Entries matching fewer than t pairs keep all associated pairs (tail entries).
- Entries matching more than t pairs are sub-sampled down to t pairs (head entries).
- This balancing significantly reduces noise from overly common terms (e.g., "photo", "image") and diversifies the dataset by boosting the relative representation of rarer concepts (tail entries). Pairs whose text matches multiple entries have a higher chance of being selected.
- Scalable Algorithm: The paper presents a practical algorithm (Algorithm 1) that implements this curation without needing to build a potentially massive inverted index (mapping entries to all their matching pairs).
- Part 1: Calculate the total match count for each metadata entry across the raw data pool (entry_count).
- Part 2: Calculate a sampling probability for each entry, entry_prob[entry_id] = t / max(entry_count[entry_id], t), which equals 1 for tail entries and t / count for head entries; with t = 20,000, for example, an entry matched by 2,000,000 pairs is sampled at probability 20,000 / 2,000,000 = 1%. Then iterate through each image-text pair in the raw pool and, for each pair, through its matched metadata entries: if a random draw falls below the sampling probability of any matched entry (random.random() < entry_prob[entry_id]), the pair is kept; otherwise it is discarded.
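To make the Metadata Construction step concrete, here is a minimal sketch of assembling the four entry sources. It is an illustration rather than the authors' released code: wiki_unigram_counts, wiki_bigram_pmi, and wiki_title_views are hypothetical dictionaries assumed to be precomputed from an English Wikipedia dump, and only the WordNet part relies on a real library (NLTK).

```python
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def build_metadata(wiki_unigram_counts, wiki_bigram_pmi, wiki_title_views):
    """Sketch of metadata construction; the three wiki_* dicts are assumed inputs."""
    metadata = set()

    # 1. All WordNet synsets, represented by their lemma names.
    for synset in wordnet.all_synsets():
        for lemma in synset.lemma_names():
            metadata.add(lemma.replace("_", " "))

    # 2. English Wikipedia unigrams occurring at least 100 times.
    metadata.update(w for w, count in wiki_unigram_counts.items() if count >= 100)

    # 3. Wikipedia bigrams with pointwise mutual information >= 30.
    metadata.update(b for b, pmi in wiki_bigram_pmi.items() if pmi >= 30)

    # 4. Wikipedia article titles with view frequency >= 70.
    metadata.update(t for t, views in wiki_title_views.items() if views >= 70)

    return sorted(metadata)  # roughly 500,000 entries in the paper's reconstruction
```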
Implementation and Experiments:
- The authors applied MetaCLIP to CommonCrawl data, creating datasets of 400M, 1B, and 2.5B pairs.
- They strictly followed the original CLIP training setup (ViT models, batch size, total seen pairs fixed at 12.8B) to isolate the effect of data.
- Evaluations were performed on zero-shot ImageNet classification and a suite of 26 other benchmarks (plus a 38-task benchmark from DataComp).
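For readers unfamiliar with the protocol, CLIP-style zero-shot classification builds a text classifier from class-name prompts and assigns each image to the class whose text embedding is most similar to its image embedding. The sketch below assumes a CLIP-like model exposing encode_image and encode_text plus a matching tokenizer (as in open-source CLIP implementations); the single prompt template is a placeholder for the full prompt ensemble used in the paper's evaluation.

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, tokenizer, images, class_names,
                       template="a photo of a {}."):
    # Build one text embedding per class from the prompt template.
    prompts = [template.format(name) for name in class_names]
    text_features = model.encode_text(tokenizer(prompts))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Embed and normalize the (already preprocessed) image batch.
    image_features = model.encode_image(images)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between images and class texts; argmax gives the prediction.
    logits = image_features @ text_features.T
    return logits.argmax(dim=-1)
```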
The curation procedure described above is summarized by Algorithm 1, reconstructed here as runnable Python (D is the raw image-text pool, M the metadata, and t the balancing threshold):

```python
import random

D_star = []
entry_count = substr_matching(D, M)  # Part 1: match count per metadata entry; needs an efficient implementation
entry_prob = {}
for entry_id, count in entry_count.items():
    # Treat entries with count < t as if count == t, so tail entries get probability 1.
    capped_count = max(count, t)
    entry_prob[entry_id] = t / capped_count  # t / count for head entries, 1 for tail entries

for image, text in D:
    # text.matched_entry_ids lists the metadata entries matched during sub-string matching.
    keep_pair = False
    for entry_id in text.matched_entry_ids:
        if entry_id in entry_prob and random.random() < entry_prob[entry_id]:
            keep_pair = True
            break  # keep the pair as soon as one matched entry passes its draw
    if keep_pair:
        D_star.append((image, text))
```
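The substr_matching call above is left abstract ("needs an efficient implementation"). One way to make it scale to ~500k entries is multi-pattern matching with an Aho-Corasick automaton, so each caption is scanned once regardless of how many entries it contains. The sketch below uses the pyahocorasick package and a hypothetical pair format (text.caption, text.matched_entry_ids); it is one possible implementation, not the authors' pipeline, and it ignores details such as lowercasing.

```python
import ahocorasick  # pip install pyahocorasick

def substr_matching(D, M):
    """Annotate each pair with matched entry ids and return per-entry match counts."""
    automaton = ahocorasick.Automaton()
    for entry_id, entry in enumerate(M):
        automaton.add_word(entry, entry_id)
    automaton.make_automaton()

    entry_count = {}
    for image, text in D:  # text.caption is assumed to hold the raw caption string
        matched = {entry_id for _, entry_id in automaton.iter(text.caption)}
        text.matched_entry_ids = list(matched)
        for entry_id in matched:
            entry_count[entry_id] = entry_count.get(entry_id, 0) + 1
    return entry_count
```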
Key Findings:
- MetaCLIP outperforms CLIP: MetaCLIP-400M data trained with ViT-B/16 achieves 70.8% zero-shot ImageNet accuracy, compared to 68.3% for CLIP's WIT400M. Similar gains are observed across model sizes and average benchmark performance. MetaCLIP also outperforms LAION-400M data.
- Balancing is crucial: Without balancing (t = ∞), performance drops significantly, even below raw-data baselines, despite using more matched data. Human studies in the paper's curation-analysis appendix show that balancing also improves data-quality metrics (image quality, text quality, and image-text alignment). The threshold t = 20k used by CLIP appears optimal at the 400M scale.
- Scaling works: Scaling MetaCLIP data to 1B and 2.5B (while keeping training compute constant) further improves performance, reaching 79.2% on ImageNet with ViT-L/14 and 82.1% with ViT-bigG/14 on the 2.5B dataset.
- Efficiency: The MetaCLIP algorithm allows efficient integration into data processing pipelines, enabling filtering and balancing before expensive steps like image downloading, drastically reducing storage and transfer needs. Online balancing during data loading is also effective.
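As a sketch of the online-balancing idea (again an illustration, not the released pipeline), the same entry_prob table can be applied lazily while streaming metadata-matched records, so images are downloaded and stored only for pairs that survive curation; the record format here is a hypothetical assumption.

```python
import random

def curated_stream(raw_records, entry_prob):
    """Yield only the records that survive balancing, e.g. inside a data loader.

    raw_records is assumed to yield (image_url, caption, matched_entry_ids)
    tuples produced by the sub-string matching stage; entry_prob is the
    per-entry sampling-probability table from Algorithm 1.
    """
    for image_url, caption, matched_entry_ids in raw_records:
        if any(random.random() < entry_prob[e] for e in matched_entry_ids):
            yield image_url, caption  # fetch the image only for kept records
```

Because the random draws happen before any image is fetched, the expensive download and storage steps touch only data that will actually be trained on.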
Conclusion:
The paper successfully demystifies CLIP's data curation, presenting a reproducible method (MetaCLIP) based on metadata matching and balancing. It demonstrates that this curation strategy, applied to publicly available CommonCrawl data, yields datasets that outperform CLIP's original proprietary data. The work highlights the critical importance of careful, distribution-aware data curation for training large vision-language models and provides the tools for the community to create such datasets openly. The code and data-distribution details are made available.