Meta CLIP 2: Global Vision-Language Scaling

Updated 1 August 2025
  • The paper introduces a unified multilingual pretraining pipeline that achieves state-of-the-art performance on both English and global benchmarks.
  • It employs worldwide metadata construction with language-specific curation algorithms to effectively balance high- and low-resource languages.
  • Scaling data volume and model capacity with optimized tokenization eliminates the curse of multilinguality in CLIP models.

Meta CLIP 2 is a worldwide scaling recipe for contrastive language–image pretraining, which directly addresses the challenge of extending CLIP's robust vision–language alignment from English-centric to truly global web-scale datasets. By introducing unified curation, scaling, and training strategies, Meta CLIP 2 enables mutual improvements for both English and non-English performance, overcoming the long-standing “curse of multilinguality” and establishing new state-of-the-art results on multilingual vision–language benchmarks (Chuang et al., 29 Jul 2025).

1. Challenges in Scaling CLIP to Worldwide Data

Meta CLIP 2 tackles two critical impediments in global vision–language pretraining:

  • Lack of Multilingual Data Curation: Traditional CLIP pipelines curate training data using English-only metadata (e.g., from WordNet or Wikipedia), discarding roughly half the image–text pairs available on the worldwide web because they are not in English. This exclusion leads to limited cultural and linguistic representativeness and reduces global alignment.
  • Curse of Multilinguality: Prior efforts in multilingual CLIP, such as mSigLIP and SigLIP 2, suffer a decline in English performance—up to 1.5% lower zero-shot top-1 accuracy on ImageNet compared to English-only pretraining—while attempting to support multiple languages within a unified model.

Meta CLIP 2 addresses these by designing multilingual metadata and curation pipelines and by scaling both the training set size and model capacity to eliminate these trade-offs.

2. Training and Data Curation Recipe

The Meta CLIP 2 recipe consists of three principal components designed to efficiently scale CLIP’s training to billions of worldwide images:

  1. Worldwide Metadata Construction:
    • Constructs multilingual metadata from >300 languages, aggregating (i) WordNet for 31 languages, (ii) unigrams/bigrams from Wikipedia in 329 languages, and (iii) Wikipedia titles.
    • For languages lacking whitespace (Chinese, Japanese, Thai), applies rule-based tokenizers to generate suitable substrings for substring matching.
  2. Worldwide Curation Algorithm:
    • Identifies the language of each alt-text, then matches against language-specific metadata via fast substring search (Aho–Corasick algorithm).
    • Computes global metadata counts per language. A “tail-match ratio” threshold (≈6%) is enforced analytically per language, yielding a threshold t_lang that balances “head” and “tail” concepts for each language.
    • Each image–text pair is sampled with probability min(1, t_lang / entry_count), which guarantees that all tail entries are kept and prevents head concepts from dominating, in high- and low-resource languages alike (see the sketch following this list).
  3. Training Adjustments:
    • When non-English data additions grow the raw training pool (e.g., from 12.8B to 29B pairs), the global batch size is scaled by 2.3× so that English instances are seen as frequently as in the English-only case.
    • Larger model capacity is needed to absorb and leverage worldwide data; ViT-H/14 (larger than ViT-L/14) is empirically shown to jointly improve both English and multilingual performance.
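
The per-language balancing step can be illustrated with a short sketch. Below is a minimal Python version of the tail-match thresholding and keep-probability described above, assuming a simple linear threshold search and toy match counts; the function names and numbers are illustrative, not the released Meta CLIP 2 curation code.

```python
def per_language_threshold(entry_counts, tail_match_ratio=0.06):
    """Simplified search for the per-language threshold t_lang: choose the
    smallest match count at which "tail" entries (count <= t_lang) account
    for at least `tail_match_ratio` of all metadata matches in a language.
    The paper derives t_lang analytically; this linear scan is illustrative."""
    total = sum(entry_counts.values())
    for t in sorted(set(entry_counts.values())):
        tail = sum(c for c in entry_counts.values() if c <= t)
        if tail / total >= tail_match_ratio:
            return t
    return max(entry_counts.values())


def keep_probability(entry_count, t_lang):
    """Probability of keeping an image-text pair matched to a metadata entry
    seen `entry_count` times: tail entries are always kept (probability 1),
    head entries are down-sampled in proportion to their frequency."""
    return min(1.0, t_lang / entry_count)


# Toy match counts for one language's metadata entries (hypothetical numbers).
entry_counts = {
    "cat": 50_000, "dog": 30_000, "flower": 8_000,  # frequent (head) concepts
    "tanuki": 300, "onsen": 200, "matcha": 150,     # rare (tail) concepts
}
t_lang = per_language_threshold(entry_counts)
for entry, count in entry_counts.items():
    print(f"{entry:<8} count={count:>6}  p_keep={keep_probability(count, t_lang):.2f}")
```

With these toy counts the two most frequent concepts are down-sampled while everything at or below the threshold is always kept, which is the per-language balancing behavior the recipe relies on.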

3. Empirical Performance and Metrics

Meta CLIP 2 demonstrates its effectiveness across both English and multilingual settings:

| Model | English ImageNet (zero-shot top-1) | Babel-ImageNet (avg acc) | XM3600 (i2t R@1) | CVQA (i2t acc) |
|---|---|---|---|---|
| English-only | 80.5% | – | – | – |
| Meta CLIP 2 | 81.3% | 50.2% | 64.3% | 57.4% |
| mSigLIP | <80.6% | <48.7% | <63.0% | <56.5% |

  • On ImageNet (English), Meta CLIP 2 ViT-H/14 surpasses the English-only baseline by 0.8%.
  • On Babel-ImageNet (average over many languages), it reaches 50.2% zero-shot top-1 accuracy.
  • On XM3600 (multilingual image-to-text retrieval), it reaches 64.3% recall@1 (the metric is sketched after this list).
  • On CVQA (culturally diverse VQA), it reaches 57.4% accuracy in the image-to-text setting.
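
For reference, image-to-text recall@1 as reported on XM3600 is the standard retrieval metric: for each image, the retrieval counts as correct if its own caption is the top-ranked text. A minimal sketch, assuming matched rows of image and text embeddings (the random tensors are placeholders for encoder outputs):

```python
import torch
import torch.nn.functional as F

def image_to_text_recall_at_1(image_emb, text_emb):
    """Image-to-text retrieval R@1: for each image, the retrieval counts as
    correct if its own caption is the highest-scoring text under cosine
    similarity. Assumes row i of both tensors is a matched image-caption pair."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    correct = sims.argmax(dim=1) == torch.arange(len(image_emb))
    return correct.float().mean()

# Toy usage with random features standing in for encoder outputs.
img, txt = torch.randn(100, 768), torch.randn(100, 768)
print(f"R@1 = {image_to_text_recall_at_1(img, txt).item():.2%}")
```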

The empirical success is achieved without system-level confounders such as translation pipelines or architecture modifications, isolating the effect of data curation and scaling alone.

Other technical metrics, including alignment/uniformity measures over a multilingual holdout set, indicate improved embedding quality and reduced cultural bias relative to previous methods such as mSigLIP and SigLIP 2.
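
The alignment and uniformity measures are not spelled out in this summary; assuming the standard contrastive-representation definitions (mean distance between positive image–text pairs, and the log of the average pairwise Gaussian potential over all embeddings), a minimal PyTorch sketch might look as follows. The tensors are random placeholders standing in for a multilingual holdout set.

```python
import torch
import torch.nn.functional as F

def alignment(img_emb, txt_emb, alpha=2):
    """Mean distance between paired, L2-normalized image/text embeddings
    (lower values mean matched pairs sit closer together)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return (img_emb - txt_emb).norm(dim=1).pow(alpha).mean()

def uniformity(emb, t=2):
    """Log of the average pairwise Gaussian potential over L2-normalized
    embeddings (lower values mean the embeddings spread more uniformly)."""
    emb = F.normalize(emb, dim=-1)
    return torch.pdist(emb, p=2).pow(2).mul(-t).exp().mean().log()

# Random placeholders standing in for a multilingual holdout set.
img = torch.randn(512, 768)
txt = img + 0.1 * torch.randn(512, 768)   # loosely "aligned" toy pairs
print(alignment(img, txt).item(), uniformity(torch.cat([img, txt])).item())
```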

4. Ablations and Minimal Necessary Modifications

Systematic ablations validate each component of the recipe:

  • Language Isolation: Mixing metadata/alt-texts without language separation degrades both English and multilingual accuracy.
  • Language-specific Thresholds: Using a single English threshold degrades performance for low-resource languages; computing a per-language t_lang from the tail-match ratio is critical.
  • Tokenization Choices: An encoder vocabulary from XLM-V (~900k tokens) yields the highest accuracy across English and non-English benchmarks, compared with mT5, Gemma, or XLM-R.
  • Scaling Seen Pairs: Keeping the number of English (and all other languages') samples seen constant, which requires growing the global batch size as non-English data is added, is necessary to avoid sacrificing English performance (see the worked example after this list).
  • Model Scale: With ViT-L/14, the curse of multilinguality persists at large scale; only ViT-H/14 (with greater capacity) achieves mutually boosted English and worldwide performance.
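
The seen-pairs bookkeeping behind the 2.3× batch scaling in Section 2 reduces to a simple proportion; the numbers below are taken from the text, and the snippet is only a worked illustration, not the paper's training code.

```python
# Raw pool sizes quoted in the text (image-text pairs).
english_pairs = 12.8e9     # English-only pool
worldwide_pairs = 29e9     # pool after adding non-English data

# English pairs now form a smaller fraction of every batch, so the global
# batch must grow by the same factor for English samples to be seen as
# often per step as in the English-only run.
scale = worldwide_pairs / english_pairs
print(f"required global batch scaling: {scale:.2f}x")   # ~2.27x, i.e. ~2.3x
```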

5. State-of-the-Art Results and Broader Benchmarks

Meta CLIP 2 sets new performance records in diverse evaluation settings, including:

  • Babel–ImageNet: 50.2% average zero-shot classification, highest among unified models.
  • XM3600: 64.3% R@1 for image-to-text retrieval.
  • GeoDE/Dollar Street (appendix): Improvements observed in culturally diverse and geo-specific benchmarks.
  • CVQA: Leading results on the culturally diverse question-answering benchmark.

No previous multilingual CLIP, including mSigLIP or SigLIP 2, demonstrates simultaneous and substantial improvements in both English and non-English regimes as Meta CLIP 2 does with a unified architecture and minimal algorithmic modification.

6. Implications, Applications, and Future Directions

The Meta CLIP 2 recipe offers a blueprint for the next generation of globally representative vision–language models:

  • Global Data Inclusion: By training on the entire worldwide multimodal web, cultural and regional representativeness is improved, informing more universally applicable models.
  • Streamlined Model Deployment: Unified, language-agnostic models obviate the need for region-specific variants or bespoke translation infrastructure.
  • Blueprint for Foundation Models: Open-sourced metadata, language-aware balancing, and capacity scaling are immediately applicable to ongoing multimodal LLM developments, such as instruction-tuned models (e.g., LLaVA-NeXT, Qwen-VL).
  • Enabling Localization: Improved geo-specific and culturally tailored performance (as shown in appended benchmarks) supports future research in geo-localization, region-aware search, and culturally fluent MLLMs.
  • Scalability: Adaptive curation and multilingual pipeline design make further extension to underrepresented and low-resource languages practical; open-sourcing these recipes and resources lowers entry barriers for follow-up work.

The key finding—that scaling model capacity and curation with language-aware design not only eliminates the curse of multilinguality but unlocks mutual benefit—suggests that future foundation models can embrace truly global web-scale multimodal data to produce unimpaired and even superior performance in both English and non-English applications. Meta CLIP 2 thus marks a significant development in the global scaling of vision–language pretraining, setting new standards for inclusivity, scalability, and multilingual robustness.
