Analysis of the Massive Multilingual Text Embedding Benchmark (MMTEB)
The paper "MMTEB: Massive Multilingual Text Embedding Benchmark" focuses on addressing significant limitations in the evaluation of multilingual text embedding models. Traditional text embedding benchmarks often suffer from restricted language and task diversity, which limits their applicability across various domains and lower-resourced languages. MMTEB, as proposed, fills this gap by offering a considerable expansion of the Multilingual Text Embedding Benchmark (MTEB), thus providing a more extensive and comprehensive evaluation suite.
Objectives and Scope
The core objective of MMTEB is to evaluate text embeddings on a broader and more diverse set of tasks spanning numerous languages and domains. The benchmark includes over 500 quality-controlled evaluation tasks across 250+ languages, making it the largest multilingual collection of its kind. This coverage aims to ensure that text embeddings are assessed robustly across different contexts, including instruction following, long-document retrieval, and code retrieval.
Methodology
MMTEB leverages a community-driven approach to gather a wide range of tasks. The benchmark also introduces novel evaluation methodology: a downsampling procedure based on inter-task correlation reduces the number of tasks that must be run while preserving the relative ranking of models, and hard-negative sampling produces smaller, more discriminative retrieval splits. Together these steps cut computational cost while keeping the benchmark's model rankings reliable. A simplified sketch of the correlation-based downsampling appears below.
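To make the idea concrete, the following is a minimal, illustrative Python sketch of correlation-based task downsampling. It is not the paper's exact procedure, and the function name and selection rule are hypothetical: given a matrix of per-task scores from a set of reference models, it greedily removes the task whose scores are most redundant with the tasks that remain.

    import numpy as np
    from scipy.stats import spearmanr

    def downsample_tasks(scores: np.ndarray, task_names: list[str], target: int) -> list[str]:
        """Greedily prune tasks by redundancy (illustrative sketch only).

        scores: (n_models, n_tasks) matrix of per-task evaluation scores.
        """
        keep = list(range(scores.shape[1]))
        while len(keep) > target:
            # Spearman correlation between every pair of kept tasks,
            # computed over the reference models' scores.
            corr, _ = spearmanr(scores[:, keep])
            corr = np.atleast_2d(corr)
            np.fill_diagonal(corr, 0.0)
            # The task most correlated with the others adds the least new signal,
            # so it is the next one dropped.
            redundancy = np.abs(corr).mean(axis=1)
            keep.pop(int(np.argmax(redundancy)))
        return [task_names[i] for i in keep]

In this sketch the pruning criterion is average absolute correlation; the actual benchmark construction may weigh tasks differently, but the underlying intuition, removing tasks that contribute little independent information, is the same.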
The paper outlines an open science approach, inviting contributions from a diverse group of experts. This community-driven effort ensures high-quality data collection through systematic tests and reviews by native speakers and domain experts. The tasks included in MMTEB cover domains such as fiction, social media, medical texts, and technical documentation, thereby providing a multi-faceted perspective on model evaluation.
Experimental Insights
MMTEB includes several experiments that demonstrate its utility. Notable findings show that while LLM-based embedding models with billions of parameters achieve state-of-the-art performance on specific language subsets, smaller models such as multilingual-e5-large-instruct, with only 560 million parameters, remain competitive. This suggests that strong multilingual embedding quality does not require billion-parameter models, which in turn lowers the computational cost of evaluation and deployment.
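For context, the sketch below shows how a model of this size might be evaluated with the open-source mteb Python package that accompanies the benchmark. The task selection here is arbitrary and the exact API details may differ between library versions.

    from mteb import MTEB
    from sentence_transformers import SentenceTransformer

    # Load a comparatively small (roughly 560M-parameter) multilingual embedding model.
    model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

    # Any subset of benchmark task names can be listed here; STS22 is one example task.
    evaluation = MTEB(tasks=["STS22"])
    results = evaluation.run(model, output_folder="results/multilingual-e5-large-instruct")

Running the full multilingual suite follows the same pattern with a longer task list; the per-task results written to the output folder can then be aggregated into the benchmark's leaderboard scores.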
An interesting development within this benchmark is the introduction of a zero-shot English benchmark, which preserves model rankings close to those of the full-scale version while reducing computational requirements by roughly 98%. This illustrates how careful benchmark optimization can trade computation for only a minimal loss of evaluation rigor.
Implications and Future Directions
Practically, the implications of MMTEB are significant for the development and evaluation of multilingual embeddings. By providing a standardized and extensive evaluation platform, MMTEB facilitates the comparison and assessment of embedding models across a broad spectrum of languages and tasks. It underscores the importance of handling lower-resourced languages with the same rigor as more widely spoken ones.
Theoretically, MMTEB advances the understanding of multilingual model performance by showcasing the strengths and limitations of diverse architectures under more realistic evaluation scenarios. The benchmark stimulates further research in leveraging smaller, yet effective, models for multilingual applications, particularly in low-resource scenarios. This research trajectory aligns with increasing demands for efficient and accessible machine learning models globally.
Conclusion
The MMTEB initiative represents a critical advancement in the evaluation of text embedding models, offering the most comprehensive multilingual benchmark to date. Through its community-driven development and rigorous methodological approach, it significantly contributes to the field of NLP. Looking forward, the paper points to research directions in optimizing model architectures for diverse linguistic applications and in making future AI systems more inclusive of global languages and their complexities.