MMTEB: Massive Multilingual Text Embedding Benchmark
- MMTEB is a comprehensive, language-agnostic evaluation suite covering over 250 languages and 500 tasks for text embedding models.
- It employs diverse tasks such as retrieval, classification, clustering, and semantic similarity with standardized metrics for robust model comparison.
- MMTEB drives model innovation by enabling reproducible evaluations, continuous leaderboards, and cross-lingual performance improvement.
The Massive Multilingual Text Embedding Benchmark (MMTEB) is a comprehensive, language-agnostic evaluation suite designed to assess the performance of off-the-shelf text embedding models across a wide range of tasks and languages. MMTEB supersedes the earlier Massive Text Embedding Benchmark (MTEB), which was primarily English-centric, by expanding coverage to over 250 languages and more than 500 tasks. The benchmark spans numerous evaluation axes, including retrieval, classification, clustering, bitext mining, reranking, semantic textual similarity, and summarization, making it an essential resource for universal model evaluation in multilingual natural language processing (Ren et al., 14 Nov 2025).
1. Evolution and Scope
MMTEB systematically addresses the limitations of prior evaluation standards by merging multilingual extensibility with multi-task comprehensiveness. While its predecessor MTEB (Muennighoff et al., 2022) unified eight core embedding tasks (bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity, summarization) over 58 datasets and 112 languages, MMTEB substantially expands both dimensions, to more than 500 tasks in over 250 languages. Notably, MMTEB covers both high-resource and low-resource languages, integrating datasets across Indo-European, Sino-Tibetan, Afro-Asiatic, Niger-Congo, Dravidian, Uralic, and other language families. The suite is also designed to be extensible, accommodating domain extensions (see MTEB-Code and MTEB-French (Ciancone et al., 30 May 2024)) and regional derivatives such as AfriMTEB (59 African languages, 38 datasets) (Uemura et al., 27 Oct 2025) and ruMTEB (Russian, 23 datasets) (Snegirev et al., 22 Aug 2024).
2. Task Families and Datasets
MMTEB features a taxonomy of nine primary task categories, each represented by multiple datasets and evaluation metrics suited to their operational semantics (Babakhin et al., 10 Nov 2025, Lee et al., 10 Mar 2025). The principal families, with representative metrics, are listed below; several metric computations are sketched after the list:
- Retrieval: Passage, instruction, and document retrieval evaluated by recall@k, mean reciprocal rank (MRR), or mean average precision (MAP).
- Classification: Topic, intent, scenario, pairwise, and multi-label classification, assessed by accuracy or F1.
- Bitext Mining: Precision/recall-based evaluation of parallel sentence mining across language pairs.
- Clustering: Quantified by purity, normalized mutual information (NMI), adjusted Rand index (ARI), or V-measure.
- Reranking: Secondary ranking problems using MRR or MAP.
- Semantic Textual Similarity (STS): Pearson’s r or Spearman’s ρ correlation between predicted cosine similarity and gold standards.
- Summarization: Correlation-based scoring (e.g., Spearman's ρ) between embedding similarities of machine and reference summaries and human quality judgments.
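Several of these metrics can be computed with standard library implementations, as in the toy sketch below. The labels, clusters, and similarity scores are made-up placeholders rather than MMTEB outputs.

```python
# Toy illustration of representative MMTEB metrics using scikit-learn and SciPy.
# All inputs below are placeholder values for demonstration only.
from scipy.stats import spearmanr
from sklearn.metrics import (accuracy_score, adjusted_rand_score, f1_score,
                             normalized_mutual_info_score, v_measure_score)

# Classification: accuracy / F1 on predicted vs. gold labels.
y_true, y_pred = [0, 1, 1, 2, 2], [0, 1, 2, 2, 2]
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))

# Clustering: V-measure, NMI, and ARI between gold classes and predicted clusters.
classes, clusters = [0, 0, 1, 1, 2], [1, 1, 0, 0, 0]
print("V-measure:", v_measure_score(classes, clusters))
print("NMI:", normalized_mutual_info_score(classes, clusters))
print("ARI:", adjusted_rand_score(classes, clusters))

# STS: Spearman's rho between predicted cosine similarities and gold scores.
gold_sts = [4.5, 1.0, 3.2, 2.8]
pred_cos = [0.91, 0.12, 0.70, 0.55]
print("Spearman's rho:", spearmanr(gold_sts, pred_cos)[0])
```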
To reflect the breadth of linguistic diversity, MMTEB employs both original and translated datasets, as well as regional expansions like AfriMTEB and ruMTEB, ensuring macro-averaged scoring over languages to avoid dominance by resource-rich languages (Uemura et al., 27 Oct 2025, Snegirev et al., 22 Aug 2024).
3. Evaluation Methodology and Metrics
MMTEB enforces unified evaluation protocols to ensure comparability across models and tasks (Muennighoff et al., 2022, Babakhin et al., 10 Nov 2025):
- Embedding extraction: Input texts are tokenized, encoded by the model, and pooled, typically via mean pooling followed by L2 normalization (a minimal pooling sketch follows this list).
- Task-specific evaluation:
  - Retrieval and bitext mining use recall@k;
  - Classification and pair classification use accuracy;
  - Clustering is measured by V-measure;
  - STS applies Spearman's ρ.
- Scoring aggregation: The MMTEB leaderboard typically employs the Borda count method, in which each task ranks the models and the overall score reflects the aggregated ranks; secondary metrics include mean scores per task and per task type (Babakhin et al., 10 Nov 2025, Lee et al., 10 Mar 2025). A Borda-count sketch follows the pooling example below.
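The embedding-extraction protocol can be illustrated with a minimal sketch. The model name and sequence length below are illustrative assumptions; actual evaluations use whatever pooling and prompting the evaluated model prescribes.

```python
# Minimal sketch of the extraction protocol above: mean pooling over
# non-padding tokens followed by L2 normalization. Model name is illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "intfloat/multilingual-e5-base"  # any encoder works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
    return F.normalize(pooled, p=2, dim=1)                 # L2 normalization

embeddings = embed(["¿Dónde está la biblioteca?", "Where is the library?"])
print(embeddings.shape)  # e.g. torch.Size([2, 768])
```

The Borda-count aggregation can likewise be sketched; the leaderboard's exact tie-breaking and weighting rules may differ from this simplified version.

```python
# Hedged sketch of Borda-count aggregation over per-task scores (higher is better).
from collections import defaultdict

def borda_aggregate(scores_by_task: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores_by_task: task -> {model: score}. Returns model -> total Borda points."""
    totals: dict[str, float] = defaultdict(float)
    for task_scores in scores_by_task.values():
        # Rank models by score; the best of n models receives n-1 points, the worst 0.
        ranked = sorted(task_scores, key=task_scores.get, reverse=True)
        for points, model in enumerate(reversed(ranked)):
            totals[model] += points
    return dict(totals)

example = {
    "STS22": {"model_a": 0.71, "model_b": 0.69, "model_c": 0.74},
    "MIRACLRetrieval": {"model_a": 0.55, "model_b": 0.61, "model_c": 0.58},
}
print(borda_aggregate(example))
# {'model_b': 2.0, 'model_a': 1.0, 'model_c': 3.0} -> model_c ranks highest overall
```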
4. Statistical Significance and Model Comparison
To ensure robust conclusions, task-wise results are complemented by statistical tests; score changes exceeding two standard deviations (2σ) are declared significant (a minimal sketch of this criterion appears at the end of this section). MMTEB results have shown that, for post-processing techniques such as renormalization (Ren et al., 14 Nov 2025), over 50% of tasks in the retrieval and clustering categories yield statistically significant improvements across evaluated models:
- Retrieval: Projection-based renormalization (R2) yields an average uplift of 9.7.
- Classification: Average uplift of 3.1.
- Other task families (bitext mining, clustering, reranking, STS, summarization): Average uplift of 0.8.
Task-level analyses confirm that improvements are broadly distributed, though some categories (multi-label, reranking) show more modest gains across fewer models.
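One plausible reading of the 2σ criterion is sketched below: the observed mean score gain on a task is compared against twice the standard deviation of that gain under bootstrap resampling. The bootstrap procedure and the per-example scores are illustrative assumptions, not the exact protocol of (Ren et al., 14 Nov 2025).

```python
# Hedged sketch of a 2-sigma significance check for a single task, assuming
# per-example quality scores (e.g., per-query nDCG) exist for a baseline and a
# treated (renormalized) model. The bootstrap estimate here is illustrative.
import numpy as np

def significant_improvement(baseline: np.ndarray, treated: np.ndarray,
                            n_boot: int = 1000, seed: int = 0) -> bool:
    rng = np.random.default_rng(seed)
    diffs = treated - baseline                       # paired per-example differences
    boot_means = np.array([
        rng.choice(diffs, size=diffs.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    # Declare significance when the observed mean gain exceeds 2 standard
    # deviations of the bootstrapped distribution of the mean difference.
    return diffs.mean() > 2 * boot_means.std()

rng = np.random.default_rng(42)
base = rng.normal(0.60, 0.05, size=500)              # synthetic baseline scores
treat = base + rng.normal(0.02, 0.05, size=500)      # small average improvement
print(significant_improvement(base, treat))
```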
5. Benchmark Impact on Model Development
MMTEB has catalyzed model innovation in multilingual and universal text embedding:
- Architectural adaptation: Models such as llama-embed-nemotron-8b convert decoder-only LLMs to bidirectional encoders and fine-tune them on diverse data mixes, including millions of public and synthetic pairs (Babakhin et al., 10 Nov 2025).
- Instruction tuning: Task-specific prompts and instruction-aware architectures enhance performance across heterogeneous tasks by allowing a single model to switch behavior through input formatting.
- Synthetic data integration: State-of-the-art models are trained not only on high-quality public datasets (e.g., Nemotron-CC-v2, MS MARCO, MIRACL) but also on large volumes of synthetic samples generated by open-weight LLMs, with ablation studies demonstrating that synthetic-data diversity systematically offsets in-domain data scarcity.
- Model soup: Parameter averaging (model merging) consolidates strengths from multiple training checkpoints, improving generalization and mean performance over almost all task families (Babakhin et al., 10 Nov 2025); a minimal averaging sketch follows this list.
- Renormalization methods: MMTEB experiments confirm the existence of a near-constant mean bias (a shared mean component μ) in embedding outputs, motivating plug-and-play, training-free post-processing steps that remove this shared component and spread encodings more isotropically, delivering up to +10% relative gains in retrieval and consistent boosts elsewhere (Ren et al., 14 Nov 2025).
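The model-soup step admits a minimal parameter-averaging sketch. The checkpoint paths and uniform weighting below are illustrative assumptions, not the exact merging recipe of (Babakhin et al., 10 Nov 2025).

```python
# Minimal "model soup" sketch: uniform averaging of parameters from several
# fine-tuning checkpoints of the same architecture. Paths are placeholders.
import torch

checkpoint_paths = ["ckpt_epoch1.pt", "ckpt_epoch2.pt", "ckpt_epoch3.pt"]  # hypothetical
state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]

averaged = {}
for key, value in state_dicts[0].items():
    if value.dtype.is_floating_point:
        # Cast to float32 for numerically stable averaging across checkpoints.
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    else:
        averaged[key] = value  # keep integer buffers (e.g., position ids) as-is

torch.save(averaged, "model_soup.pt")
# The averaged state dict can then be loaded back into the base model:
# model.load_state_dict(torch.load("model_soup.pt"))
```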
6. Regional Expansions and Language Equity
MMTEB's extensible architecture underpins a series of regional benchmarks addressing gaps in language coverage and task diversity. Notable integrations include:
- ruMTEB: Provides 23 Russian datasets covering classification, semantic similarity, clustering, and more, with baseline and state-of-the-art multilingual and Russian-focused embedders evaluated under identical codebase and metrics (Snegirev et al., 22 Aug 2024).
- MTEB-French: Implements the first massive French embedding benchmark, reusing the MMTEB evaluation pipeline to ensure cross-linguistic comparability (Ciancone et al., 30 May 2024).
- AfriMTEB: Extends MMTEB to 59 African languages, introducing novel tasks such as hate speech detection, intent detection, and emotion classification, and demonstrates that targeted adaptation (AfriE5) surpasses strong multilingual baselines (Uemura et al., 27 Oct 2025).
7. Practical Recommendations and Reproducibility
MMTEB is released with open-source code, datasets, and continuous leaderboards, hosted on public platforms such as GitHub and HuggingFace (Muennighoff et al., 2022, Snegirev et al., 22 Aug 2024, Ciancone et al., 30 May 2024). The standard evaluation pipeline allows new models to be benchmarked with minimal code modification, as sketched below.
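As a hedged illustration of this pipeline, the sketch below uses the open-source mteb Python package with a sentence-transformers model. The model name, task selection, and output folder are illustrative, and the package's interface may differ slightly across versions.

```python
# Hedged sketch: benchmarking an off-the-shelf embedder on a subset of MMTEB tasks.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # any embedder works
tasks = mteb.get_tasks(tasks=["STS22", "Banking77Classification"])  # illustrative subset
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/multilingual-e5-base")
```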
A recipe for the state-of-the-art post-processing step (projection-based renormalization, R2) reported by (Ren et al., 14 Nov 2025) is:
- Collect 100,000 disjoint sentences (e.g., from Wikipedia).
- Compute the corpus mean μ and its unit direction μ̂ = μ / ‖μ‖.
- At inference, remove the component of each embedding along μ̂ and re-normalize the result to unit length.
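A minimal sketch of this post-processing is shown below, assuming R2 removes the component along the corpus-mean direction and then re-normalizes; the helper `embed_sentences` is hypothetical, and the exact R2 formulation is given in (Ren et al., 14 Nov 2025).

```python
# Hedged sketch of projection-based renormalization: estimate the corpus-mean
# direction from held-out sentences, project it out of each embedding, re-normalize.
import numpy as np

def fit_mean_direction(corpus_embeddings: np.ndarray) -> np.ndarray:
    mu = corpus_embeddings.mean(axis=0)            # corpus mean
    return mu / np.linalg.norm(mu)                 # unit direction mu_hat

def renormalize(x: np.ndarray, mu_hat: np.ndarray) -> np.ndarray:
    projected = x - np.outer(x @ mu_hat, mu_hat)   # remove shared mean component
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)

# Usage (shapes only; embed_sentences is a hypothetical embedding helper):
# corpus = embed_sentences(wikipedia_sample)       # (100_000, D)
# mu_hat = fit_mean_direction(corpus)
# query_embeddings = renormalize(embed_sentences(queries), mu_hat)
```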
Widespread benchmarking via MMTEB propels the field towards robust, truly universal text embedding solutions, encourages the inclusion of previously underrepresented languages and tasks, and enables reproducible, statistically grounded model selection and development (Ren et al., 14 Nov 2025, Babakhin et al., 10 Nov 2025, Lee et al., 10 Mar 2025, Uemura et al., 27 Oct 2025, Snegirev et al., 22 Aug 2024, Ciancone et al., 30 May 2024, Muennighoff et al., 2022).