MTEB English Leaderboard Benchmark

Updated 3 October 2025
  • The MTEB English Leaderboard is a comprehensive benchmark that evaluates universal text embedding models using 58 datasets spanning 8 distinct tasks.
  • It employs standardized evaluation protocols with task-specific metrics like accuracy, Spearman correlation, and nDCG@10 to ensure reproducibility and fair comparison.
  • Community-driven innovation and open submissions via GitHub enhance transparency and drive continuous improvements in text representation research.

The MTEB English Leaderboard is a comprehensive, continuously updated public benchmark designed to rigorously evaluate and compare universal text embedding models on a wide array of real-world English tasks. Hosted within the Massive Text Embedding Benchmark (MTEB) framework, it relies on open-source tasks and datasets and on transparent engineering practices to ensure reproducibility, generalizability, and extensibility. The leaderboard serves as a reference point for selecting, tracking, and advancing the state of the art in text embeddings, capturing performance variations across tasks such as classification, clustering, retrieval, and semantic textual similarity.

1. Benchmark Scope and Structure

MTEB was established to address the fragmentation and narrow coverage of earlier benchmarks, which often focused exclusively on semantic textual similarity (STS) tasks or operated in a single language. The English variant includes 58 datasets spanning eight distinct embedding tasks: Bitext Mining, Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, and Summarization. Each dataset is associated with task-appropriate metrics: accuracy for classification, v-measure for clustering, Spearman correlation for STS, nDCG@10 for retrieval, and MAP or MRR@k for reranking.

The leaderboard is organized by task and aggregates results using an average of the primary metric for each included dataset, providing per-task and overall rankings. Model comparisons are thus possible at both granular and holistic levels. The leaderboard infrastructure enables access via an open GitHub repository and public interfaces on platforms like the Hugging Face Hub (Muennighoff et al., 2022).
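
The aggregation scheme described above is simple enough to sketch directly. The snippet below is an illustrative reconstruction, not the leaderboard's own code: per-dataset primary-metric scores are averaged within each task and across all datasets to produce task-level and overall ranking keys. The dataset names are real MTEB datasets, but the scores are made up.

```python
# Illustrative aggregation of per-dataset primary-metric scores into
# per-task averages and an overall average (not the official leaderboard code).
from statistics import mean

# Hypothetical scores keyed as {task: {dataset: primary_metric}}.
scores = {
    "Classification": {"Banking77Classification": 0.84, "AmazonPolarityClassification": 0.92},
    "STS": {"STSBenchmark": 0.86, "SICK-R": 0.80},
    "Retrieval": {"SciFact": 0.65, "NFCorpus": 0.34},
}

per_task = {task: mean(ds.values()) for task, ds in scores.items()}  # task-level ranking key
overall = mean(s for ds in scores.values() for s in ds.values())     # overall ranking key

print(per_task)
print(round(overall, 4))
```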

2. Evaluation Methodology and Metrics

Each embedding model submitted to MTEB must conform to a standard API, accepting a list of texts and returning fixed-size embeddings. Evaluation for each task is conducted in a consistent, deterministic environment:

  • Classification: Models generate embeddings for labeled data, typically followed by a linear classifier or nearest centroid approach, with accuracy or F1 as primary metrics.
  • Clustering: Mini-batch k-means is used, with v-measure providing a label-invariant evaluation.
  • Retrieval: Texts are embedded, and retrieval is performed via cosine similarity, evaluated using nDCG@10 or MAP@k.
  • STS: Cosine similarities between paired embeddings are compared with ground-truth similarity scores using Spearman correlation.

Where appropriate, tasks consider various levels of text granularity (sentence, paragraph, document). Models are compared under identical data conditions to ensure fair comparison; a minimal sketch of two of these per-task protocols follows.
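
The sketch below evaluates stand-in embeddings for the STS and clustering protocols using scikit-learn and SciPy. It is a minimal reconstruction under the assumptions noted in the comments, not the mteb package's implementation; a real run would obtain the embeddings from the model's standard interface (a list of texts in, fixed-size embeddings out).

```python
# Minimal, self-contained sketch of two MTEB-style per-task protocols
# (illustrative only; the official implementations live in the mteb package).
import numpy as np
from scipy.stats import spearmanr
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import v_measure_score
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)          # fixed seed, mirroring the benchmark's determinism

# Stand-in embeddings; a real run would call the model's encode() on lists of texts.
emb_a = rng.normal(size=(100, 384))
emb_b = rng.normal(size=(100, 384))
gold_sts = rng.uniform(0, 5, size=100)  # hypothetical human similarity scores

# STS: cosine similarity per pair, Spearman correlation against gold scores.
cos = np.diag(cosine_similarity(emb_a, emb_b))
sts_score, _ = spearmanr(cos, gold_sts)

# Clustering: mini-batch k-means, scored with the label-invariant v-measure.
gold_labels = rng.integers(0, 5, size=100)
pred = MiniBatchKMeans(n_clusters=5, random_state=0).fit_predict(emb_a)
clustering_score = v_measure_score(gold_labels, pred)

print(f"STS Spearman: {sts_score:.3f}, clustering v-measure: {clustering_score:.3f}")
```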

3. Engineering for Reproducibility and Community Involvement

The MTEB English Leaderboard depends on robust engineering protocols to guarantee reproducibility (Chung et al., 26 Jun 2025):

  • Continuous Integration Pipelines: Every dataset or model submission triggers format validation, metadata completeness checks, and automated linting and unit tests across multiple platforms (Linux, Windows) and Python versions. Pydantic schemas enforce consistency for both task and model metadata (see the schema sketch after this list).
  • Versioning: The system maintains multi-level version control—tasks, datasets, models, and code—ensuring each leaderboard entry can be reproduced from exact component snapshots.
  • Deterministic Environments: Fixed random seeds are logged and used in evaluations (e.g., clustering with K-means) to eliminate stochasticity.
  • Automated Release: Changelogs, artifact uploads, and leaderboard refreshes are automated, with results aggregated daily.
  • Open Submissions: Researchers submit results through GitHub pull requests, which are peer-reviewed and include code and version references for independent verification.
  • Community Extension: The modular codebase and open processes allow incorporation of new tasks, models, and datasets, with community feedback governing design evolution.
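
To make the Pydantic-based validation step concrete, here is a hypothetical schema sketch. The field names and allowed values are illustrative assumptions, not the actual MTEB metadata model, which lives in the mteb repository.

```python
# Hypothetical task-metadata schema (requires Pydantic v2); field names are
# illustrative, not the real MTEB schema.
from typing import Literal
from pydantic import BaseModel, Field

class TaskMetadata(BaseModel):
    name: str
    task_type: Literal[
        "BitextMining", "Classification", "Clustering", "PairClassification",
        "Reranking", "Retrieval", "STS", "Summarization",
    ]
    main_score: str                       # e.g. "accuracy", "v_measure", "ndcg_at_10"
    languages: list[str] = Field(min_length=1)
    dataset_revision: str                 # pins the exact dataset snapshot for reproducibility

# Validation fails loudly in CI if a submission omits or mistypes a required field.
meta = TaskMetadata(
    name="Banking77Classification",
    task_type="Classification",
    main_score="accuracy",
    languages=["eng"],
    dataset_revision="0123abc",
)
print(meta.model_dump())
```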

4. Models and Performance Insights

The leaderboard highlights the rapid evolution of universal text embedding models and reveals key performance characteristics (Muennighoff et al., 2022, Cao, 27 May 2024):

  • No Universal Dominance: No single embedding model provides top results across all tasks. For instance, transformer-based models such as ST5-XXL dominate on STS and classification, while lighter models (e.g., MPNet, MiniLM) excel in clustering and reranking.
  • Contrastive Learning: Most high-performing models employ large-scale contrastive training. For example, E5 is trained on the Colossal Clean text Pairs (CCPairs) dataset sourced from web-scale corpora (e.g., Reddit, Wikipedia), and BGE follows a similar large-scale paired-data recipe, with synthetic LLM-generated data further increasing coverage and diversity (a minimal contrastive-loss sketch follows this list).
  • Loss Innovations: Alternative objective functions are introduced to address issues like gradient saturation in cosine-based contrastive losses. The Universal AnglE Embedding (UAE) utilizes an angle-based loss in complex space, while Matryoshka Representation Learning (MRL) adopts nested subspace objectives for granularity.
  • LLM Integration: Recent advances harness pretrained decoder-only LLMs as embedding backbones (e.g., E5-mistral-7b-instruct, SFR-Embedding-Mistral), utilizing adaptation strategies to enable bidirectional representation and task-specific instruction injection. These models achieve high average scores but incur substantial computational overhead.
  • Task-Specific Trends: Performance gains are pronounced in retrieval, reranking, and clustering, but advances in summarization remain limited. Multilingual support is available (e.g., Multilingual-E5), but low-resource language performance still lags behind the English-dominated results.
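
The contrastive training referenced above typically reduces to an in-batch InfoNCE objective over scaled cosine similarities. The PyTorch sketch below shows that core loss in isolation; it is a generic illustration under stated assumptions, not the exact objective of E5, BGE, UAE, or any other listed model.

```python
# Generic in-batch InfoNCE loss over cosine similarities (illustrative sketch).
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: (batch, dim); row i of doc_emb is the positive for row i of query_emb."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature           # pairwise cosine similarities, temperature-scaled
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)  # other rows in the batch act as negatives

# Usage with random stand-in embeddings:
loss = info_nce(torch.randn(32, 768), torch.randn(32, 768))
print(loss.item())
```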

5. Methodologies for Automated Leaderboard Construction

Automated leaderboard generation for large-scale benchmarks such as MTEB is underpinned by extraction and aggregation techniques (Singh et al., 2018, Hou et al., 2019):

  • Structured Extraction: Systems mine performance data from tables and context-annotated document sections, identifying tasks, datasets, metrics, and scores through transformer-based NLI models (e.g., TDMS-IE). Extraction is enhanced using engineered document representations (such as the DocTAET structure) combining abstract, experimental setup, and table metadata.
  • Ranking via Performance Graphs: Extracted pairwise performance comparisons are encoded as directed performance improvement graphs, with nodes representing papers or models and edges denoting relative improvements on specific metrics. Edges carry normalized improvement values (e.g., via sigmoid-transformed REI) and are aggregated using approaches such as PageRank, with careful graph sanitization to address noise and inconsistent reporting (a minimal graph-and-PageRank sketch follows this list).
  • Leaderboard Robustness: These methodologies achieve high rank-order correlation with expert-curated benchmarks in a range of domains and are resilient to extraction errors and incomplete reporting. Ranking quality is validated via recall, precision, and NDCG metrics.
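
The graph-based ranking step can be sketched with networkx: extracted pairwise comparisons become weighted directed edges, and PageRank concentrates mass on models that are repeatedly reported as improvements. The model names, weights, and worse-to-better edge orientation below are illustrative assumptions, not details of the TDMS-IE pipeline.

```python
# Illustrative ranking from extracted pairwise comparisons via PageRank.
import networkx as nx

# Hypothetical extracted comparisons: (outperformed_model, better_model, normalized_improvement).
comparisons = [
    ("model_A", "model_B", 0.62),
    ("model_A", "model_C", 0.71),
    ("model_B", "model_C", 0.55),
]

G = nx.DiGraph()
for worse, better, weight in comparisons:
    # Edges point from the outperformed model to the better one, weighted by
    # the (e.g. sigmoid-transformed) relative improvement.
    G.add_edge(worse, better, weight=weight)

ranks = nx.pagerank(G, weight="weight")
leaderboard = sorted(ranks, key=ranks.get, reverse=True)
print(leaderboard)   # models accumulating improvement mass rank highest
```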

6. Generalizability and Reliability

A critical element of MTEB leaderboard credibility is the emphasis on out-of-distribution generalization and transparent result reporting (Chung et al., 26 Jun 2025):

  • Zero-Shot Score: The zero-shot score z = 1 − (n_train / n_total), where n_train is the number of benchmark datasets present in a model's training data and n_total is the total number of benchmark datasets, quantifies how much reported performance depends on previously seen benchmark data. Higher zero-shot scores indicate stronger generalization (a short worked example follows this list).
  • Community Feedback: Earlier strict enforcement of perfect zero-shot filtering has evolved into a balanced approach, providing context around each result and not excluding honest disclosures of training set overlaps.
  • Reproducibility Annotations: Each result is annotated with full versioning, and infrastructure is in place for community verification and cross-environmental repeatability, including explicit logging of environmental details and computational footprints.
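
For concreteness, here is a short calculation of the zero-shot score under hypothetical dataset counts:

```python
# Worked example of the zero-shot score; the dataset counts are hypothetical.
def zero_shot_score(n_train: int, n_total: int) -> float:
    """z = 1 - n_train / n_total: fraction of benchmark datasets unseen during training."""
    return 1.0 - n_train / n_total

print(zero_shot_score(n_train=0, n_total=58))    # 1.0   -> fully zero-shot
print(zero_shot_score(n_train=10, n_total=58))   # ~0.83 -> partial overlap with training data
```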

7. Limitations and Future Directions

Despite its comprehensive design, the MTEB English Leaderboard faces several acknowledged limitations and is subject to ongoing development (Muennighoff et al., 2022, Cao, 27 May 2024, Chung et al., 26 Jun 2025):

  • Coverage Imbalance: Summarization tasks remain underrepresented (often with only a single English dataset), and many real-world domains (e.g., finance, culture, health) are insufficiently covered.
  • Instruction Length and Embedding Geometry: The impact of task instruction prompts (beneficial for retrieval and reranking) is not fully understood across all task types. Furthermore, the near-universal reliance on cosine similarity may not always reflect nuanced human similarity judgments, prompting calls for new metric research.
  • Multilingual and Long-Text Support: Expansion to low-resource languages and long-context evaluation remains ongoing.
  • Scalability and Cost: LLM-based methods, while effective, demand significant computational resources, raising concerns about sustainable deployment.
  • Benchmark Sustainability: The engineering structure for community-driven extensibility and automated validation serves as a blueprint for scaling to new tasks, domains, and evaluation paradigms.

Summary Table: Key MTEB English Leaderboard Dimensions

| Dimension | Details/Specification | Source |
| --- | --- | --- |
| Task Coverage | 8 tasks, 58 datasets (English) | (Muennighoff et al., 2022) |
| Model Submission | Open, GitHub-integrated, with CI pipelines | (Chung et al., 26 Jun 2025) |
| Evaluation Metrics | Accuracy, v-measure, Spearman, nDCG@10, MRR@k | (Muennighoff et al., 2022) |
| Embedding Methods | Contrastive-trained, LLM-adapted, loss innovations | (Cao, 27 May 2024) |
| Out-of-Distribution Metric | Zero-shot score z, version logging | (Chung et al., 26 Jun 2025) |
| Public Access | GitHub, Hugging Face, daily updates | (Muennighoff et al., 2022; Chung et al., 26 Jun 2025) |

The MTEB English Leaderboard provides an open, reproducible, and extensible environment for robust comparison of universal text embedding models. Its rigor and community-driven extensibility position it as the primary reference point for benchmarking advances in text representation learning.
