Massive Legal Embedding Benchmark (MLEB)

Updated 29 October 2025
  • The MLEB is an open-source benchmark featuring 10 expert-annotated datasets across 6 jurisdictions, designed to evaluate embedding models in legal information retrieval.
  • It addresses prior limitations by incorporating diverse document types, rigorous manual annotation, and comprehensive jurisdictional coverage for enhanced legal NLP tasks.
  • Evaluation using NDCG@10 demonstrates that domain-adapted embedding models outperform general ones, setting a new standard for legal retrieval benchmarks.

The Massive Legal Embedding Benchmark (MLEB) is an open-source resource for the comprehensive evaluation of embedding models in the legal domain, particularly for legal information retrieval and retrieval-augmented generation (RAG) tasks. With ten expert-annotated datasets spanning six jurisdictions and multiple legal document types, MLEB sets a new standard for both the scale and the diversity of legal NLP benchmarking (Butler et al., 22 Oct 2025). It addresses key limitations of prior datasets, such as jurisdictional narrowness, lack of document variety, and inconsistent labeling quality, enabling systematic scrutiny of embedding models against the rigors and intricacies of legal text processing.

1. Foundation and Motivation

The creation of MLEB arose in response to substantial shortcomings in existing legal NLP benchmarks—namely, the restricted focus on US-centric contracts and the prevalence of automatically or crowd-labeled data, which often failed to reflect genuine legal research needs. Seven of MLEB's ten datasets were newly constructed using manual annotation by legal professionals, specifically to fill topical and jurisdictional gaps in open-source legal information retrieval resources. Core motivations included the need for high label quality, jurisdictional breadth, and diversity in document and task types.

2. Jurisdictions, Document Types, and Dataset Structure

MLEB’s scope covers:

  • Six jurisdictions: United States, United Kingdom, European Union, Australia, Ireland, Singapore.
  • Document types: Judicial (case law), Regulatory (legislation, guidance), Contractual (contracts, clauses), Legal literature.

The benchmark comprises the following datasets:

Name                           Queries  Domain       Jurisdiction
Bar Exam QA                    117      Judicial     US
SCALR                          120      Judicial     US
Singaporean Judicial Keywords  500      Judicial     Singapore
GDPR Holdings Retrieval        500      Judicial     EU
Australian Tax Guidance        112      Regulatory   Australia
Irish Legislative Summaries    500      Regulatory   Ireland
UK Legislative Long Titles     78       Regulatory   UK
Contractual Clause Retrieval   90       Contractual  Multinational
License TL;DR Retrieval        65       Contractual  Multinational
Consumer Contracts QA          198      Contractual  Multinational

In total, these datasets encompass 2,280 query–passage pairs. Domains represented include judicial (4 datasets), regulatory (3), and contractual (3), providing coverage of both common law and civil law systems and a wide spectrum of legal information tasks.
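The datasets follow a common query–passage layout. As a rough illustration of working with one of them, and assuming the corpora are published on Hugging Face as noted in the construction section below, the snippet loads a retrieval split and prints a few pairs. The repository name and column names are placeholders, not the official identifiers; consult the MLEB release for the actual ones.

```python
# Hypothetical example: the repository ID and the "query"/"passage" column
# names are placeholders, not the official MLEB identifiers.
from datasets import load_dataset

dataset = load_dataset("example-org/mleb-bar-exam-qa", split="test")

for record in dataset.select(range(3)):
    print(record["query"][:80], "->", record["passage"][:80])
```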

3. Construction Methodology and Annotation

MLEB was constructed with an explicit emphasis on legal expertise and manual quality control:

  • Expert annotation: 7 out of 10 datasets are newly created and labeled by legal professionals.
  • Public and authoritative sources: Data were extracted from courts, legislatures, and regulatory guidance materials using tools such as Inscriptis for web extraction and simhash for deduplication (a deduplication sketch follows this list).
  • Rigorous query–passage construction: For each query, paired passages were selected to enforce semantic (not superficial) relevance.
  • Real-world representativeness: Actual user questions—especially those that defied classic IR approaches—were favored (e.g., in the Australian Tax Guidance set).
  • Documentation and transparency: All cleaning, annotation, and dataset construction steps are documented; datasets are openly licensed and released for reproducibility on platforms such as Hugging Face and GitHub.
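The deduplication step mentioned above can be illustrated with a self-contained simhash routine. This is a generic sketch of the technique, not MLEB's actual pipeline code; the shingle size, hash function, fingerprint width, and example passages are illustrative choices.

```python
import hashlib
from itertools import combinations

def simhash(text: str, num_bits: int = 64, shingle_size: int = 3) -> int:
    """Compute a simhash fingerprint of `text` from word shingles."""
    tokens = text.lower().split()
    shingles = [" ".join(tokens[i:i + shingle_size])
                for i in range(max(1, len(tokens) - shingle_size + 1))]
    counts = [0] * num_bits
    for shingle in shingles:
        # Use the first 64 bits of an MD5 digest as the per-shingle hash.
        h = int.from_bytes(hashlib.md5(shingle.encode("utf-8")).digest()[:8], "big")
        for bit in range(num_bits):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Compare candidate passages pairwise; a small Hamming distance between
# fingerprints suggests near-duplicate text worth manual review.
passages = [
    "The tenant shall pay rent monthly in advance.",
    "The tenant shall pay the rent monthly in advance.",
    "Liability under this agreement is limited to direct damages.",
]
fingerprints = [simhash(p) for p in passages]
for (i, fa), (j, fb) in combinations(enumerate(fingerprints), 2):
    print(f"passages {i} vs {j}: Hamming distance {hamming_distance(fa, fb)}")
```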

4. Evaluation Tasks and Metrics

MLEB supports multiple evaluation task types:

  • Legal Search/Retrieval: All ten datasets are centered around the retrieval task, framed as query–passage relevance matching.
  • Zero-Shot Classification: Particularly in the Contractual Clause Retrieval task, where clause type definitions are matched to actual clauses in an NLI-style format.
  • Question Answering: Datasets such as Bar Exam QA and Consumer Contracts QA support formal question–answering frameworks.

The main metric is NDCG@10 (Normalized Discounted Cumulative Gain at rank 10), which is standard in information retrieval and is computed per task, dataset, and model. Top-performing embedding models are legal domain-adapted, with the “Kanon 2 Embedder” achieving task-averaged NDCG@10 of 86.03, followed by “Voyage 3 Large” and “Voyage 3.5” (Butler et al., 22 Oct 2025). This validates the utility of specialized legal embeddings over general-purpose textual models.
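To make the evaluation protocol concrete, the following sketch embeds a query and a handful of candidate passages, ranks them by cosine similarity, and scores the ranking with NDCG@10 (linear-gain formulation, which coincides with the exponential form for binary labels). It uses the sentence-transformers library with an illustrative general-purpose model and toy relevance labels; it is not the official MLEB evaluation harness.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def ndcg_at_k(ranked_relevances, k: int = 10) -> float:
    """NDCG@k with a log2 position discount over graded or binary relevance."""
    rels = np.asarray(ranked_relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rels.size + 2))
    dcg = float(np.sum(rels[:k] * discounts[:k]))
    ideal = np.sort(rels)[::-1]
    idcg = float(np.sum(ideal[:k] * discounts[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Illustrative general-purpose model; MLEB compares such models against
# legal domain-adapted embedders.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Is a liquidated damages clause enforceable if it operates as a penalty?"
passages = [
    "A clause fixing damages is unenforceable where it is penal in nature.",
    "The lease commences on the first day of the calendar month.",
    "Courts distinguish genuine pre-estimates of loss from penalties.",
]
labels = np.array([1, 0, 1])  # toy relevance judgments for the passages above

q_emb = model.encode([query], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = (q_emb @ p_emb.T).ravel()   # cosine similarity of normalized vectors
ranking = np.argsort(-scores)        # highest-scoring passages first
print("NDCG@10:", round(ndcg_at_k(labels[ranking]), 4))
```

Averaging such per-query scores within each dataset, and then across datasets, yields task-averaged NDCG@10 figures of the kind reported above.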

5. Comparison to Prior Benchmarks

Previous benchmarks such as LegalBench-RAG and MTEB-Legal were limited in various respects:

  • LegalBench-RAG: Primarily US contracts, lacking document and jurisdictional variety.
  • MTEB-Legal: US-centric, with linguistic and labeling limitations, including reliance on automated relevance labels and bias introduced by mixed-language and mixed-country data.

MLEB distinguishes itself by providing broad jurisdictional coverage across three continents, rigorous manual labeling by legal experts, and detailed documentation for code and data. This combination supports robust evaluation of domain-specific embedding models and is designed for transparent, reproducible experimentation.

6. Relation to Broader Legal Embedding Research

MLEB integrates advances highlighted in related research. For instance, the multi-layered embedding-based retrieval method (Lima, 12 Nov 2024) demonstrates that representing legal texts at multiple structural levels (articles, paragraphs, and clauses) facilitates granular retrieval and better captures legal “aboutness”, the central semantic theme of the text. These technical strategies are implicitly aligned with MLEB's construction, where careful chunking and semantic granularity are prioritized to support diverse retrieval and RAG scenarios.
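A minimal sketch of the multi-level idea follows: the same statute is chunked at several structural granularities, and each chunk carries its level as metadata, so a retriever can answer broad questions from article-level vectors and narrow ones from clause-level vectors. The split markers and record layout are illustrative assumptions, not the exact method of (Lima, 12 Nov 2024) or MLEB's preprocessing code.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    level: str  # "article", "paragraph", or "clause"
    ref: str    # human-readable reference such as "Art. 5(1)(a)"
    text: str

def chunk_statute(articles: dict) -> list:
    """Produce article-, paragraph-, and clause-level chunks of a statute."""
    chunks = []
    for art_ref, art_text in articles.items():
        chunks.append(Chunk("article", art_ref, art_text))
        for p_idx, para in enumerate(art_text.split("\n\n"), start=1):
            chunks.append(Chunk("paragraph", f"{art_ref}({p_idx})", para))
            for c_idx, clause in enumerate(para.split(";"), start=1):
                chunks.append(Chunk("clause", f"{art_ref}({p_idx})({c_idx})",
                                    clause.strip()))
    return chunks

# Each chunk would then be embedded and indexed together with its level,
# letting retrieval operate at whichever granularity a query requires.
```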

Analogical reasoning benchmarks in other jurisdictions, such as the Legal Analogical Reasoning Questions Set (LARQS) derived from 2,388 Chinese codices (Lin et al., 2022), further inform MLEB's methodology. LARQS demonstrates that analogy tasks based on legal relations provide a practical and domain-appropriate baseline for legal embedding evaluation and highlights the need to explicitly model legal relationships that general language benchmarks neglect. This suggests MLEB's design is extensible to multilingual and global settings, though manual curation remains a scaling challenge.
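Analogy-style evaluation of the LARQS kind can be sketched with the classic vector-offset test: for a relation pair (a, b) and a query term c, the vector b - a + c should lie nearest to the correct completion. The legal terms and the `embed` callable below are placeholders for any embedding model's encoder; this is a generic illustration rather than the LARQS protocol verbatim.

```python
import numpy as np

def solve_analogy(embed, a: str, b: str, c: str, candidates: list) -> str:
    """Return the candidate closest (by cosine) to embed(b) - embed(a) + embed(c)."""
    target = embed(b) - embed(a) + embed(c)
    target = target / np.linalg.norm(target)
    best, best_score = None, -np.inf
    for cand in candidates:
        vec = embed(cand)
        score = float(np.dot(target, vec / np.linalg.norm(vec)))
        if score > best_score:
            best, best_score = cand, score
    return best

# Placeholder relation: "offence" : "penalty" :: "breach" : ?
# e.g. solve_analogy(model_encode, "offence", "penalty", "breach",
#                    ["remedy", "statute", "plaintiff"])
```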

7. Impact and Future Directions

MLEB is openly available for academic and industry researchers, with all datasets, evaluation code, and experimental results released under open-source licenses (URLs provided in Section 6 of the paper). The openness and extensibility of MLEB eliminate legal and proprietary barriers that previously constrained legal IR benchmarking. MLEB’s comprehensive and expert-annotated corpus sets a robust baseline for evaluating legal NLP systems, supporting reproducible research, fair model comparison, and further development in the field. A plausible implication is that as legal NLP tasks expand in scope and complexity—incorporating multilingual, jurisdiction-specific, and fine-grained legal tasks—the methodological frameworks validated by MLEB will frame future benchmark expansions and inform the development of better legal AI systems.


MLEB is recognized as the most complete, quality-assured, jurisdictionally and topically diverse open benchmark for legal embedding evaluation, enabling robust and reproducible research in legal NLP and legal information retrieval (Butler et al., 22 Oct 2025).
