Massive Legal Embedding Benchmark (MLEB)

Updated 29 October 2025
  • The MLEB is an open-source benchmark featuring 10 expert-annotated datasets across 6 jurisdictions, designed to evaluate embedding models in legal information retrieval.
  • It addresses prior limitations by incorporating diverse document types, rigorous manual annotation, and comprehensive jurisdictional coverage for enhanced legal NLP tasks.
  • Evaluation using NDCG@10 demonstrates that domain-adapted embedding models outperform general ones, setting a new standard for legal retrieval benchmarks.

The Massive Legal Embedding Benchmark (MLEB) is an open-source resource for the comprehensive evaluation of embedding models in the legal domain, particularly for legal information retrieval and retrieval-augmented generation (RAG) tasks. With ten expert-annotated datasets spanning six jurisdictions and multiple legal document types, MLEB sets a new standard for both the scale and the diversity of legal NLP benchmarking (Butler et al., 22 Oct 2025). It addresses key limitations of prior datasets, such as jurisdictional narrowness, lack of document variety, and inconsistent labeling quality, enabling systematic scrutiny of embedding models against the rigors and intricacies of legal text processing.

1. Foundation and Motivation

The creation of MLEB arose in response to substantial shortcomings in existing legal NLP benchmarks—namely, the restricted focus on US-centric contracts and the prevalence of automatically or crowd-labeled data, which often failed to reflect genuine legal research needs. Seven of MLEB's ten datasets were newly constructed using manual annotation by legal professionals, specifically to fill topical and jurisdictional gaps in open-source legal information retrieval resources. Core motivations included the need for high label quality, jurisdictional breadth, and diversity in document and task types.

2. Jurisdictions, Document Types, and Dataset Structure

MLEB’s scope covers:

  • Six jurisdictions: United States, United Kingdom, European Union, Australia, Ireland, Singapore.
  • Document types: Judicial (case law), Regulatory (legislation, guidance), Contractual (contracts, clauses), Legal literature.

The benchmark comprises the following datasets:

Name                           Queries  Domain       Jurisdiction
Bar Exam QA                    117      Judicial     US
SCALR                          120      Judicial     US
Singaporean Judicial Keywords  500      Judicial     Singapore
GDPR Holdings Retrieval        500      Judicial     EU
Australian Tax Guidance        112      Regulatory   Australia
Irish Legislative Summaries    500      Regulatory   Ireland
UK Legislative Long Titles     78       Regulatory   UK
Contractual Clause Retrieval   90       Contractual  Multinational
License TL;DR Retrieval        65       Contractual  Multinational
Consumer Contracts QA          198      Contractual  Multinational

In total, these datasets encompass 2,280 query–passage pairs. Domains represented include judicial (4 datasets), regulatory (3), and contractual (3), providing coverage of both common law and civil law systems and a wide spectrum of legal information tasks.
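The datasets follow a common query–passage layout. As a rough illustration of working with one of them, and assuming the corpora are published on Hugging Face as noted in the construction section below, the snippet loads a retrieval split and prints a few pairs. The repository name and column names are placeholders, not the official identifiers; consult the MLEB release for the actual ones.

```python
# Hypothetical example: the repository ID and the "query"/"passage" column
# names are placeholders, not the official MLEB identifiers.
from datasets import load_dataset

dataset = load_dataset("example-org/mleb-bar-exam-qa", split="test")

for record in dataset.select(range(3)):
    print(record["query"][:80], "->", record["passage"][:80])
```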

3. Construction Methodology and Annotation

MLEB was constructed with an explicit emphasis on legal expertise and manual quality control:

  • Expert annotation: 7 out of 10 datasets are newly created and labeled by legal professionals.
  • Public and authoritative sources: Data were extracted from courts, legislatures, and regulatory guidance materials using tools such as Inscriptis for web extraction and simhash for deduplication (a deduplication sketch follows this list).
  • Rigorous query–passage construction: For each query, paired passages were selected to enforce semantic (not superficial) relevance.
  • Real-world representativeness: Actual user questions—especially those that defied classic IR approaches—were favored (e.g., in the Australian Tax Guidance set).
  • Documentation and transparency: All cleaning, annotation, and dataset construction steps are documented; datasets are openly licensed and released for reproducibility on platforms such as Hugging Face and GitHub.
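The deduplication step mentioned above can be illustrated with a self-contained simhash routine. This is a generic sketch of the technique, not MLEB's actual pipeline code; the shingle size, hash function, fingerprint width, and example passages are illustrative choices.

```python
import hashlib
from itertools import combinations

def simhash(text: str, num_bits: int = 64, shingle_size: int = 3) -> int:
    """Compute a simhash fingerprint of `text` from word shingles."""
    tokens = text.lower().split()
    shingles = [" ".join(tokens[i:i + shingle_size])
                for i in range(max(1, len(tokens) - shingle_size + 1))]
    counts = [0] * num_bits
    for shingle in shingles:
        # Use the first 64 bits of an MD5 digest as the per-shingle hash.
        h = int.from_bytes(hashlib.md5(shingle.encode("utf-8")).digest()[:8], "big")
        for bit in range(num_bits):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Compare candidate passages pairwise; a small Hamming distance between
# fingerprints suggests near-duplicate text worth manual review.
passages = [
    "The tenant shall pay rent monthly in advance.",
    "The tenant shall pay the rent monthly in advance.",
    "Liability under this agreement is limited to direct damages.",
]
fingerprints = [simhash(p) for p in passages]
for (i, fa), (j, fb) in combinations(enumerate(fingerprints), 2):
    print(f"passages {i} vs {j}: Hamming distance {hamming_distance(fa, fb)}")
```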

4. Evaluation Tasks and Metrics

MLEB supports multiple evaluation task types:

  • Legal Search/Retrieval: All ten datasets are centered around the retrieval task, framed as query–passage relevance matching.
  • Zero-Shot Classification: Particularly in the Contractual Clause Retrieval task, where clause type definitions are matched to actual clauses in an NLI-style format.
  • Question Answering: Datasets such as Bar Exam QA and Consumer Contracts QA support formal question–answering frameworks.

The main metric is NDCG@10 (Normalized Discounted Cumulative Gain at rank 10), which is standard in information retrieval and is computed per task, dataset, and model. Top-performing embedding models are legal domain-adapted, with the “Kanon 2 Embedder” achieving task-averaged NDCG@10 of 86.03, followed by “Voyage 3 Large” and “Voyage 3.5” (Butler et al., 22 Oct 2025). This validates the utility of specialized legal embeddings over general-purpose textual models.
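To make the evaluation protocol concrete, the following sketch embeds a query and a handful of candidate passages, ranks them by cosine similarity, and scores the ranking with NDCG@10 (linear-gain formulation, which coincides with the exponential form for binary labels). It uses the sentence-transformers library with an illustrative general-purpose model and toy relevance labels; it is not the official MLEB evaluation harness.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def ndcg_at_k(ranked_relevances, k: int = 10) -> float:
    """NDCG@k with a log2 position discount over graded or binary relevance."""
    rels = np.asarray(ranked_relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rels.size + 2))
    dcg = float(np.sum(rels[:k] * discounts[:k]))
    ideal = np.sort(rels)[::-1]
    idcg = float(np.sum(ideal[:k] * discounts[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Illustrative general-purpose model; MLEB compares such models against
# legal domain-adapted embedders.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Is a liquidated damages clause enforceable if it operates as a penalty?"
passages = [
    "A clause fixing damages is unenforceable where it is penal in nature.",
    "The lease commences on the first day of the calendar month.",
    "Courts distinguish genuine pre-estimates of loss from penalties.",
]
labels = np.array([1, 0, 1])  # toy relevance judgments for the passages above

q_emb = model.encode([query], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = (q_emb @ p_emb.T).ravel()   # cosine similarity of normalized vectors
ranking = np.argsort(-scores)        # highest-scoring passages first
print("NDCG@10:", round(ndcg_at_k(labels[ranking]), 4))
```

Averaging such per-query scores within each dataset, and then across datasets, yields task-averaged NDCG@10 figures of the kind reported above.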

5. Comparison to Prior Benchmarks

Previous benchmarks such as LegalBench-RAG and MTEB-Legal were limited in various respects:

  • LegalBench-RAG: Primarily US contracts, lacking document and jurisdictional variety.
  • MTEB-Legal: US-centric, with linguistic and labeling limitations, including reliance on automated relevance labels and bias introduced by mixed-language and mixed-country data.

MLEB distinguishes itself by providing broad jurisdictional coverage across three continents, rigorous manual labeling by legal experts, and detailed documentation for code and data. This combination supports robust evaluation of domain-specific embedding models and is designed for transparent, reproducible experimentation.

6. Relation to Broader Legal Embedding Research

MLEB integrates advances highlighted in related research. For instance, the multi-layered embedding-based retrieval method (Lima, 12 Nov 2024) demonstrates that representing legal texts at multiple structural levels (articles, paragraphs, and clauses) facilitates granular retrieval and better captures legal “aboutness”, the central semantic theme of the text. These technical strategies are implicitly aligned with MLEB's construction, where careful chunking and semantic granularity are prioritized to support diverse retrieval and RAG scenarios.
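A minimal sketch of the multi-level idea follows: the same statute is chunked at several structural granularities, and each chunk carries its level as metadata, so a retriever can answer broad questions from article-level vectors and narrow ones from clause-level vectors. The split markers and record layout are illustrative assumptions, not the exact method of (Lima, 12 Nov 2024) or MLEB's preprocessing code.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    level: str  # "article", "paragraph", or "clause"
    ref: str    # human-readable reference such as "Art. 5(1)(a)"
    text: str

def chunk_statute(articles: dict) -> list:
    """Produce article-, paragraph-, and clause-level chunks of a statute."""
    chunks = []
    for art_ref, art_text in articles.items():
        chunks.append(Chunk("article", art_ref, art_text))
        for p_idx, para in enumerate(art_text.split("\n\n"), start=1):
            chunks.append(Chunk("paragraph", f"{art_ref}({p_idx})", para))
            for c_idx, clause in enumerate(para.split(";"), start=1):
                chunks.append(Chunk("clause", f"{art_ref}({p_idx})({c_idx})",
                                    clause.strip()))
    return chunks

# Each chunk would then be embedded and indexed together with its level,
# letting retrieval operate at whichever granularity a query requires.
```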

Analogical reasoning benchmarks in other jurisdictions, such as the Legal Analogical Reasoning Questions Set (LARQS) derived from 2,388 Chinese codices (Lin et al., 2022), further inform MLEB's methodology. LARQS demonstrates that analogy tasks based on legal relations provide a practical and domain-appropriate baseline for legal embedding evaluation and highlights the need to explicitly model legal relationships that general language benchmarks neglect. This suggests MLEB's design is extensible to multilingual and global settings, though manual curation remains a scaling challenge.
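Analogy-style evaluation of the LARQS kind can be sketched with the classic vector-offset test: for a relation pair (a, b) and a query term c, the vector b - a + c should lie nearest to the correct completion. The legal terms and the `embed` callable below are placeholders for any embedding model's encoder; this is a generic illustration rather than the LARQS protocol verbatim.

```python
import numpy as np

def solve_analogy(embed, a: str, b: str, c: str, candidates: list) -> str:
    """Return the candidate closest (by cosine) to embed(b) - embed(a) + embed(c)."""
    target = embed(b) - embed(a) + embed(c)
    target = target / np.linalg.norm(target)
    best, best_score = None, -np.inf
    for cand in candidates:
        vec = embed(cand)
        score = float(np.dot(target, vec / np.linalg.norm(vec)))
        if score > best_score:
            best, best_score = cand, score
    return best

# Placeholder relation: "offence" : "penalty" :: "breach" : ?
# e.g. solve_analogy(model_encode, "offence", "penalty", "breach",
#                    ["remedy", "statute", "plaintiff"])
```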

7. Impact and Future Directions

MLEB is openly available for academic and industry researchers, with all datasets, evaluation code, and experimental results released under open-source licenses (URLs provided in Section 6 of the paper). The openness and extensibility of MLEB eliminate legal and proprietary barriers that previously constrained legal IR benchmarking. MLEB’s comprehensive and expert-annotated corpus sets a robust baseline for evaluating legal NLP systems, supporting reproducible research, fair model comparison, and further development in the field. A plausible implication is that as legal NLP tasks expand in scope and complexity—incorporating multilingual, jurisdiction-specific, and fine-grained legal tasks—the methodological frameworks validated by MLEB will frame future benchmark expansions and inform the development of better legal AI systems.


MLEB is recognized as the most complete, quality-assured, jurisdictionally and topically diverse open benchmark for legal embedding evaluation, enabling robust and reproducible research in legal NLP and legal information retrieval (Butler et al., 22 Oct 2025).
