MTEB-NL: Dutch Embedding Benchmark

Updated 22 September 2025
  • MTEB-NL is a standardized benchmark suite dedicated to evaluating Dutch text embeddings across diverse tasks.
  • It aggregates 40 datasets from legacy and new sources into 7 task classes, ensuring robust and fine-grained performance assessment.
  • The initiative introduces efficient E5-NL models with language-specific adaptations that lower resource demands while improving accuracy.

The Massive Text Embedding Benchmark for Dutch (MTEB-NL) is a comprehensive suite for evaluating text embedding models on a wide spectrum of tasks in the Dutch language. Conceived to address the chronic underrepresentation of Dutch in multilingual NLP resources, MTEB-NL builds on prior benchmarks by aggregating, standardizing, and expanding Dutch data, providing a unified framework for zero-shot and fine-tuned assessment of embedding models. This initiative includes both task benchmarks and the release of optimized Dutch-specific models, notably the E5-NL series, offering practitioners a reproducible platform for large-scale comparison and advancement in Dutch NLP (Banar et al., 15 Sep 2025).

1. Scope and Composition of the MTEB-NL Benchmark

MTEB-NL extends the original MTEB framework, introducing language-specialized coverage:

  • Datasets: 40 Dutch datasets, sourced from legacy resources (BEIR-NL, DUMB) and newly curated collections, all harmonized to the MTEB format.
  • Task Types: Seven core classes:
    • Classification (binary and multi-class)
    • Multi-label classification
    • Pair classification (binary judgments of semantic equivalence between text pairs)
    • Retrieval (query-document relevance)
    • Reranking (ordering by relevance)
    • Clustering (document grouping)
    • Semantic Textual Similarity (STS)

Each dataset is annotated and quality-checked for language-specific nuances and provenance. The collection spans both general-domain and specialized subtasks, supporting fine-grained analysis of Dutch embedding models.
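
As a concrete illustration, the Dutch tasks can be enumerated through the mteb Python package, into which MTEB-NL is integrated (see Section 5). This is a minimal sketch assuming a recent mteb release; "nld" is the ISO 639-3 code the package uses for Dutch.

```python
import mteb

# Enumerate Dutch-language tasks registered in the mteb package.
tasks = mteb.get_tasks(languages=["nld"])

# Group tasks by type to mirror the MTEB-NL task classes.
by_type = {}
for task in tasks:
    by_type.setdefault(task.metadata.type, []).append(task.metadata.name)

for task_type, names in sorted(by_type.items()):
    print(f"{task_type}: {len(names)} tasks")
```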

2. E5-NL Model Family: Dutch-Adaptive Embedding Architectures

Complementing the benchmark, the authors introduce the E5-NL model family, a set of Dutch-adapted variants of the well-established E5 embedding architectures:

  • Vocabulary Adaptation:
    • Trimming (“-trm”): shrinks the vocabulary (e.g., 250K→50K tokens), improving efficiency while retaining sufficient lexical coverage for Dutch (see the sketch after this list).
    • Translation tokenization (“-t2t”): aligns the original English vocabulary indices with a Dutch lexicon, preserving compatibility with the pretrained embedding matrix.
  • Parameter Efficiency: Only the embedding matrices are updated; transformer layers remain intact.
  • Performance: E5-NL, especially “-trm-nl” variants, reach state-of-the-art results in Dutch retrieval, STS, clustering, and classification, with small and base models often outperforming prior Dutch-specific and even larger multilingual instruct-tuned systems.
  • Efficiency: Parameter reduction confers faster inference and lower resource demands without sacrificing accuracy.
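
A minimal sketch of the trimming idea follows, with illustrative names rather than the authors' code: only the embedding rows for frequent Dutch tokens are kept, and the transformer layers are untouched.

```python
import numpy as np

# Minimal sketch of “-trm” vocabulary trimming: keep only embedding rows
# for tokens frequent in a Dutch corpus (e.g., 250K -> 50K), leaving the
# transformer layers intact. Names are illustrative, not the authors' code.

def trim_embeddings(embeddings: np.ndarray,
                    vocab: dict,
                    dutch_token_counts: dict,
                    keep: int = 50_000):
    """Return a trimmed embedding matrix and the new token -> id mapping.

    In practice, special tokens ([CLS], [SEP], ...) must also be preserved.
    """
    ranked = sorted(dutch_token_counts, key=dutch_token_counts.get, reverse=True)
    kept = [tok for tok in ranked if tok in vocab][:keep]
    new_vocab = {tok: i for i, tok in enumerate(kept)}
    rows = [vocab[tok] for tok in kept]
    return embeddings[rows], new_vocab

# Toy demonstration: a 5-token vocabulary trimmed to 3 tokens.
emb = np.random.randn(5, 4)
vocab = {"de": 0, "het": 1, "fiets": 2, "xyz": 3, "qqq": 4}
counts = {"de": 100, "het": 80, "fiets": 10}
trimmed, new_vocab = trim_embeddings(emb, vocab, counts, keep=3)
print(trimmed.shape, new_vocab)  # (3, 4) {'de': 0, 'het': 1, 'fiets': 2}
```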

3. Training Dataset Compilation and Synthetic Data Generation

Training high-quality Dutch embedding models requires data that spans the full range of benchmark tasks:

  • Human-annotated Corpus: Sourced from mMARCO-NL, FEVER-NL, HotPotQA-NL, totaling ∼620K samples, primarily supporting retrieval.
  • Synthetic Augmentation: To bridge task diversity gaps, LLMs are prompted with templates to generate paired and triplet examples for tasks beyond retrieval.
    • Templates span short-long/long-short queries, symmetric/asymmetric pairs, and STS triplets.
    • Controlled topic sampling uses distributions from MS MARCO queries.
    • Prompt design incorporates Dutch regional entities, idiomatic usage, and task-specific signal.
  • Quality Filtering: Synthetic examples are screened with a reranking model using a margin-based heuristic,

$$0 < S(\text{positive}) - S(\text{negative}) < C,$$

where $S(\cdot)$ is the reranker relevance score and $C$ is a margin threshold. This retains challenging negatives while discarding trivial pairs (see the sketch after this list).

  • Dataset Size: The final blend (human-annotated plus filtered synthetic) comprises ∼950K samples.
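
A minimal sketch of this filter, assuming a sentence-transformers CrossEncoder as the reranker; the checkpoint name and margin value are placeholders, not the paper's choices:

```python
from sentence_transformers import CrossEncoder

# Margin-based filter 0 < S(pos) - S(neg) < C.
# Placeholder reranker; the paper's actual (Dutch-capable) model may differ.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
C = 5.0  # margin threshold (illustrative value)

def keep_triplet(query: str, positive: str, negative: str) -> bool:
    s_pos, s_neg = reranker.predict([(query, positive), (query, negative)])
    margin = s_pos - s_neg
    # Keep triplets where the positive wins, but not by so much
    # that the negative is trivially easy.
    return 0.0 < margin < C

triplets = [("Wat is de hoofdstad van Nederland?",
             "Amsterdam is de hoofdstad van Nederland.",
             "Rotterdam heeft de grootste haven van Europa.")]
filtered = [t for t in triplets if keep_triplet(*t)]
```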

4. Evaluation Protocols, Metrics, and Results

Models are evaluated on the MTEB-NL benchmark using standard metrics tailored to each task:

| Task Type | Metric | Description |
|---|---|---|
| Classification | F1-macro | Unweighted mean of per-class F1 scores |
| Retrieval | nDCG@10 | Normalized Discounted Cumulative Gain at rank 10 |
| Reranking | MAP | Mean Average Precision over candidate rankings |
| STS | Spearman correlation | Rank correlation between model and ground-truth similarity scores |
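
To make the metrics concrete, here is a toy computation of two of them with standard Python libraries (made-up numbers, not benchmark results):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import ndcg_score

# Retrieval: nDCG@10 for a single query with graded relevance labels.
gold_relevance = np.asarray([[3, 2, 0, 0, 1]])          # gold label per candidate
model_scores = np.asarray([[0.9, 0.7, 0.5, 0.4, 0.2]])  # model ranking scores
print("nDCG@10:", ndcg_score(gold_relevance, model_scores, k=10))

# STS: Spearman correlation between model similarities and gold scores.
gold_sims = [4.8, 3.1, 0.5, 2.2]
model_sims = [0.91, 0.64, 0.12, 0.47]
rho, _ = spearmanr(gold_sims, model_sims)
print("Spearman:", rho)
```
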
  • Optimization: Models are trained with the InfoNCE contrastive loss

$$\mathcal{L} = -\log\frac{\exp(\mathrm{sim}(q, p)/\tau)}{\sum_j \exp(\mathrm{sim}(q, p_j)/\tau)},$$

where $\mathrm{sim}(q, p)$ is the cosine similarity between query $q$ and passage $p$, $\tau$ is a temperature hyperparameter, and the denominator sums over the positive and all in-batch negatives (a minimal sketch follows this list).

  • Results: E5-NL models—especially “-trm-nl”—show superior scores across tasks compared to prior Dutch baselines, demonstrating robust semantic performance with lower computational footprint.
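
A minimal PyTorch sketch of this loss with in-batch negatives, illustrating the formula above rather than reproducing the authors' training code:

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, p: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """q, p: (batch, dim) embeddings; p[i] is the positive for q[i]."""
    # Normalize so dot products are cosine similarities.
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / tau            # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    # Row i treats p[i] as the positive and every other p[j] as a negative;
    # cross-entropy over the row implements the InfoNCE objective.
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
```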

5. Public Release and Reproducibility

All resources described—benchmark datasets, training samples, and model weights—are publicly accessible:

  • Distribution: Hugging Face Hub hosts MTEB-NL and E5-NL collections.
  • MTEB Package: Full integration provides plug-and-play model evaluation and benchmarking.
  • Documentation: Model and dataset documentation covers evaluation instructions and replicability protocols.

This open access enables researchers to train, fine-tune, and assess Dutch embedding models in a unified, standardized framework, supporting transparent comparative analysis and downstream deployment.
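
For instance, a model can be scored on the Dutch tasks with a few lines of mteb code. This is a minimal sketch in which the model checkpoint is a placeholder; substitute an E5-NL identifier from the Hugging Face Hub.

```python
import mteb
from sentence_transformers import SentenceTransformer

# Placeholder checkpoint; swap in an E5-NL model ID from the Hub.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# Select Dutch tasks, optionally restricted to specific task types.
tasks = mteb.get_tasks(languages=["nld"], task_types=["STS", "Retrieval"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/e5-nl-eval")
```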

6. Context, Significance, and Implications

MTEB-NL addresses a gap in standardized Dutch NLP evaluation, providing infrastructure for widespread benchmarking and model development. The combination of task diversity and language-specific adaptation mitigates historical limitations of multilingual resources, establishing Dutch as a first-class language for embedding research. The use of synthetic augmentation expands task breadth, allowing embedding models to generalize beyond traditional retrieval, while the release of compact Dutch models establishes a benchmark for efficiency. A plausible implication is that future Dutch NLP systems can be evaluated and developed with parity to high-resource languages, fostering quicker progress and broader adoption within the community.

7. Future Directions

The authors indicate several avenues for future enhancement:

  • Expanding the benchmark with additional annotated Dutch datasets.
  • Improving synthetic data generation and filtering, possibly adapting more advanced LLM prompting and negative sampling schemes.
  • Adapting model merging strategies to further boost multi-task generalization as identified in related works (Li et al., 19 Oct 2024).
  • Supporting direct benchmarking of Dutch models in larger multilingual contexts such as MMTEB (Enevoldsen et al., 19 Feb 2025).

These directions suggest that MTEB-NL will continue to evolve, strengthening Dutch NLP by integrating new methodological advances and expanding linguistic coverage.
