
E5-NL Models: Dutch Text Embeddings

Updated 22 September 2025
  • E5-NL models are Dutch text embedding systems that use dual initialization strategies—vocabulary trimming and transtokeniser mapping—to capture semantic nuances.
  • They are trained on a hybrid corpus of 620K human-annotated examples and 350K synthetic triplets, with dynamic hard negative mining optimizing retrieval performance.
  • Evaluations on the MTEB-NL benchmark demonstrate that trimmed vocabulary variants consistently outperform cross-lingual mapped models, setting new state-of-the-art standards.

E5-NL models constitute a suite of Dutch-specific text embedding models designed to address the chronic underrepresentation of Dutch in published multilingual embedding resources. Developed as compact, high-performance encoder models, E5-NL integrates and adapts innovations from the broader E5 model family for the linguistic and resource characteristics of Dutch. These models are released alongside the MTEB-NL benchmark, a comprehensive and diversified evaluation platform for Dutch-language embeddings and downstream NLP tasks (Banar et al., 15 Sep 2025).

1. Architecture and Initialization Strategies

E5-NL models implement a transformer encoder architecture that yields low-dimensional vector representations capturing semantic similarity between Dutch text segments. Two primary initialization strategies are explored:

  • e5-trm (Vocabulary Trimming):

This variant begins with a multilingual E5 model and systematically reduces its vocabulary from approximately 250,000 tokens to 50,000 tokens. Trimming shrinks the model by up to 66% for the small variant, 55% for the base variant, and 37% for the large variant, while adapting the lexical representation to maximize Dutch coverage and eliminate redundant parameters.

  • e5-t2t (Cross-lingual Mapping via Transtokeniser):

Here, the English E5 vocabulary is mapped onto the Dutch vocabulary of the Bertje model using the transtokeniser technique. This sometimes necessitates randomly initialized entries for out-of-vocabulary tokens, an unavoidable artifact of the mapping process but one that preserves cross-lingual alignment capabilities.

Both approaches instantiate encoder models that are further fine-tuned specifically for Dutch textual contexts.
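
As a rough illustration of the e5-trm idea, the following Python sketch trims a multilingual encoder's input embeddings to the sub-word tokens observed in Dutch text. The model name, toy corpus, and 50K cut-off are stand-ins rather than the authors' exact procedure, and rebuilding a matching trimmed tokenizer is omitted for brevity.

```python
# Minimal vocabulary-trimming sketch: keep only sub-word tokens that occur
# in Dutch text and copy their embedding rows into a smaller matrix.
# Model name and corpus are placeholders, not the exact E5-NL setup.
from collections import Counter

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-base")

dutch_corpus = ["Dit is een voorbeeldzin.", "Nog een Nederlandse zin."]  # stand-in corpus

# 1. Count sub-word token frequencies over the Dutch corpus.
counts = Counter()
for text in dutch_corpus:
    counts.update(tokenizer(text, add_special_tokens=False)["input_ids"])

# 2. Keep special tokens plus the most frequent ~50K Dutch tokens.
keep_ids = sorted(set(tokenizer.all_special_ids)
                  | {tok_id for tok_id, _ in counts.most_common(50_000)})

# 3. Slice the input embedding matrix down to the retained rows.
old_embeddings = model.get_input_embeddings().weight.data
new_embeddings = torch.nn.Embedding(len(keep_ids), old_embeddings.shape[1])
new_embeddings.weight.data = old_embeddings[keep_ids].clone()
model.set_input_embeddings(new_embeddings)
# A matching trimmed tokenizer must also be rebuilt so that token ids
# line up with the new embedding rows; omitted here.
```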

2. Training Data Composition and Optimization

Training leverages a hybrid corpus comprising both curated, human-annotated Dutch retrieval datasets and synthetic triplets:

  • Human-annotated sets:

Datasets such as mMARCO-NL, FEVER-NL, and HotPotQA-NL contribute approximately 620,000 examples, in line with established retrieval and QA evaluation protocols.

  • Synthetic sets:

Roughly 350,000 filtered triplets are generated using LLM-driven prompt pipelines, with sampling strategies reflecting empirical Dutch topic distributions (e.g., via MS MARCO queries classified with Google’s Content Classification API) and prompts adapted for regional variants (Flemish/Dutch references).
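
The paper describes this pipeline at a high level; the following is a hypothetical sketch of prompt-driven triplet generation, in which the prompt wording, the `llm` callable, and the JSON schema are illustrative assumptions rather than the authors' exact setup.

```python
import json
from typing import Callable, Dict

def generate_triplet(topic: str, region: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Ask an LLM for one (query, positive, hard negative) triplet in Dutch.

    `llm` is any text-in/text-out callable (e.g. a hosted chat model);
    the prompt wording and JSON schema here are illustrative only.
    """
    prompt = (
        f"Schrijf in het Nederlands ({region}) een zoekvraag over '{topic}', "
        "een passage die de vraag beantwoordt en een passage over hetzelfde "
        "onderwerp die de vraag NIET beantwoordt. "
        'Antwoord als JSON: {"query": ..., "positive": ..., "negative": ...}'
    )
    return json.loads(llm(prompt))

# Topics could be sampled to match an empirical Dutch topic distribution,
# and `region` alternated between "Nederland" and "Vlaanderen" for regional variety.
```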

A key methodology for robust retrieval performance is dynamic hard negative mining. Candidate negatives are scored with a teacher model (multilingual-e5-large-instruct), and TopK-STDMarginPos sampling discards negatives whose scores lie too close to the positive's, reducing the risk of false negatives while retaining hard, informative training examples.
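
A minimal sketch of margin-based hard negative selection in the spirit of TopK-STDMarginPos is shown below; the function name, top-k value, and standard-deviation margin are illustrative choices, not the paper's exact hyperparameters.

```python
import numpy as np

def mine_hard_negatives(pos_score: float,
                        candidate_scores: np.ndarray,
                        top_k: int = 8,
                        margin_stds: float = 1.0) -> np.ndarray:
    """Return indices of hard negatives whose teacher score stays a margin
    (expressed in standard deviations of the candidate scores) below the positive."""
    margin = margin_stds * candidate_scores.std()
    # Discard candidates scored too close to the positive: likely false negatives.
    eligible = np.where(candidate_scores < pos_score - margin)[0]
    # Among the remaining candidates, keep the top-k hardest (highest-scored).
    order = eligible[np.argsort(-candidate_scores[eligible])]
    return order[:top_k]

# Example: teacher scores for one query's candidate negatives.
pos = 0.82
cands = np.array([0.80, 0.79, 0.55, 0.41, 0.30, 0.78, 0.62])
print(mine_hard_negatives(pos, cands, top_k=3))  # hardest negatives that clear the margin
```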

Optimization employs the InfoNCE loss:

$$
L = -\log \left[ \frac{\exp(s(q, p)/\tau)}{\exp(s(q, p)/\tau) + \sum_{n} \exp(s(q, n)/\tau)} \right]
$$

where $s(\cdot,\cdot)$ is typically cosine similarity, and $\tau$ is a temperature hyperparameter. Training is conducted in source-homogeneous batches (single dataset per batch) to minimize positive-negative collision, with learning rates ranging from $2\times10^{-6}$ (large models) to $1\times10^{-5}$ (smaller ones). Supervised models are trained for one epoch; self-supervised ones for three.
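
For concreteness, a minimal PyTorch sketch of the InfoNCE objective with cosine similarity and temperature scaling is given below; the temperature value and the single-query formulation are simplifying assumptions rather than the exact training configuration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query: torch.Tensor,
                  positive: torch.Tensor,
                  negatives: torch.Tensor,
                  temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE over one query, its positive, and a set of hard negatives.

    query:     (d,)   embedding
    positive:  (d,)   embedding
    negatives: (k, d) embeddings
    """
    pos_sim = F.cosine_similarity(query, positive, dim=0) / temperature
    neg_sim = F.cosine_similarity(query.unsqueeze(0), negatives, dim=1) / temperature
    # The positive must win the softmax over {positive} plus all negatives.
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])
    return -F.log_softmax(logits, dim=0)[0]

q, p = torch.randn(768), torch.randn(768)
negs = torch.randn(7, 768)
print(info_nce_loss(q, p, negs).item())
```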

3. Evaluation and Performance on MTEB-NL

The Massive Text Embedding Benchmark for Dutch (MTEB-NL) serves as the standard evaluation framework. It aggregates 40 datasets—both existing and newly created—encompassing retrieval, classification, clustering, and semantic textual similarity tasks. The evaluation assesses models using average dataset and task scores (AvgD, AvgT). Notably:

  • E5-NL models, even at reduced parameter counts, consistently achieve state-of-the-art results among all non-instruct models.
  • The e5-small-trm-nl and base-trm-nl variants frequently outperform larger general-purpose or multilingual models, underscoring the advantage of domain-specific vocabulary and training regimes.
  • The -trm (trimmed vocabulary) initialization typically outperforms -t2t (transtokeniser-based) approaches, suggesting that aggressive domain-specific trimming yields meaningful performance gains when sufficient monolingual data is available.
| Model Variant   | Parameters | Notable Feature            | Average MTEB-NL Score       |
|-----------------|------------|----------------------------|-----------------------------|
| e5-small-trm-nl | Small      | Trimmed vocabulary         | Highest among small models  |
| e5-base-trm-nl  | Base       | Trimmed vocabulary         | Strong across tasks         |
| e5-large-trm-nl | Large      | Trimmed vocabulary         | Top absolute scores         |
| e5-small-t2t-nl | Small      | Cross-lingual mapped vocab | Lower than -trm counterpart |

The table summarizes the key model variants and their distinguishing characteristics on MTEB-NL.
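
For practitioners, usage follows the standard E5 convention of prefixing inputs with "query:" and "passage:". The sketch below assumes a sentence-transformers-compatible checkpoint and uses a placeholder model identifier.

```python
# Illustrative usage with sentence-transformers; the model identifier is a
# placeholder, and the "query:"/"passage:" prefixes follow the usual E5 convention.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("path/or/hub-id-of-e5-base-trm-nl")  # placeholder id

queries = ["query: Wat is de hoofdstad van Nederland?"]
passages = [
    "passage: Amsterdam is de hoofdstad van Nederland.",
    "passage: De Eiffeltoren staat in Parijs.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Cosine similarity equals the dot product for normalized embeddings.
scores = util.cos_sim(q_emb, p_emb)
print(scores)  # the first passage should score higher
```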

4. Innovations and Unique Methodological Advances

Distinct features in E5-NL development include:

  • Dual initialization strategies:

Both vocabulary trimming and cross-lingual mappings are explored, offering a modular approach to transferring embedding models to underrepresented languages.

  • Synthetic data pipeline:

The use of prompt-engineered LLMs for synthetic triplet generation allows for precise control over regional context, topic distribution, and dataset balance.

  • Dynamic hard negative mining:

The TopK-STDMarginPos sampling strategy, guided by a teacher model, enables the model to avoid ambiguous negative samples—a frequent source of error in dense retrieval settings.

  • Efficient, compact model design:

The aggressive pruning of non-Dutch vocabulary and judicious model sizing deliver strong performance while significantly reducing computational requirements, supporting broader deployment.

5. MTEB-NL as a Benchmarking Resource

MTEB-NL, inspired by the original MTEB paradigm, is specifically constructed to overcome Dutch's underrepresentation in multilingual benchmarks. It includes both translated and natively sourced datasets, arranged across seven task categories for robust zero-shot and task-transfer evaluation. The availability of MTEB-NL supports rigorous, targeted evaluation for Dutch and bridges the gap between broad multilingual models and language-specialized techniques.

Evaluations indicate that supervised fine-tuning on targeted and synthetic data yields consistent improvements, especially in retrieval and similarity tasks, over self-supervised pretraining. This finding reinforces the value of language-tailored data curation and model adaptation for performance-critical settings.
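
Assuming the models are released as sentence-transformers-compatible checkpoints, an evaluation run against Dutch tasks could look like the sketch below, using the open-source mteb package; the model identifier and the task selection are placeholders rather than the official MTEB-NL configuration.

```python
# Hypothetical evaluation sketch with the `mteb` package; model id and task
# selection are placeholders, not the official MTEB-NL task list.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("path/or/hub-id-of-e5-base-trm-nl")  # placeholder

# Select Dutch-language tasks (ISO 639-3 code "nld").
tasks = mteb.get_tasks(languages=["nld"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/e5-nl")
```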

6. Implications and Directions for Dutch NLP

The introduction of the E5-NL models and the MTEB-NL benchmark marks a substantive advance for Dutch-language NLP:

  • The availability of compact, effective models like e5-small-trm-nl democratizes high-quality embeddings for applications with limited resources.
  • The tailored data generation and robust negative mining protocols provide a blueprint for developing embeddings in other underrepresented languages.
  • Future research avenues include developing instruct models for Dutch (building on successes in multilingual and English contexts), reducing reliance on machine-translated training data, expanding the diversity of native resources, and exploring transfer from Dutch generative models (such as ChocoLLama) for embedding use cases.
  • A plausible implication is that comprehensive evaluation infrastructure (such as MTEB-NL) combined with efficient training protocols is critical for advancing semantic modeling in languages with relatively scarce resources.

7. Summary

E5-NL models exemplify a systematic adaptation of state-of-the-art dense embedding methodologies for Dutch, incorporating dual initialization, synthetic data augmentation, and advanced negative sampling. The MTEB-NL benchmark ensures that evaluations reflect real-world Dutch NLP demands across a spectrum of tasks. Together, these resources enable both practical applications, such as legal and medical information retrieval or semantic clustering, and scientific study, offering a foundation for ongoing research and development in Dutch language technologies (Banar et al., 15 Sep 2025).
