Introduction
In the field of text embeddings, developing models that perform well not only in English but across many languages is essential for building globally accessible technologies. This technical report describes the creation and evaluation of the multilingual E5 (mE5) text embedding models, which apply and extend the methodology behind the successful English E5 models to the multilingual setting. Notably, the release includes an instruction-tuned variant designed to close the performance gap often seen between multilingual and English-centric models.
Training Methodology
The mE5 models were trained in two stages. First, they underwent weakly supervised contrastive pre-training on approximately 1 billion multilingual text pairs drawn from a diverse set of sources, ranging from Wikipedia to community platforms such as Stack Exchange and Reddit, providing broad linguistic and domain variety.
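The contrastive pre-training objective can be illustrated with a minimal numpy sketch of an InfoNCE loss over in-batch negatives; the function name and temperature value below are illustrative assumptions, not details taken from the report:

```python
import numpy as np

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    """InfoNCE loss with in-batch negatives.

    Row i of query_emb and passage_emb form a positive pair; every
    other passage in the batch serves as a negative for query i.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    sim = (q @ p.T) / temperature          # (batch, batch) similarity matrix
    sim -= sim.max(axis=1, keepdims=True)  # stabilise the softmax
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Cross-entropy against the diagonal, i.e. the true pairings.
    return float(-np.mean(np.diag(log_softmax)))
```

Matched pairs drive the loss toward zero while mismatched in-batch pairs are penalised, pulling translations and paraphrases together in embedding space.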
Subsequently, the models received supervised fine-tuning on a mixture of high-quality labeled datasets comprising around 1.6 million samples spanning various data types and tasks, refining their performance on specific, high-value tasks. Notably, the instruction-tuned mE5-large-instruct model was additionally trained on 500k synthetic examples generated with GPT-3.5/GPT-4, covering 93 languages and 150k unique instructions, which expands language coverage and improves handling of nuanced embedding tasks.
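At inference time, instruction-tuned E5-style models prefix each query with a natural-language task description, while documents are embedded as plain text. A minimal sketch of this template (the exact wording should be checked against the published model card; the helper name is ours):

```python
def build_instructed_query(task_description: str, query: str) -> str:
    """Prefix a query with its task instruction before embedding.

    Only queries receive the prefix; candidate documents are
    embedded without any instruction.
    """
    return f"Instruct: {task_description}\nQuery: {query}"

# Example task description for web-search-style retrieval.
task = "Given a web search query, retrieve relevant passages that answer the query"
print(build_instructed_query(task, "how do neural networks learn"))
```

Varying the task description lets one model serve retrieval, classification, and clustering without retraining.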
Experimental Results
The evaluation of mE5 models involved both English-centric benchmarks and multilingual assessments to verify their efficacy in diverse linguistic contexts.
- English Text Embedding Benchmark: On the MTEB benchmark, the mE5-large-instruct model delivered the strongest results, outperforming both strong English-only models and previous state-of-the-art multilingual models. This demonstrates the effectiveness of instruction tuning for producing high-quality embeddings across languages.
- Multilingual Retrieval: On the MIRACL benchmark, the mE5 models significantly outperformed baseline models across 16 languages, underscoring their robustness and versatility on retrieval tasks in a wide array of languages.
- Bitext Mining: In bitext mining, the task of identifying translation pairs across languages, the mE5 models again performed strongly. The instruction-tuned mE5-large-instruct model notably surpassed LaBSE, a model purpose-built for bitext mining, highlighting the value of instruction tuning and synthetic data, including for languages with limited resources.
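Retrieval quality on benchmarks such as MIRACL is typically reported as nDCG@10. A small sketch of that metric for a single query, given relevance labels in ranked order (the helper name is ours; MIRACL uses binary labels, for which this linear-gain form is standard):

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query given relevance labels in ranked order."""
    rel = np.asarray(relevances, dtype=float)
    top = rel[:k]
    # Logarithmic position discount: 1 / log2(rank + 1), ranks start at 1.
    discounts = 1.0 / np.log2(np.arange(2, top.size + 2))
    dcg = float((top * discounts).sum())
    # Ideal DCG: the same labels sorted best-first.
    ideal = np.sort(rel)[::-1][:k]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0
```

For example, `ndcg_at_k([0, 1, 0])` rewards the relevant hit less for appearing at rank 2 than it would at rank 1.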
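Bitext mining itself reduces to nearest-neighbour search over sentence embeddings. A minimal mutual-nearest-neighbour sketch is shown below; this is a simplification for illustration, as standard evaluations typically use a margin-based scoring criterion rather than raw cosine similarity:

```python
import numpy as np

def mine_bitext_pairs(src_emb, tgt_emb):
    """Match source/target sentences that are mutual nearest
    neighbours under cosine similarity (a simplified mining rule)."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = s @ t.T
    fwd = sim.argmax(axis=1)  # best target for each source sentence
    bwd = sim.argmax(axis=0)  # best source for each target sentence
    # Keep a pair only when the preference is mutual.
    return [(i, int(j)) for i, j in enumerate(fwd) if bwd[j] == i]
```

The mutual-neighbour constraint suppresses spurious one-directional matches, which matters most for low-resource languages where embedding neighbourhoods are noisier.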
Conclusion
The mE5 text embedding models represent a significant advancement in multilingual text embeddings. By providing models at several sizes and introducing an instruction-tuned variant, this work both raises the bar for embedding quality and broadens the accessibility and applicability of AI technologies across languages. The public release of these models sets the stage for further innovation in multilingual information retrieval, semantic similarity assessment, and text clustering. With strong results on both English-centric and multilingual benchmarks, the mE5 models mark an important step toward truly global AI solutions.