BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation (2402.03216v4)
Abstract: In this paper, we present a new embedding model called M3-Embedding, distinguished by its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It supports more than 100 working languages, achieving new state-of-the-art results on multilingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval, providing a unified model foundation for real-world IR applications. It can process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding rests on the following technical contributions. We propose a novel self-knowledge distillation approach in which the relevance scores from the different retrieval functionalities are integrated as a teacher signal to enhance training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of the embeddings. To the best of our knowledge, M3-Embedding is the first embedding model to realize such strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.
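To make the self-knowledge distillation idea concrete, below is a minimal, hypothetical PyTorch sketch of how the relevance scores from the three retrieval heads might be integrated into a teacher signal. The function name, the equal weighting of the three scores, and the simple averaging of loss terms are illustrative assumptions based on the abstract, not the FlagEmbedding implementation.

```python
import torch
import torch.nn.functional as F

def self_distill_loss(s_dense: torch.Tensor,
                      s_sparse: torch.Tensor,
                      s_multivec: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch of self-knowledge distillation over three
    retrieval heads. Each s_* is a [batch, n_candidates] matrix of
    query-candidate relevance scores; labels[i] is the index of the
    positive candidate for query i."""
    # Integrated teacher signal: combine the scores of all three
    # functionalities (equal weights assumed here for simplicity).
    s_teacher = (s_dense + s_sparse + s_multivec).detach()
    p_teacher = F.softmax(s_teacher / temperature, dim=-1)

    terms = []
    for s in (s_dense, s_sparse, s_multivec):
        # Hard-label contrastive term (InfoNCE as cross-entropy
        # over in-batch candidates).
        terms.append(F.cross_entropy(s, labels))
        # Soft-label distillation term: each head is pulled toward
        # the integrated score distribution.
        terms.append(-(p_teacher * F.log_softmax(s, dim=-1)).sum(-1).mean())
    return torch.stack(terms).mean()


# Toy usage: 4 queries, 8 candidates each, positives at index 0.
if __name__ == "__main__":
    b, n = 4, 8
    labels = torch.zeros(b, dtype=torch.long)
    scores = [torch.randn(b, n, requires_grad=True) for _ in range(3)]
    loss = self_distill_loss(*scores, labels=labels)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

The key property is that the teacher distribution is detached, so each head is trained both on the hard labels and toward the ensemble of all three scoring modes; this is what allows a single model to serve dense, sparse, and multi-vector retrieval at once.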
Authors: Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu