BGE-M3: Multilingual Bi-Encoder Model

Updated 6 December 2025
  • BGE-M3 is a high-capacity, multilingual bi-encoder model enabling dense, sparse, and multi-vector retrieval across text granularities up to 8,192 tokens.
  • It integrates three parallel retrieval heads and leverages self-knowledge distillation to achieve state-of-the-art performance on monolingual, cross-lingual, and long-document tasks.
  • Widely deployed in retrieval-augmented generation pipelines, BGE-M3 delivers superior precision, recall, and faithfulness across over 100 languages.

BGE-M3 (“BGE M3-Embedding”) is a high-capacity, multilingual, multi-functionality bi-encoder embedding model that enables dense, sparse, and multi-vector retrieval at multiple text granularities (sentence, passage, document) and supports input lengths up to 8,192 tokens. Developed using self-knowledge distillation across retrieval paradigms, BGE-M3 delivers state-of-the-art performance across monolingual, cross-lingual, and long-document retrieval tasks. It is widely deployed as the embedding backbone in contemporary Retrieval-Augmented Generation (RAG) pipelines and IR architectures for over 100 languages, including low- and high-resource settings (Chen et al., 5 Feb 2024, Alsubhi et al., 1 Jun 2025, Yang et al., 8 Jan 2025).

1. Model Architecture and Retrieval Paradigms

BGE-M3 is fundamentally a bi-encoder Transformer derived from XLM-RoBERTa-large (24 layers, 1024 hidden dimensions, 16 attention heads) and further pre-trained with the RetroMAE masked auto-encoding objective. The architecture is distinguished by three parallel retrieval heads attached to the final encoder layer: dense, sparse (lexical), and multi-vector (Chen et al., 5 Feb 2024).

  • Dense retrieval: Uses the [CLS] token representation (projected and normalized), scoring via inner-product similarity.
  • Sparse retrieval: Learns lexical weights for each token position and aggregates for overlapping terms.
  • Multi-vector: Projects all per-token vectors, applies row normalization, and aggregates via maximal late interaction as in ColBERT.

For query $q$ and passage $p$ with per-token hidden states $H_q, H_p$, the three scores are computed as:

  • $s_{\text{dense}}(q,p) = \langle e_q, e_p \rangle$, where $e_q = \mathrm{norm}(H_q[0])$,
  • $s_{\text{lex}}(q,p) = \sum_{t \in q \cap p} w_{q,t} \cdot w_{p,t}$, with $w_{q,t} = \mathrm{ReLU}(w_{\text{lex}}^{T} H_q[i])$ for a term $t$ occurring at position $i$,
  • $s_{\text{mul}}(q,p) = \frac{1}{|q|} \sum_{i=1}^{|q|} \max_j \left( E_q[i] \cdot E_p[j] \right)$, using $E_q = \mathrm{norm}(H_q W_{\text{mul}})$.
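
The following PyTorch sketch mirrors the three scoring functions above; the projection parameters (`W_lex`, `W_mul`), their dimensions, and the handling of repeated terms in the lexical score are illustrative assumptions, not the released implementation:

```python
# Hedged PyTorch sketch of the three BGE-M3 scoring heads.
# W_lex / W_mul are randomly initialized here for illustration; in practice they are learned.
import torch
import torch.nn.functional as F

hidden, d_mul = 1024, 1024          # hidden size of XLM-RoBERTa-large; projection dim assumed
W_lex = torch.randn(hidden, 1)      # lexical-weight head
W_mul = torch.randn(hidden, d_mul)  # multi-vector projection

def dense_score(H_q: torch.Tensor, H_p: torch.Tensor) -> torch.Tensor:
    # e = normalized [CLS] (position 0) representation; score is the inner product
    e_q = F.normalize(H_q[0], dim=-1)
    e_p = F.normalize(H_p[0], dim=-1)
    return torch.dot(e_q, e_p)

def lexical_score(H_q, H_p, q_ids, p_ids) -> torch.Tensor:
    # ReLU token weights, summed over terms shared by query and passage
    w_q = torch.relu(H_q @ W_lex).squeeze(-1)   # [len_q]
    w_p = torch.relu(H_p @ W_lex).squeeze(-1)   # [len_p]
    score = torch.tensor(0.0)
    for t in set(q_ids.tolist()) & set(p_ids.tolist()):
        # if a term repeats, keep its maximum weight on each side (assumption)
        score = score + w_q[q_ids == t].max() * w_p[p_ids == t].max()
    return score

def multi_vector_score(H_q, H_p) -> torch.Tensor:
    # ColBERT-style late interaction: max similarity per query token, averaged
    E_q = F.normalize(H_q @ W_mul, dim=-1)
    E_p = F.normalize(H_p @ W_mul, dim=-1)
    return (E_q @ E_p.T).max(dim=1).values.mean()
```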

Any combination of dense, sparse, and multi-vector heads can be used for candidate retrieval and reranking. Passage inputs up to 8,192 tokens are accommodated through extended positional embeddings and batching strategies designed for efficient long-document processing (Chen et al., 5 Feb 2024).

2. Training Objectives and Self-Knowledge Distillation

BGE-M3 leverages a combination of contrastive (InfoNCE) loss and self-knowledge distillation. For each retrieval mode $* \in \{\text{dense}, \text{lex}, \text{mul}\}$, the InfoNCE loss is applied as:

$$L_* = -\log \frac{\exp\left(s_*(q, p^+) / \tau\right)}{\sum_{p \in \{p^+\} \cup P^-} \exp\left(s_*(q, p) / \tau\right)},$$

where $P^-$ is the set of in-batch negatives and $\tau$ is a temperature. Self-knowledge distillation is implemented by constructing a teacher signal $s_{\text{inter}} = s_{\text{dense}} + s_{\text{lex}} + s_{\text{mul}}$ and distilling its softened distribution into each retrieval head, minimizing the KL divergence between teacher and student distributions; the sum of these KL terms is denoted $L'$. The combined loss is $L_{\text{final}} = L_{\text{dense}} + L_{\text{lex}} + L_{\text{mul}} + L'$ (Chen et al., 5 Feb 2024).
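
A minimal sketch of this objective is given below, assuming per-query score tensors of shape `[batch, 1 + num_negatives]` with the positive passage in column 0; the temperature value and the equal weighting of the terms are illustrative, not the paper's exact hyperparameters:

```python
# Hedged sketch: per-head InfoNCE plus self-knowledge distillation from the summed teacher score.
import torch
import torch.nn.functional as F

def info_nce(scores: torch.Tensor, tau: float = 0.02) -> torch.Tensor:
    # column 0 holds s_*(q, p+); the remaining columns hold negatives
    targets = torch.zeros(scores.size(0), dtype=torch.long)
    return F.cross_entropy(scores / tau, targets)

def bgem3_loss(s_dense, s_lex, s_mul, tau: float = 0.02) -> torch.Tensor:
    # teacher signal: sum of the three heads' scores (s_inter), softened and detached
    teacher = F.softmax((s_dense + s_lex + s_mul) / tau, dim=-1).detach()

    def kl_to_teacher(scores):
        return F.kl_div(F.log_softmax(scores / tau, dim=-1), teacher, reduction="batchmean")

    l_contrastive = info_nce(s_dense, tau) + info_nce(s_lex, tau) + info_nce(s_mul, tau)
    l_distill = kl_to_teacher(s_dense) + kl_to_teacher(s_lex) + kl_to_teacher(s_mul)  # L'
    return l_contrastive + l_distill
```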

Large-batch strategies and in-batch negatives are made practical by grouping samples with similar lengths and exploiting sub-batching with gradient checkpointing, achieving high throughput even for long-sequence examples.
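
A rough sketch of the length-grouping idea follows; the tokenizer call assumes the Hugging Face convention, and sub-batching with gradient checkpointing is omitted for brevity:

```python
# Sort examples by tokenized length so each batch holds similar-length sequences
# and padding overhead stays low (a simplification of the paper's batching strategy).
def length_grouped_batches(examples, tokenizer, batch_size):
    lengths = [len(tokenizer(text)["input_ids"]) for text in examples]
    order = sorted(range(len(examples)), key=lambda i: lengths[i])
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]
```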

3. Multilingual and Multi-Granular Capabilities

The model was pretrained on approximately 1.2 billion unsupervised text pairs and further fine-tuned on English, Chinese, and multilingual labeled retrieval datasets. The training data draws on Wikipedia, mC4, CC-News, S2ORC, and other corpora, supplemented with parallel corpora and synthetic question-answer pairs (Chen et al., 5 Feb 2024, Alsubhi et al., 1 Jun 2025):

  • Multilinguality: All languages are mapped into a unified semantic space by sharing the Transformer encoder weights and retrieval heads, ensuring robust cross-lingual alignment.
  • Multi-granularity: BGE-M3 encodes sentences, paragraphs, and full documents, supporting both coarse-grained (e.g., document) and fine-grained (e.g., phrase/sentence) semantic search.
  • Long-text support: Inputs up to 8,192 tokens are handled, with the inference-time option of inserting additional [CLS] tokens at regular intervals for MCLS pooling (sketched below).
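
A hedged sketch of MCLS-style pooling follows; it assumes the sequence already begins with a [CLS] token, and the stride of 256 and token handling are illustrative assumptions:

```python
# Insert an extra [CLS] token every `stride` tokens at inference time and average
# the hidden states at all [CLS] positions to obtain the long-text embedding.
import torch

def insert_mcls(input_ids: list[int], cls_id: int, stride: int = 256):
    # assumes input_ids[0] is already a [CLS] token
    out, cls_positions = [], [0]
    for i, tok in enumerate(input_ids):
        if i > 0 and i % stride == 0:
            cls_positions.append(len(out))
            out.append(cls_id)
        out.append(tok)
    return out, cls_positions

def mcls_pool(hidden_states: torch.Tensor, cls_positions: list[int]) -> torch.Tensor:
    # hidden_states: [seq_len, hidden]; average the representations at all [CLS] slots
    return hidden_states[torch.tensor(cls_positions)].mean(dim=0)
```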

Self-knowledge distillation across input granularities enforces representational consistency regardless of input span, shown to be particularly beneficial when handling morphologically rich languages such as Arabic (Alsubhi et al., 1 Jun 2025).

4. Integration into Retrieval-Augmented Generation Pipelines

BGE-M3 serves as a core embedding backbone in modern RAG systems. The standard procedure involves:

  1. Chunking: Source documents are segmented (often with sentence-aware chunking) to preserve semantic boundaries.
  2. Indexing: Each chunk is embedded via a BGE-M3 forward pass; embeddings are stored in a FAISS index (cosine similarity, 10K vectors/shard typical).
  3. Retrieval: At query time, the user input is encoded; the top-$k$ (typically $k=10$) nearest neighbors are retrieved (see the sketch after this list).
  4. Reranking (optional): Retrieved candidates can be reranked by downstream cross-encoders (e.g., bge-reranker-v2-m3).
  5. Answer generation: Final contexts are provided to generation models (e.g., Aya-8B, StableLM) (Alsubhi et al., 1 Jun 2025, Yang et al., 8 Jan 2025).
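
A minimal sketch of the indexing and retrieval steps (2–3) with FAISS is shown below; `chunk_embeddings` and `query_embedding` stand in for BGE-M3 dense outputs, and realizing cosine similarity as inner product over L2-normalized vectors is an implementation assumption:

```python
import numpy as np
import faiss

DIM = 1024  # BGE-M3 dense embedding dimensionality

def build_index(chunk_embeddings: np.ndarray) -> faiss.Index:
    vectors = np.ascontiguousarray(chunk_embeddings, dtype="float32")
    faiss.normalize_L2(vectors)          # normalize so inner product == cosine similarity
    index = faiss.IndexFlatIP(DIM)
    index.add(vectors)
    return index

def retrieve(index: faiss.Index, query_embedding: np.ndarray, k: int = 10):
    q = np.ascontiguousarray(query_embedding.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)     # top-k nearest chunks
    return ids[0], scores[0]
```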

A summary of this pipeline is shown below:

| Stage | Key Component | Setting/Detail |
| --- | --- | --- |
| Chunking | Sentence-aware | Full sentences, avg. ~30 tokens/chunk |
| Indexing | FAISS (cosine/inner-product) | 10K vectors/shard |
| Retrieval | Top-$k$ (bi-encoder) | $k=10$ per query |
| Reranking | bge-reranker-v2-m3 (optional) | Rescore top-10 only |
| Generation | LLM (Aya-8B, StableLM, etc.) | Uses reranked contexts |

Sentence-aware chunking is empirically optimal for Arabic, maximizing semantic integrity and retrieval robustness (Alsubhi et al., 1 Jun 2025).
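
A minimal sketch of sentence-aware chunking is given below; the regex splitter and the ~30-token budget are illustrative assumptions:

```python
# Pack whole sentences into chunks up to a token budget, so no sentence is split mid-way.
import re

def sentence_aware_chunks(text: str, max_tokens: int = 30) -> list[str]:
    # split on ., !, ? and the Arabic question mark, keeping sentences intact
    sentences = re.split(r"(?<=[.!?\u061F])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())            # crude whitespace token count
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```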

5. Empirical Performance and Benchmarking

BGE-M3 consistently achieves state-of-the-art retrieval precision, recall, and answer quality across monolingual, cross-lingual, and long-context retrieval tasks. In the context of RAG for Arabic, it outperforms or matches leading multilingual baselines such as multilingual-e5-large:

  • Mean RAGAS Score (six Arabic datasets): BGE-M3: 70.99, multilingual-e5-large: 70.31; best among all baselines (Alsubhi et al., 1 Jun 2025).
  • Precision/Recall (mean): BGE-M3 Precision ≈ 74.2, Recall ≈ 71.8; relative gains over e5-large.
  • Faithfulness/Relevancy: BGE-M3 leads with +1.5 points in faithfulness and +0.8 in relevancy.
  • Per-dataset wins: Notable on retrieval-heavy and high-inflection datasets (e.g., ARCD, Quran Tafseer).

On MIRACL (nDCG@10, 18 languages), M3-dense yields 67.8 (vs. E5-large 65.4); combining all retrieval heads achieves 70.0. Cross-lingual benchmarks, such as MKQA (Recall@100), also show substantial improvements (Chen et al., 5 Feb 2024).

| Model | Mean Score (Arabic 6-dataset, RAGAS) |
| --- | --- |
| Snowflake-arctic-embed... | 69.48 |
| Arabic-mpnet-base-all-nli | 45.92 |
| Arabic-Triplet-Matryoshka | 66.46 |
| gte-multilingual-base | 68.48 |
| multilingual-e5-large | 70.31 |
| BGE-M3 | 70.99 |

Table: Summary of RAGAS scores for Arabic RAG pipeline (Alsubhi et al., 1 Jun 2025).

6. Practical Deployment, Recommendations, and Limitations

Deployment encompasses high-throughput, sub-second first-stage retrieval at scale ($\sim 10^6$ docs/sec/GPU for dense/sparse), with reranking reserved for latency-tolerant passes (e.g., top-200). The core model is ~550M parameters; inference for a 512-token query typically requires 5–10 ms on an A100 (Chen et al., 5 Feb 2024).

Practical recommendations for maximizing BGE-M3 effectiveness:

  • Use sentence-aware chunking for Arabic and languages with variable word order or morphology.
  • Index embeddings in FAISS with cosine similarity; default to $k=10$ retrieval.
  • Employ reranking (bge-reranker-v2-m3) where faithfulness or cross-paragraph reasoning is required (see the sketch after this list).
  • For long-document domains, hybrid chunking strategies (mixing sentence and semantic boundaries) may further boost performance (Alsubhi et al., 1 Jun 2025).
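
A hedged sketch of the optional reranking step follows, assuming the FlagEmbedding `FlagReranker` interface (verify the exact signature against the library documentation):

```python
# Rescore the top-k bi-encoder candidates with the bge-reranker-v2-m3 cross-encoder.
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def rerank(query: str, candidates: list[str], top_n: int = 10):
    pairs = [[query, passage] for passage in candidates]
    scores = reranker.compute_score(pairs)   # higher score = more relevant
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [(candidates[i], scores[i]) for i in order[:top_n]]
```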

Limitations and outstanding issues:

  • Domain adaptation (fine-tuning on legal, medical, or domain-specific corpora) is a suggested avenue for further improvement.
  • All Arabic experiments are on Modern Standard Arabic; performance on dialect variants remains unevaluated.
  • RAGAS provides automated faithfulness/relevancy scoring; critical applications warrant complementary manual validation.
  • Some index-level hyperparameters and model-internal details (e.g., vocabulary size, tokenizer specifics) are not always provided in downstream RAG studies and should be referenced from the primary BGE-M3 release (Chen et al., 5 Feb 2024, Yang et al., 8 Jan 2025, Alsubhi et al., 1 Jun 2025).

7. Broader Impact and Applicability

As a unified, multilingual, multi-retrieval embedding backbone, BGE-M3 is central to retrieval-based LLM applications. Its architecture enables:

  • Flexible retrieval across IR and QA scenarios, spanning short queries to ultra-long passages.
  • Dense, sparse, and hybrid retrieval at scale, improving recall and answer faithfulness in knowledge-intensive pipelines.
  • High accuracy in both resource-rich and low-resource language settings, enhancing information access equity (Chen et al., 5 Feb 2024, Alsubhi et al., 1 Jun 2025).

BGE-M3 is increasingly adopted in both open-domain information retrieval and domain-specific RAG architectures, including privacy-sensitive knowledge bases that operate locally without uploading queries, as exemplified by recent legal and finance RAG applications (Yang et al., 8 Jan 2025).

In summary, BGE-M3 represents an evolution in multilingual embedding models, distinguished by its multi-functionality and self-knowledge-distillation training, yielding superior retrieval performance across modalities, domains, and linguistic boundaries.
