BGE-M3: Multilingual Bi-Encoder Model
- BGE-M3 is a high-capacity, multilingual bi-encoder model enabling dense, sparse, and multi-vector retrieval across text granularities up to 8,192 tokens.
- It integrates three parallel retrieval heads and leverages self-knowledge distillation to achieve state-of-the-art performance on monolingual, cross-lingual, and long-document tasks.
- Widely deployed in retrieval-augmented generation pipelines, BGE-M3 delivers superior precision, recall, and faithfulness across over 100 languages.
BGE-M3 (“BGE M3-Embedding”) is a high-capacity, multilingual, multi-functionality bi-encoder embedding model that enables dense, sparse, and multi-vector retrieval at multiple text granularities (sentence, passage, document) and supports input lengths up to 8,192 tokens. Developed using self-knowledge distillation across retrieval paradigms, BGE-M3 delivers state-of-the-art performance across monolingual, cross-lingual, and long-document retrieval tasks. It is widely deployed as the embedding backbone in contemporary Retrieval-Augmented Generation (RAG) pipelines and IR architectures for over 100 languages, including low- and high-resource settings (Chen et al., 5 Feb 2024, Alsubhi et al., 1 Jun 2025, Yang et al., 8 Jan 2025).
1. Model Architecture and Retrieval Paradigms
BGE-M3 is fundamentally a bi-encoder Transformer derived from XLM-RoBERTa-large (24 layers, 1024 hidden dimensions, 16 attention heads) and further pre-trained with the RetroMAE (retrieval-oriented masked auto-encoder) objective. The architecture is distinguished by three parallel retrieval heads attached to the final encoder layer: dense, sparse (lexical), and multi-vector (Chen et al., 5 Feb 2024).
- Dense retrieval: Uses the normalized [CLS] token representation, scoring query-passage pairs via inner-product similarity.
- Sparse retrieval: Learns a lexical weight for each token and scores by aggregating the weights of terms shared by query and passage.
- Multi-vector: Projects all per-token vectors, applies row normalization, and aggregates via maximal late interaction as in ColBERT.
For a query $q$ and passage $p$ with encoder hidden states $\mathbf{H}_q$ and $\mathbf{H}_p$, the three scores are computed as:
- Dense: $s_{\mathrm{dense}} = \langle e_q, e_p \rangle$, where $e_q = \mathrm{norm}(\mathbf{H}_q[0])$ and $e_p = \mathrm{norm}(\mathbf{H}_p[0])$ are the normalized [CLS] representations.
- Sparse: $s_{\mathrm{lex}} = \sum_{t \in q \cap p} w_{q,t} \cdot w_{p,t}$, with per-token weights $w_{q,t} = \mathrm{ReLU}(\mathbf{W}_{\mathrm{lex}}^{\top} \mathbf{H}_q[i])$ for the term $t$ at position $i$ (and analogously for the passage), summed over terms shared by query and passage.
- Multi-vector: $s_{\mathrm{mul}} = \frac{1}{N} \sum_{i=1}^{N} \max_{j=1}^{M} E_q[i] \cdot E_p[j]^{\top}$, using $E_q = \mathrm{norm}(\mathbf{H}_q \mathbf{W}_{\mathrm{mul}})$ and $E_p = \mathrm{norm}(\mathbf{H}_p \mathbf{W}_{\mathrm{mul}})$, where $N$ and $M$ are the query and passage lengths.
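These scoring rules can be illustrated with a short NumPy sketch. Here `W_lex` and `W_mul` stand in for the learned projection parameters of the sparse and multi-vector heads, and token handling is simplified; this is an illustrative sketch, not the reference implementation:

```python
# Illustrative NumPy sketch of the three BGE-M3 scoring functions.
# H_q, H_p: encoder hidden states of shape (seq_len, hidden).
# W_lex: (hidden, 1) and W_mul: (hidden, d) are stand-ins for the learned heads.
import numpy as np

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def dense_score(H_q, H_p):
    # Normalized [CLS] (position 0) representations, scored by inner product.
    return float(l2norm(H_q[0]) @ l2norm(H_p[0]))

def sparse_score(H_q, H_p, q_tokens, p_tokens, W_lex):
    # Per-token lexical weights via ReLU(W_lex^T h); sum products over shared terms.
    # For simplicity this uses the first occurrence of each shared term.
    w_q = np.maximum(H_q @ W_lex, 0.0).squeeze(-1)
    w_p = np.maximum(H_p @ W_lex, 0.0).squeeze(-1)
    shared = set(q_tokens) & set(p_tokens)
    return float(sum(w_q[q_tokens.index(t)] * w_p[p_tokens.index(t)] for t in shared))

def multi_vector_score(H_q, H_p, W_mul):
    # ColBERT-style late interaction: max over passage tokens, mean over query tokens.
    E_q, E_p = l2norm(H_q @ W_mul), l2norm(H_p @ W_mul)
    return float((E_q @ E_p.T).max(axis=1).mean())
```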
Any combination of dense, sparse, and multi-vector heads can be used for candidate retrieval and reranking. Passage inputs up to 8,192 tokens are accommodated through extended positional embeddings and batching strategies designed for efficient long-document processing (Chen et al., 5 Feb 2024).
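In practice, all three representations can be obtained from a single forward pass. The sketch below assumes the interface of the FlagEmbedding library that distributes BGE-M3 (BGEM3FlagModel and its helper scoring functions); key and argument names may differ between versions, and the fusion weights at the end are illustrative:

```python
# Sketch: one forward pass yields dense, sparse, and multi-vector outputs,
# assuming the FlagEmbedding BGEM3FlagModel interface.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
query = "What does self-knowledge distillation mean?"
passage = "Self-knowledge distillation integrates the scores of multiple retrieval heads ..."

out = model.encode(
    [query, passage],
    max_length=8192,           # long-document support
    return_dense=True,
    return_sparse=True,        # per-token lexical weights
    return_colbert_vecs=True,  # multi-vector (late-interaction) representations
)

dense_sim = float(out["dense_vecs"][0] @ out["dense_vecs"][1])
lex_sim = model.compute_lexical_matching_score(
    out["lexical_weights"][0], out["lexical_weights"][1]
)
mul_sim = float(model.colbert_score(out["colbert_vecs"][0], out["colbert_vecs"][1]))

# Illustrative weighted fusion of the three heads for hybrid retrieval/reranking.
hybrid = 0.4 * dense_sim + 0.2 * lex_sim + 0.4 * mul_sim
```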
2. Training Objectives and Self-Knowledge Distillation
BGE-M3 leverages a combination of contrastive (InfoNCE) loss and self-knowledge distillation. For each retrieval mode $s \in \{\mathrm{dense}, \mathrm{lex}, \mathrm{mul}\}$, the InfoNCE loss is applied as:

$$\mathcal{L}_{s} = -\log \frac{\exp(s(q, p^{*})/\tau)}{\exp(s(q, p^{*})/\tau) + \sum_{p' \in P'} \exp(s(q, p')/\tau)},$$

where $p^{*}$ is the annotated positive passage, $P'$ is the set of in-batch (and hard) negatives, and $\tau$ is a temperature. Self-knowledge distillation is implemented by constructing an integrated teacher signal $s_{\mathrm{inter}}$ as a weighted sum of the three head scores and distilling its softened distribution into each retrieval head, minimizing the KL divergence between teacher and student score distributions. The combined loss $\mathcal{L}_{\mathrm{final}} = (\mathcal{L} + \mathcal{L}')/2$ averages the contrastive terms $\mathcal{L}$ (over the three heads and the integrated score) with the distillation terms $\mathcal{L}'$ (Chen et al., 5 Feb 2024).
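A minimal PyTorch-style sketch of how these terms can be combined is shown below; the temperature, integration weights, and exact averaging are illustrative assumptions rather than the published hyperparameters:

```python
# Hedged sketch of InfoNCE + self-knowledge distillation over the three heads.
import torch
import torch.nn.functional as F

def infonce(scores: torch.Tensor, pos_idx: torch.Tensor, tau: float = 0.05):
    # scores: (batch, num_candidates) similarities; pos_idx: index of the positive.
    return F.cross_entropy(scores / tau, pos_idx)

def bgem3_loss(s_dense, s_lex, s_mul, pos_idx, w=(1.0, 0.3, 1.0), tau=0.05):
    # Integrated teacher score as a weighted sum of the three heads (weights assumed).
    s_inter = w[0] * s_dense + w[1] * s_lex + w[2] * s_mul
    # Contrastive (InfoNCE) terms for each head and for the integrated score.
    l_con = sum(infonce(s, pos_idx, tau) for s in (s_dense, s_lex, s_mul, s_inter)) / 4
    # Distillation: soften the detached teacher and push each head toward it (KL).
    teacher = F.softmax(s_inter.detach() / tau, dim=-1)
    l_dist = sum(
        F.kl_div(F.log_softmax(s / tau, dim=-1), teacher, reduction="batchmean")
        for s in (s_dense, s_lex, s_mul)
    ) / 3
    return (l_con + l_dist) / 2
```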
Large-batch strategies and in-batch negatives are made practical by grouping samples with similar lengths and exploiting sub-batching with gradient checkpointing, achieving high throughput even for long-sequence examples.
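The length-grouping idea can be sketched as follows; group sizes, the token budget, and the splitting policy are illustrative assumptions:

```python
# Sketch: group examples of similar length, then split each group into
# sub-batches under a token budget so padding and activation memory stay bounded.
from typing import List, Sequence

def length_grouped_subbatches(lengths: Sequence[int], group_size: int,
                              max_tokens_per_subbatch: int) -> List[List[int]]:
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    sub_batches = []
    for group in groups:
        current, budget = [], 0
        for idx in group:
            if current and budget + lengths[idx] > max_tokens_per_subbatch:
                sub_batches.append(current)
                current, budget = [], 0
            current.append(idx)
            budget += lengths[idx]
        if current:
            sub_batches.append(current)
    return sub_batches
```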
3. Multilingual and Multi-Granular Capabilities
The model was pre-trained on approximately 1.2 billion unsupervised text pairs and further fine-tuned on English, Chinese, and multilingual labeled retrieval datasets. The data spans Wikipedia, mC4, CC-News, S2ORC, and other corpora across many languages, supplemented with parallel corpora and synthetic question-answer pairs (Chen et al., 5 Feb 2024, Alsubhi et al., 1 Jun 2025):
- Multilinguality: All languages are mapped into a unified semantic space by sharing the Transformer encoder weights and retrieval heads, ensuring robust cross-lingual alignment.
- Multi-granularity: BGE-M3 encodes sentences, paragraphs, and full documents, supporting both coarse-grained (e.g., document) and fine-grained (e.g., phrase/sentence) semantic search.
- Long-text support: Inputs up to 8,192 tokens are handled, with the inference-time option of inserting additional [CLS] tokens at regular intervals for MCLS ("multiple CLS") pooling.
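A sketch of the MCLS idea follows; the insertion stride and pooling details are illustrative assumptions rather than the exact released configuration:

```python
# Sketch of MCLS pooling: insert an extra [CLS] token every `stride` content
# tokens at inference time and average the hidden states at those positions.
import numpy as np

def insert_mcls(token_ids, cls_id, stride=256):
    out, cls_positions = [], []
    for i, tok in enumerate(token_ids):
        if i % stride == 0:
            cls_positions.append(len(out))
            out.append(cls_id)
        out.append(tok)
    return out, cls_positions

def mcls_pool(hidden_states: np.ndarray, cls_positions) -> np.ndarray:
    # hidden_states: (seq_len, hidden) output of the encoder on the MCLS input.
    pooled = hidden_states[cls_positions].mean(axis=0)
    return pooled / np.linalg.norm(pooled)
```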
Self-knowledge distillation across input granularities enforces representational consistency regardless of input span, shown to be particularly beneficial when handling morphologically rich languages such as Arabic (Alsubhi et al., 1 Jun 2025).
4. Integration into Retrieval-Augmented Generation Pipelines
BGE-M3 serves as a core embedding backbone in modern RAG systems. The standard procedure involves:
- Chunking: Source documents are segmented (often with sentence-aware chunking) to preserve semantic boundaries.
- Indexing: Each chunk is embedded via a BGE-M3 forward pass; embeddings are stored in a FAISS index (cosine similarity, 10K vectors/shard typical).
- Retrieval: At query time, the user input is encoded and the top-$k$ nearest neighbors are retrieved.
- Reranking (optional): Retrieved candidates can be reranked by downstream cross-encoders (e.g., bge-reranker-v2-m3).
- Answer generation: Final contexts are provided to generation models (e.g., Aya-8B, StableLM) (Alsubhi et al., 1 Jun 2025, Yang et al., 8 Jan 2025).
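The chunking and indexing stages can be sketched as follows, assuming the FlagEmbedding interface for BGE-M3 and a FAISS inner-product index over L2-normalized vectors (cosine similarity); the sentence splitter, chunk budget, and file name are illustrative, not the exact settings of the cited studies:

```python
# Sketch of the indexing stage: sentence-aware chunking -> BGE-M3 dense
# embeddings -> FAISS index with cosine similarity.
import faiss
import numpy as np
from FlagEmbedding import BGEM3FlagModel

def sentence_chunks(text: str, max_words: int = 50):
    # Naive sentence-aware chunking: pack whole sentences up to a rough budget.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        candidate = (current + " " + s).strip()
        if current and len(candidate.split()) > max_words:
            chunks.append(current)
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
chunks = sentence_chunks(open("corpus.txt", encoding="utf-8").read())  # hypothetical corpus file
emb = np.asarray(model.encode(chunks)["dense_vecs"], dtype="float32")
faiss.normalize_L2(emb)                      # cosine similarity via inner product
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)
```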
A summary of this pipeline is shown below:
| Stage | Key Component | Setting/Detail |
|---|---|---|
| Chunking | Sentence-aware | Full sentences, avg. ~30 tokens/chunk |
| Index | FAISS (cosine/inner-product) | 10K vectors/shard |
| Retrieval | Top-$k$ (bi-encoder) | $k$ nearest neighbors per query |
| Reranking | bge-reranker-v2-m3 (optional) | Rescore top-10 only |
| Generation | LLM (Aya-8B, StableLM, etc.) | Uses reranked contexts |
Sentence-aware chunking is empirically optimal for Arabic, maximizing semantic integrity and retrieval robustness (Alsubhi et al., 1 Jun 2025).
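Continuing the sketch above, query-time retrieval and optional reranking might look as follows; the FlagReranker interface is assumed from the same FlagEmbedding library, and $k$ and the rescoring depth are illustrative choices:

```python
# Sketch of query-time retrieval and optional cross-encoder reranking; reuses
# `model`, `index`, and `chunks` from the indexing sketch above.
import faiss
import numpy as np
from FlagEmbedding import FlagReranker

def retrieve(query: str, model, index, chunks, k: int = 10):
    q = np.asarray(model.encode([query])["dense_vecs"], dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [(chunks[i], float(s)) for i, s in zip(ids[0], scores[0])]

def rerank(query: str, candidates, reranker):
    # Rescore (query, chunk) pairs with the cross-encoder and sort descending.
    scores = reranker.compute_score([[query, text] for text, _ in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [(text, float(s)) for (text, _), s in ranked]

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)
question = "ما هي عاصمة المغرب؟"            # illustrative Arabic query
hits = retrieve(question, model, index, chunks, k=10)
contexts = rerank(question, hits, reranker)
# The top few reranked chunks would then be passed as context to the generator.
```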
5. Empirical Performance and Benchmarking
BGE-M3 consistently achieves state-of-the-art retrieval precision, recall, and downstream answer quality across monolingual, cross-lingual, and long-context retrieval. In the context of RAG for Arabic, it outperforms or matches leading multilingual models such as multilingual-e5-large:
- Mean RAGAS Score (six Arabic datasets): BGE-M3: 70.99, multilingual-e5-large: 70.31; best among all baselines (Alsubhi et al., 1 Jun 2025).
- Precision/Recall (mean): BGE-M3 precision ≈ 74.2, recall ≈ 71.8, representing relative gains over multilingual-e5-large.
- Faithfulness/Relevancy: BGE-M3 leads with +1.5 points in faithfulness and +0.8 in relevancy.
- Per-dataset wins: Notable on retrieval-heavy and high-inflection datasets (e.g., ARCD, Quran Tafseer).
On MIRACL (nDCG@10, 18 languages), M3-dense yields 67.8 (vs. E5-large 65.4); combining all retrieval heads achieves 70.0. Cross-lingual benchmarks, such as MKQA (Recall@100), also show substantial improvements (Chen et al., 5 Feb 2024).
| Model | Mean Score (Arabic 6-dataset, RAGAS) |
|---|---|
| Snowflake-arctic-embed... | 69.48 |
| Arabic-mpnet-base-all-nli | 45.92 |
| Arabic-Triplet-Matryoshka | 66.46 |
| gte-multilingual-base | 68.48 |
| multilingual-e5-large | 70.31 |
| BGE-M3 | 70.99 |
Table: Summary of RAGAS scores for Arabic RAG pipeline (Alsubhi et al., 1 Jun 2025).
6. Practical Deployment, Recommendations, and Limitations
Deployment targets high-throughput, sub-second first-stage retrieval at scale with the dense and sparse heads, with cross-encoder reranking reserved for latency-tolerant passes (e.g., rescoring the top 200 candidates). The core model is ~550M parameters; inference for a 512-token query typically requires 5–10 ms on an A100 (Chen et al., 5 Feb 2024).
Practical recommendations for maximizing BGE-M3 effectiveness:
- Use sentence-aware chunking for Arabic and languages with variable word order or morphology.
- Index embeddings in FAISS with cosine similarity; default to top-$k$ dense retrieval.
- Employ reranking (bge-reranker-v2-m3) where faithfulness or cross-paragraph reasoning is required.
- For long-document domains, hybrid chunking strategies (mixing sentence and semantic boundaries) may further boost performance (Alsubhi et al., 1 Jun 2025).
Limitations and outstanding issues:
- Domain adaptation (fine-tuning on legal, medical, or domain-specific corpora) is a suggested avenue for further improvement.
- All Arabic experiments are on Modern Standard Arabic; performance on dialect variants remains unevaluated.
- RAGAS provides automated faithfulness/relevancy scoring; critical applications warrant complementary manual validation.
- Some index-level hyperparameters and model-internal details (e.g., vocabulary size, tokenizer specifics) are not always provided in downstream RAG studies and should be referenced from the primary BGE-M3 release (Chen et al., 5 Feb 2024, Yang et al., 8 Jan 2025, Alsubhi et al., 1 Jun 2025).
7. Broader Impact and Applicability
As a unified, multilingual, multi-retrieval embedding backbone, BGE-M3 is central to retrieval-based LLM applications. Its architecture enables:
- Flexible retrieval across IR and QA scenarios, spanning short queries to ultra-long passages.
- Dense, sparse, and hybrid retrieval at scale, improving recall and answer faithfulness in knowledge-intensive pipelines.
- High accuracy in both resource-rich and low-resource language settings, enhancing information access equity (Chen et al., 5 Feb 2024, Alsubhi et al., 1 Jun 2025).
BGE-M3 is increasingly adopted both in open-domain information retrieval and in domain-specific RAG architectures, including knowledge bases that require data privacy (local operation without uploading queries), as exemplified by recent legal and finance RAG applications (Yang et al., 8 Jan 2025).
In summary, BGE-M3 represents an evolution in multilingual embedding modelling, distinguished by its multi-functionality and training via self-knowledge distillation, yielding strong retrieval performance across retrieval paradigms, domains, and languages.