Llama-Embed-Nemotron-8B: Multilingual Embedding Model

Updated 11 November 2025
  • Llama-Embed-Nemotron-8B is an open-weight, instruction-aware multilingual text embedding model designed for retrieval, classification, and semantic similarity tasks.
  • It employs full bi-directional attention with a global average pooling mechanism to generate fixed-dimensional embeddings from tokenized input sequences.
  • Extensive training on 16.1M query-document pairs across 250+ languages, combined with model merging and contrastive learning, drives its state-of-the-art MMTEB performance.

Llama-Embed-Nemotron-8B is an open-weights, instruction-aware multilingual text embedding model based on the Llama-3.1-8B architecture. Designed as a universal solution for retrieval, classification, and semantic textual similarity (STS), it demonstrates state-of-the-art (SOTA) performance across multilingual and cross-lingual settings, as validated by its leading position on the Multilingual Massive Text Embedding Benchmark (MMTEB) as of October 2025. The training process leverages a novel combination of public and synthetically generated data, extensive ablation studies, and model merging strategies to maximize both accuracy and generalizability across over 250 languages.

1. Model Architecture and Embedding Mechanism

The model initializes from Llama-3.1-8B, comprising 32 decoder-only Transformer layers with hidden size $d_{\text{model}} = 4096$, 32 attention heads, and a feed-forward dimension of 16,384. Departing from the original causal mask, every layer adopts full bi-directional attention to facilitate optimal information propagation for embedding tasks.

Input sequences $S$ are tokenized and passed through the network, yielding final hidden states $H \in \mathbb{R}^{L \times d_{\text{model}}}$. A global average pooling scheme is applied:

$$v = \frac{1}{L} \sum_{i=1}^{L} H_i \in \mathbb{R}^{d_{\text{model}}}$$

where $v$ serves as the fixed-dimensional embedding vector for all downstream tasks.
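
As a concrete illustration of the pooling step, the following minimal PyTorch sketch computes the embedding for a padded batch (the attention-mask handling is an implementation detail that the formula above abstracts away):

import torch

def average_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, L, d_model); attention_mask: (batch, L), 1 for real tokens, 0 for padding.
    mask = attention_mask.unsqueeze(-1).type_as(hidden_states)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts  # (batch, d_model): the embedding v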

The model is instruction-aware: each input is explicitly prefixed:

Instruct: {task_instruction}
Query: {T}
This allows user-defined task specialization at inference time. A shared encoder is employed, with bi-encoder usage (separate query and document encodings) for retrieval tasks and uni-encoder operation for STS and classification.
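
A minimal sketch of the prompt construction, assuming (this is not specified verbatim in the text) that document-side inputs carry no instruction prefix in the bi-encoder retrieval setup:

def format_query(task_instruction: str, text: str) -> str:
    # Instruction-aware prefix applied at inference time (see template above).
    return f"Instruct: {task_instruction}\nQuery: {text}"

def format_document(text: str) -> str:
    # Assumption: documents are embedded without an instruction prefix for retrieval.
    return text

# Uni-encoder STS/classification: both inputs pass through format_query with the same instruction.
pair = [format_query("Judge semantic similarity", s) for s in ("A cat sat.", "A feline was sitting.")]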

2. Training Data Curation and Preprocessing

The training corpus consists of 16.1 million query-document (Q–D) pairs, partitioned as follows:

Subset | Source / Use-case | Volume (M Q–D pairs)
Non-synthetic | Nemotron-CC-v2 Diverse-QA (pretraining); public corpora for fine-tuning (MIRACL, HotpotQA, MS MARCO, NQ, SQuAD, etc.) | 7.7
Synthetic | Synthetic data generation with open-weight LLMs (pretraining and fine-tuning for retrieval, classification, STS, bitext) | 8.4

Synthetic data generation uses LLMs such as gpt-oss-20b, gpt-oss-120b, Mixtral-8x22B-Instruct, Llama-3.3-70B-Instruct, Llama-4-Scout-17B, and Llama-4-Maverick-17B. Strategies include end-to-end triplet creation, seed-corpus query generation with hard negative retrieval, back-translation/multilingual dataset augmentation, and retrieval-augmented generation for bitext.

Preprocessing standardizes the data with the Llama BPE tokenizer, sequence truncation to a maximum of 512 tokens, Unicode normalization, deduplication, and explicit balancing across 250+ languages via per-language quotas.
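
An illustrative sketch of the deduplication and per-language balancing steps (quota values and record fields here are assumptions for illustration, not the paper's actual pipeline):

import random
from collections import defaultdict

def dedupe_and_balance(pairs, per_language_quota=50_000, seed=0):
    # pairs: iterable of dicts like {"lang": "de", "query": ..., "doc": ...}
    rng = random.Random(seed)
    seen, by_lang = set(), defaultdict(list)
    for pair in pairs:
        key = (pair["query"], pair["doc"])
        if key in seen:
            continue                      # drop exact duplicates
        seen.add(key)
        by_lang[pair["lang"]].append(pair)
    balanced = []
    for lang, items in by_lang.items():
        rng.shuffle(items)
        balanced.extend(items[:per_language_quota])   # cap each language at the quota
    return balanced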

3. Contrastive Objective, Negative Sampling, and Loss Ablation

Llama-Embed-Nemotron-8B employs the InfoNCE contrastive objective:

$$\mathcal{L}(q, d^+, D_N) = -\log \frac{\exp(\operatorname{sim}(q, d^+)/\tau)}{\sum_{d_i \in \{d^+\} \cup D_N} \exp(\operatorname{sim}(q, d_i)/\tau)}$$

where $\operatorname{sim}(\cdot,\cdot)$ denotes cosine similarity and $\tau = 0.02$.
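
A minimal PyTorch sketch of this objective for a batch with per-query hard negatives (shapes and batching are illustrative; this is not the paper's training code):

import torch
import torch.nn.functional as F

def info_nce(q, d_pos, d_neg, tau=0.02):
    # q: (B, d) query embeddings; d_pos: (B, d) positives; d_neg: (B, K, d) hard negatives.
    pos = F.cosine_similarity(q, d_pos, dim=-1).unsqueeze(1)          # (B, 1)
    neg = F.cosine_similarity(q.unsqueeze(1), d_neg, dim=-1)          # (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau                       # (B, 1 + K)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive sits at index 0
    return F.cross_entropy(logits, labels)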

Negative sampling distinguishes pretraining (one hard negative per Q–D pair) from fine-tuning (four hard negatives per Q–D pair). Hard negatives are mined using an ensemble (e5-mistral-7B, Qwen3-Embedding-8B) and a "top-k with threshold" rule: $\operatorname{sim}(q, d^-) < 0.95 \cdot \operatorname{sim}(q, d^+)$.
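
The "top-k with threshold" rule can be sketched as follows (the ensemble scoring and corpus search are abstracted away; candidates are assumed to be pre-scored and sorted):

def mine_hard_negatives(sim_q_pos, candidates, k=4, margin=0.95):
    # candidates: list of (doc_id, sim_to_query), sorted by similarity, descending.
    # Keep up to k candidates whose similarity stays below margin * sim(q, d+),
    # filtering out likely false negatives that score too close to the positive.
    hard_negatives = []
    for doc_id, sim in candidates:
        if sim < margin * sim_q_pos:
            hard_negatives.append(doc_id)
        if len(hard_negatives) == k:
            break
    return hard_negatives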

Comparative ablation with alternative loss configurations (Gecko: hard + in-batch + same-tower; Qwen3-Embedding: same-tower; Gemini: hard + in-batch positive) demonstrates that the hard-negatives-only configuration achieves the best overall performance (see table below).

InfoNCE Loss Variant Ablation

Variant | Borda Votes | Mean(Task) | Mean(Type)
Gecko | 37,903 | 63.45 | 55.86
Qwen3-Embedding | 36,835 | 62.14 | 55.49
Gemini | 38,135 | 63.83 | 55.90
Ours (HNs only) | 38,225 | 64.03 | 56.04

4. Synthetic Data Generation Approaches and Multilingual Strategy

Synthetic data generation is central to model generalization, notably for low-resource languages. The primary strategies encompass:

  1. End-to-end triplet generation from scratch.
  2. Seed-document → query → hard-negative mining (sketched below).
  3. Back-translation for enforcing cross-lingual and multilingual diversity.
  4. Retrieval-augmented generation tailored for bitext use-cases.

High-resource datasets are machine-translated by LLMs into target languages; per-language quotas are enforced in the synthetic mix for balanced representation.
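
To make strategy 2 concrete, the sketch below outlines seed-document query generation followed by hard-negative mining; llm_generate and retrieve_top_k are hypothetical placeholders, not interfaces from the paper:

def synthesize_retrieval_triplets(seed_docs, llm_generate, retrieve_top_k, k_neg=4):
    # For each seed document, prompt an LLM for a plausible query, then mine hard
    # negatives from the corpus with an existing embedding-based retriever.
    triplets = []
    for doc in seed_docs:
        query = llm_generate(f"Write a search query that this passage answers:\n{doc}")
        candidates = retrieve_top_k(query, k=100)                # [(doc_text, similarity), ...]
        negatives = [d for d, _ in candidates if d != doc][:k_neg]
        triplets.append({"query": query, "positive": doc, "negatives": negatives})
    return triplets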

Ablation shows that mixing synthetic data from multiple generator models yields better MMTEB metrics than exclusive reliance on a single source. Qualitative human inspection is used to assess the factuality and diversity of the generated data.

SDG Source | Borda Votes | Mean(Task) | Classification | Clustering
No synthetic | 37,348 | 61.95 | 62.16 | 49.94
gpt-oss-20b | 37,732 | 62.54 | 63.71 | 50.45
Mix of all models | 37,812 | 62.89 | 64.39 | 50.95

Empirical results on five classification benchmarks demonstrate that combining 1M synthetic samples with in-domain training data significantly boosts accuracy, especially in non-English and low-resource settings.

5. Model Merging and Instruction Prompting

Six checkpoints $\{W_i\}_{i=1}^{6}$, each corresponding to a different data regime or hyperparameter configuration, are merged with equal weights:

$$W_{\text{merge}} = \frac{1}{6} \sum_{i=1}^{6} W_i$$

This approach, termed "model soups", leads to a +119 Borda vote improvement over the best single checkpoint (39,573 vs 39,454; Mean(Task): 69.46 vs 68.62 on MMTEB).
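
A minimal sketch of the equal-weight merge, assuming the six checkpoints share an identical architecture (checkpoint paths are placeholders):

import torch

def merge_checkpoints(paths):
    # Average each parameter tensor across checkpoints ("model soup" with equal weights).
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# merged = merge_checkpoints([f"checkpoint_{i}.pt" for i in range(1, 7)])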

Instruction encoding is applied purely through runtime prompt manipulation: user-supplied task-instruction strings are prepended, requiring no further weight updates. Embeddings adapt dynamically to the instruction context because the instruction tokens participate in the shared encoder's attention.

6. Benchmark Results and Empirical Evaluation

On MMTEB as of October 21, 2025, Llama-Embed-Nemotron-8B ranks first on Borda score:

Model | Borda Rank | Borda Votes | Mean(Task) | Mean(Type)
llama-embed-nemotron-8b | 1 | 39,573 | 69.46 | 61.09
gemini-embedding-001 | 2 | 39,368 | 68.37 | 59.59
Qwen3-Embedding-8B | 3 | 39,364 | 70.58 | 61.69

Per-type metric evaluations employ Cosine MRR and Recall@10 for retrieval, accuracy for classification, and Pearson/Spearman correlation for STS. The model maintains robust performance in both high- and low-resource language tasks, and in cross-lingual transfer setups.
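
For reference, the retrieval metrics can be computed as in the generic sketch below (assuming one relevant document per query; this is not the benchmark's official scorer):

import numpy as np

def mrr_and_recall_at_k(scores, relevant_idx, k=10):
    # scores: (num_queries, num_docs) cosine similarities;
    # relevant_idx: (num_queries,) index of the relevant document for each query.
    ranking = np.argsort(-scores, axis=1)                            # best-first document order
    ranks = np.argmax(ranking == relevant_idx[:, None], axis=1) + 1  # 1-based rank of the relevant doc
    return float(np.mean(1.0 / ranks)), float(np.mean(ranks <= k))   # (MRR, Recall@k)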

Consistent advantages are observed over both open- and closed-weight baselines, with stability across diverse language tasks attributed to the data composition, effective negative sampling, and model merging strategies.

7. Practical Deployment and Usage

Inference proceeds at approximately 1 ms per 512-token sequence on an A100 80 GB GPU, with peak memory consumption near 20 GB for batch size 128. A single A100 or V100 is sufficient for typical inference; multi-GPU scaling is applicable for large-scale retrieval scenarios.

Typical usage for embedding extraction employs the HuggingFace Transformers interface:

from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("nvidia/llama-embed-nemotron-8b")
model = AutoModel.from_pretrained("nvidia/llama-embed-nemotron-8b", trust_remote_code=True)
model.eval()

def embed(texts, instr="Retrieve relevant documents"):
    # Instruction-aware prefix, as described in Section 1.
    inputs = tok([f"Instruct: {instr}\nQuery: {t}" for t in texts],
                 return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        H = model(**inputs).last_hidden_state              # (batch, L, d_model)
    # Global average pooling over real tokens only (exclude padding).
    mask = inputs["attention_mask"].unsqueeze(-1).type_as(H)
    v = (H * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return torch.nn.functional.normalize(v, dim=1)

embs = embed(["What is RAG?"], instr="Retrieve semantically similar text")

Recommended best practices include input truncation/chunking for sequences longer than 512 tokens, use of mean-pooling, embedding normalization, and indexing with vector databases such as FAISS. Downstream adaptation can be performed by attaching learned MLP heads to pooled embeddings for task-specific refinement.
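
A minimal FAISS indexing sketch for normalized embeddings (inner product on unit vectors equals cosine similarity); the flat index and random data are illustrative placeholders:

import faiss
import numpy as np

embs = np.random.rand(1000, 4096).astype("float32")    # stand-in for document embeddings
faiss.normalize_L2(embs)

index = faiss.IndexFlatIP(embs.shape[1])               # exact inner-product search
index.add(embs)

query = np.random.rand(1, 4096).astype("float32")      # stand-in for a query embedding
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)                  # top-10 most similar documents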


Llama-Embed-Nemotron-8B's open methodology, comprehensive ablation, and instruction-aware interface establish it as a high-performance, flexible backbone for state-of-the-art multilingual retrieval, classification, and semantic similarity applications (Babakhin et al., 10 Nov 2025).
