Llama-Embed-Nemotron-8B: Multilingual Embedding Model

Updated 11 November 2025
  • Llama-Embed-Nemotron-8B is an open-weight, instruction-aware multilingual text embedding model designed for retrieval, classification, and semantic similarity tasks.
  • It employs full bi-directional attention with a global average pooling mechanism to generate fixed-dimensional embeddings from tokenized input sequences.
  • Extensive training on 16.1M query-document pairs across 250+ languages, combined with model merging and contrastive learning, drives its state-of-the-art MMTEB performance.

Llama-Embed-Nemotron-8B is an open-weights, instruction-aware multilingual text embedding model based on the Llama-3.1-8B architecture. Designed as a universal solution for retrieval, classification, and semantic textual similarity (STS), it demonstrates state-of-the-art (SOTA) performance across multilingual and cross-lingual settings, as validated by its leading position on the Multilingual Massive Text Embedding Benchmark (MMTEB) as of October 2025. The training process leverages a novel combination of public and synthetically generated data, extensive ablation studies, and model merging strategies to maximize both accuracy and generalizability across over 250 languages.

1. Model Architecture and Embedding Mechanism

The model initializes from Llama-3.1-8B, comprising 32 decoder-only Transformer layers with hidden size $d_{\text{model}} = 4096$, 32 attention heads, and a feed-forward dimension of 16,384. Departing from the original causal mask, every layer adopts full bi-directional attention to facilitate optimal information propagation for embedding tasks.

Input sequences $S$ are tokenized and passed through the network, yielding final hidden states $H \in \mathbb{R}^{L \times d_{\text{model}}}$. A global average pooling scheme is applied:

$$v = \frac{1}{L} \sum_{i=1}^{L} H_i \in \mathbb{R}^{d_{\text{model}}}$$

where $v$ serves as the fixed-dimensional embedding vector for all downstream tasks.
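
As a concrete illustration of the pooling step, the following minimal PyTorch sketch computes the embedding for a padded batch (the attention-mask handling is an implementation detail that the formula above abstracts away):

import torch

def average_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, L, d_model); attention_mask: (batch, L), 1 for real tokens, 0 for padding.
    mask = attention_mask.unsqueeze(-1).type_as(hidden_states)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts  # (batch, d_model): the embedding v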

The model is instruction-aware: each input is explicitly prefixed:

Instruct: {task_instruction}
Query: {T}
This allows user-defined task specialization at inference time. A shared encoder is employed, with bi-encoder usage (separate query and document encodings) for retrieval tasks and uni-encoder operation for STS and classification.
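
A minimal sketch of the prompt construction, assuming (this is not specified verbatim in the text) that document-side inputs carry no instruction prefix in the bi-encoder retrieval setup:

def format_query(task_instruction: str, text: str) -> str:
    # Instruction-aware prefix applied at inference time (see template above).
    return f"Instruct: {task_instruction}\nQuery: {text}"

def format_document(text: str) -> str:
    # Assumption: documents are embedded without an instruction prefix for retrieval.
    return text

# Uni-encoder STS/classification: both inputs pass through format_query with the same instruction.
pair = [format_query("Judge semantic similarity", s) for s in ("A cat sat.", "A feline was sitting.")]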

2. Training Data Curation and Preprocessing

The training corpus consists of 16.1 million query-document (Q–D) pairs, partitioned as follows:

Subset | Source / Use-case | Volume (M Q–D pairs)
Non-synthetic | Nemotron-CC-v2 Diverse-QA (pretraining); public corpora for fine-tuning (MIRACL, HotpotQA, MS MARCO, NQ, SQuAD, etc.) | 7.7
Synthetic | Synthetic data generation with open-weight LLMs (pretraining and fine-tuning for retrieval, classification, STS, bitext) | 8.4

Synthetic data generation uses LLMs such as gpt-oss-20b, gpt-oss-120b, Mixtral-8x22B-Instruct, Llama-3.3-70B-Instruct, Llama-4-Scout-17B, and Llama-4-Maverick-17B. Strategies include end-to-end triplet creation, seed-corpus query generation with hard negative retrieval, back-translation/multilingual dataset augmentation, and retrieval-augmented generation for bitext.

Preprocessing standardizes the data with the Llama BPE tokenizer, sequence truncation to a maximum of 512 tokens, Unicode normalization, deduplication, and explicit balancing across 250+ languages via per-language quotas.
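
An illustrative sketch of the deduplication and per-language balancing steps (quota values and record fields here are assumptions for illustration, not the paper's actual pipeline):

import random
from collections import defaultdict

def dedupe_and_balance(pairs, per_language_quota=50_000, seed=0):
    # pairs: iterable of dicts like {"lang": "de", "query": ..., "doc": ...}
    rng = random.Random(seed)
    seen, by_lang = set(), defaultdict(list)
    for pair in pairs:
        key = (pair["query"], pair["doc"])
        if key in seen:
            continue                      # drop exact duplicates
        seen.add(key)
        by_lang[pair["lang"]].append(pair)
    balanced = []
    for lang, items in by_lang.items():
        rng.shuffle(items)
        balanced.extend(items[:per_language_quota])   # cap each language at the quota
    return balanced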

3. Contrastive Objective, Negative Sampling, and Loss Ablation

Llama-Embed-Nemotron-8B employs the InfoNCE contrastive objective:

$$\mathcal{L}(q, d^+, D_N) = -\log \frac{\exp(\operatorname{sim}(q, d^+)/\tau)}{\sum_{d_i \in \{d^+\} \cup D_N} \exp(\operatorname{sim}(q, d_i)/\tau)}$$

where $\operatorname{sim}(\cdot,\cdot)$ denotes cosine similarity and $\tau = 0.02$.
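
A minimal PyTorch sketch of this objective for a batch with per-query hard negatives (shapes and batching are illustrative; this is not the paper's training code):

import torch
import torch.nn.functional as F

def info_nce(q, d_pos, d_neg, tau=0.02):
    # q: (B, d) query embeddings; d_pos: (B, d) positives; d_neg: (B, K, d) hard negatives.
    pos = F.cosine_similarity(q, d_pos, dim=-1).unsqueeze(1)          # (B, 1)
    neg = F.cosine_similarity(q.unsqueeze(1), d_neg, dim=-1)          # (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau                       # (B, 1 + K)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive sits at index 0
    return F.cross_entropy(logits, labels)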

Negative sampling distinguishes pretraining (one hard negative per Q–D pair) from fine-tuning (four hard negatives per Q–D pair). Hard negatives are mined using an ensemble (e5-mistral-7B, Qwen3-Embedding-8B) and a "top-k with threshold" rule: $\operatorname{sim}(q, d^-) < 0.95 \cdot \operatorname{sim}(q, d^+)$.
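
The "top-k with threshold" rule can be sketched as follows (the ensemble scoring and corpus search are abstracted away; candidates are assumed to be pre-scored and sorted):

def mine_hard_negatives(sim_q_pos, candidates, k=4, margin=0.95):
    # candidates: list of (doc_id, sim_to_query), sorted by similarity, descending.
    # Keep up to k candidates whose similarity stays below margin * sim(q, d+),
    # filtering out likely false negatives that score too close to the positive.
    hard_negatives = []
    for doc_id, sim in candidates:
        if sim < margin * sim_q_pos:
            hard_negatives.append(doc_id)
        if len(hard_negatives) == k:
            break
    return hard_negatives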

Comparative ablation with alternative loss configurations (Gecko: hard + in-batch + same-tower; Qwen3-Embedding: same-tower; Gemini: hard + in-batch positive) demonstrates that the hard-negatives-only configuration achieves the best overall performance (see table below).

InfoNCE Loss Variant Ablation

Variant | Borda Votes | Mean(Task) | Mean(Type)
Gecko | 37,903 | 63.45 | 55.86
Qwen3-Embedding | 36,835 | 62.14 | 55.49
Gemini | 38,135 | 63.83 | 55.90
Ours (HNs only) | 38,225 | 64.03 | 56.04

4. Synthetic Data Generation Approaches and Multilingual Strategy

Synthetic data generation is central to model generalization, notably for low-resource languages. The primary strategies encompass:

  1. End-to-end triplet generation from scratch.
  2. Seed-document → query → hard-negative mining (sketched below).
  3. Back-translation for enforcing cross-lingual and multilingual diversity.
  4. Retrieval-augmented generation tailored for bitext use-cases.

High-resource datasets are machine-translated by LLMs into target languages; per-language quotas are enforced in the synthetic mix for balanced representation.
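
To make strategy 2 concrete, the sketch below outlines seed-document query generation followed by hard-negative mining; llm_generate and retrieve_top_k are hypothetical placeholders, not interfaces from the paper:

def synthesize_retrieval_triplets(seed_docs, llm_generate, retrieve_top_k, k_neg=4):
    # For each seed document, prompt an LLM for a plausible query, then mine hard
    # negatives from the corpus with an existing embedding-based retriever.
    triplets = []
    for doc in seed_docs:
        query = llm_generate(f"Write a search query that this passage answers:\n{doc}")
        candidates = retrieve_top_k(query, k=100)                # [(doc_text, similarity), ...]
        negatives = [d for d, _ in candidates if d != doc][:k_neg]
        triplets.append({"query": query, "positive": doc, "negatives": negatives})
    return triplets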

Ablation shows that mixing synthetic data from multiple generator models yields better MMTEB metrics than exclusive reliance on a single source. Qualitative human inspection is used to assess the factuality and diversity of the generated data.

SDG Source | Borda Votes | Mean(Task) | Classification | Clustering
No synthetic | 37,348 | 61.95 | 62.16 | 49.94
gpt-oss-20b | 37,732 | 62.54 | 63.71 | 50.45
Mix of all models | 37,812 | 62.89 | 64.39 | 50.95

Empirical results on five classification benchmarks demonstrate that combining 1M synthetic samples with in-domain training data significantly boosts accuracy, especially in non-English and low-resource settings.

5. Model Merging and Instruction Prompting

Six checkpoints $\{W_i\}_{i=1}^{6}$, each corresponding to a different data regime or hyperparameter configuration, are merged with equal weights:

$$W_{\text{merge}} = \frac{1}{6} \sum_{i=1}^{6} W_i$$

This approach, termed "model soups", leads to a +119 Borda vote improvement over the best single checkpoint (39,573 vs 39,454; Mean(Task): 69.46 vs 68.62 on MMTEB).
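
A minimal sketch of the equal-weight merge, assuming the six checkpoints share an identical architecture (checkpoint paths are placeholders):

import torch

def merge_checkpoints(paths):
    # Average each parameter tensor across checkpoints ("model soup" with equal weights).
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# merged = merge_checkpoints([f"checkpoint_{i}.pt" for i in range(1, 7)])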

Instruction encoding is applied purely through runtime prompt manipulation: user-supplied task-instruction strings are prepended, requiring no further weight updates. Embeddings adapt dynamically to the instruction context because the instruction tokens participate in the shared encoder's attention.

6. Benchmark Results and Empirical Evaluation

On MMTEB as of October 21, 2025, Llama-Embed-Nemotron-8B ranks first on Borda score:

Model | Borda Rank | Borda Votes | Mean(Task) | Mean(Type)
llama-embed-nemotron-8b | 1 | 39,573 | 69.46 | 61.09
gemini-embedding-001 | 2 | 39,368 | 68.37 | 59.59
Qwen3-Embedding-8B | 3 | 39,364 | 70.58 | 61.69

Per-type metric evaluations employ Cosine MRR and Recall@10 for retrieval, accuracy for classification, and Pearson/Spearman correlation for STS. The model maintains robust performance in both high- and low-resource language tasks, and in cross-lingual transfer setups.
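
For reference, the retrieval metrics can be computed as in the generic sketch below (assuming one relevant document per query; this is not the benchmark's official scorer):

import numpy as np

def mrr_and_recall_at_k(scores, relevant_idx, k=10):
    # scores: (num_queries, num_docs) cosine similarities;
    # relevant_idx: (num_queries,) index of the relevant document for each query.
    ranking = np.argsort(-scores, axis=1)                            # best-first document order
    ranks = np.argmax(ranking == relevant_idx[:, None], axis=1) + 1  # 1-based rank of the relevant doc
    return float(np.mean(1.0 / ranks)), float(np.mean(ranks <= k))   # (MRR, Recall@k)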

Consistent advantages are observed over both open- and closed-weight baselines, with stability across diverse language tasks attributed to the data composition, effective negative sampling, and model merging strategies.

7. Practical Deployment and Usage

Inference proceeds at approximately 1 ms per 512-token sequence on an A100 80 GB GPU, with peak memory consumption near 20 GB for batch size 128. A single A100 or V100 is sufficient for typical inference; multi-GPU scaling is applicable for large-scale retrieval scenarios.

Typical usage for embedding extraction employs the HuggingFace Transformers interface:

from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("nvidia/llama-embed-nemotron-8b")
model = AutoModel.from_pretrained("nvidia/llama-embed-nemotron-8b", trust_remote_code=True)
model.eval()

def embed(texts, instr="Retrieve relevant documents"):
    # Instruction-aware prefix, as described in Section 1.
    inputs = tok([f"Instruct: {instr}\nQuery: {t}" for t in texts],
                 return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        H = model(**inputs).last_hidden_state              # (batch, L, d_model)
    # Global average pooling over real tokens only (exclude padding).
    mask = inputs["attention_mask"].unsqueeze(-1).type_as(H)
    v = (H * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return torch.nn.functional.normalize(v, dim=1)

embs = embed(["What is RAG?"], instr="Retrieve semantically similar text")

Recommended best practices include input truncation/chunking for sequences longer than 512 tokens, use of mean-pooling, embedding normalization, and indexing with vector databases such as FAISS. Downstream adaptation can be performed by attaching learned MLP heads to pooled embeddings for task-specific refinement.
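
A minimal FAISS indexing sketch for normalized embeddings (inner product on unit vectors equals cosine similarity); the flat index and random data are illustrative placeholders:

import faiss
import numpy as np

embs = np.random.rand(1000, 4096).astype("float32")    # stand-in for document embeddings
faiss.normalize_L2(embs)

index = faiss.IndexFlatIP(embs.shape[1])               # exact inner-product search
index.add(embs)

query = np.random.rand(1, 4096).astype("float32")      # stand-in for a query embedding
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)                  # top-10 most similar documents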


Llama-Embed-Nemotron-8B's open methodology, comprehensive ablation, and instruction-aware interface establish it as a high-performance, flexible backbone for state-of-the-art multilingual retrieval, classification, and semantic similarity applications (Babakhin et al., 10 Nov 2025).
