NetraEmbed: Multilingual Retrieval Model
- NetraEmbed is a multilingual multimodal retrieval model featuring a single 4B-parameter transformer that generates unified representations for both document images and text queries.
- It employs Matryoshka truncatable embeddings and a bi-encoder contrastive learning objective, achieving state-of-the-art performance across 22 languages.
- The model leverages synthetic multilingual training data and last-token pooling to enable efficient, scalable, and accurate document retrieval.
NetraEmbed is a single-vector, 4-billion-parameter multilingual multimodal retrieval model within the M3DR (Multilingual Multimodal Document Retrieval) framework designed for robust cross-lingual and cross-modal document retrieval at extreme scale. It operates by generating unified representations of both rendered document images and text queries, allowing efficient and accurate semantic retrieval across 22 diverse languages and multiple scripts. The model’s workflow leverages a synthetic multilingual document–query dataset and relies on a shared transformer architecture, contrastive learning, and “Matryoshka” truncatable vector embeddings to achieve state-of-the-art performance in both monolingual and cross-lingual scenarios (Kolavi et al., 3 Dec 2025).
1. Model Architecture
NetraEmbed employs the Gemma 3 4B-IT decoder-only vision–language transformer as its backbone. All input modalities—rendered document images and text queries—pass through identical transformer layers. Document images are first tokenized into approximately 256 visual patch embeddings, while text queries undergo standard tokenization. The transformer processes each sequence, and last-token pooling over the final hidden states yields a dense 2560-dimensional representation. Ablation studies demonstrate that last-token pooling surpasses mean pooling by over 13 NDCG@5 points on the ViDoRe v2 benchmark. The architecture supports “Matryoshka representations,” meaning the leading 768 and 1536 dimensions of the 2560-D embedding are independently valid sub-embeddings, permitting truncation without retraining for efficiency–accuracy trade-offs.
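As a concrete illustration of the pooling and truncation steps, the following PyTorch sketch shows last-token pooling over final hidden states and Matryoshka truncation to a leading sub-embedding; tensor names and shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the hidden state of the last non-padding token in each sequence.

    hidden_states: (batch, seq_len, 2560) final-layer outputs of the backbone.
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    last_idx = attention_mask.sum(dim=1) - 1                       # index of last real token
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]                      # (batch, 2560)

def matryoshka_truncate(embedding: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the leading `dim` coordinates (768, 1536, or 2560) and re-normalize."""
    return F.normalize(embedding[..., :dim], dim=-1)

# Example: pool, normalize, then truncate to the 768-D sub-embedding.
hidden = torch.randn(4, 256, 2560)            # e.g. ~256 visual patch tokens per page
mask = torch.ones(4, 256, dtype=torch.long)
emb_full = F.normalize(last_token_pool(hidden, mask), dim=-1)      # (4, 2560)
emb_768 = matryoshka_truncate(emb_full, 768)                       # (4, 768)
```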
2. Training Objectives
NetraEmbed is trained with a standard bi-encoder contrastive InfoNCE loss using in-batch negatives. Given a batch of paired queries and documents $\{(q_i, d_i)\}_{i=1}^{N}$, the loss is

$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\!\big(s(q_i, d_i)/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(s(q_i, d_j)/\tau\big)},$$

where $s(q, d)$ is the cosine similarity between the pooled query and document embeddings and $\tau$ is a temperature hyperparameter. The training objective is augmented for Matryoshka representation learning by summing the base loss over truncations to the first 768, 1536, and all 2560 dimensions with equal weighting:

$$\mathcal{L}_{\text{MRL}} = \sum_{k \in \{768,\, 1536,\, 2560\}} \mathcal{L}_{\text{InfoNCE}}^{(k)}.$$

Hard negatives, including nearby pages and candidates mined with BM25, CLIP, and multilingual text embeddings, were tested, but in-batch negatives alone yielded superior multilingual transfer stability.
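A minimal PyTorch sketch of this objective is given below, assuming L2-normalized embeddings and a temperature hyperparameter `tau` whose value is not specified in the source; it sketches the standard in-batch InfoNCE plus equal-weight Matryoshka summation, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, d: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: the i-th document is the positive for the i-th query."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.T / tau                        # (N, N) cosine similarities / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def matryoshka_info_nce(q: torch.Tensor, d: torch.Tensor,
                        dims=(768, 1536, 2560), tau: float = 0.05) -> torch.Tensor:
    """Equal-weight sum of the base loss over leading-dimension truncations."""
    return sum(info_nce(q[:, :k], d[:, :k], tau) for k in dims)

# Usage: q_emb, d_emb are (N, 2560) pooled embeddings from the shared backbone.
q_emb = torch.randn(8, 2560, requires_grad=True)
d_emb = torch.randn(8, 2560, requires_grad=True)
loss = matryoshka_info_nce(q_emb, d_emb)
loss.backward()
```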
3. Synthetic Multilingual Training Data
NetraEmbed is trained on a large, layout-aware synthetic corpus covering 22 typologically diverse languages spanning multiple scripts. The pipeline starts with approximately 50,000 English document images:
- Layout detection: DocLayout-YOLO and Docling extract semantic regions, tables, figures, and page geometry.
- Neural translation: Each region is translated using NLLB-200, augmented by specialized models for language groups such as Dravidian, preserving semantic and contextual fidelity.
- Typography and rendering: Pages are re-rendered at 1024–2048 px height using Google Noto Sans families and language-specific typographic rules, ensuring correct alignment of translated text and visual elements.
- Query synthesis: Five distinct queries per image per language are generated by Llama 3.1 90B Vision and Llama 4 Scout, covering factual, long-answer, multiple-choice, and cross-paragraph reasoning types. The result is roughly one million document–query pairs across languages.
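The orchestration of these four stages can be sketched schematically as follows; every stage function is passed in as a placeholder (standing in for DocLayout-YOLO/Docling, NLLB-200 plus the specialized translators, the Noto Sans renderer, and the Llama-based query generators), since the actual pipeline code is not part of this summary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class DocQueryPair:
    image_path: str   # re-rendered translated page (1024-2048 px height)
    query: str
    language: str

def build_synthetic_pairs(
    english_pages: Iterable[str],
    languages: List[str],
    detect_layout: Callable,      # placeholder for DocLayout-YOLO / Docling
    translate_region: Callable,   # placeholder for NLLB-200 (+ specialized models)
    render_page: Callable,        # placeholder for the Noto Sans re-renderer
    generate_queries: Callable,   # placeholder for the Llama-based query generators
    queries_per_page: int = 5,
) -> List[DocQueryPair]:
    """Schematic orchestration of the layout -> translate -> render -> query pipeline."""
    pairs: List[DocQueryPair] = []
    for page in english_pages:
        regions = detect_layout(page)                          # semantic regions + geometry
        for lang in languages:
            translated = [translate_region(r, lang) for r in regions]
            image_path = render_page(page, translated, lang)   # language-aware typography
            for query in generate_queries(image_path, lang, n=queries_per_page):
                pairs.append(DocQueryPair(image_path, query, lang))
    return pairs
```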
4. Embedding Paradigm and Retrieval Workflow
NetraEmbed produces a single 2560-dimensional dense embedding for both document images and queries; at inference, the vector can be truncated to 768 or 1536 dimensions (“Matryoshka truncation”) to trade off between computational efficiency and accuracy. Retrieval is performed using cosine similarity and efficient HNSW-based approximate nearest neighbor search, scaling to billion-document indices and delivering over 500 QPS throughput per GPU. The ColNetraEmbed variant follows a ColBERT-style multi-vector paradigm involving token-level interactions (MaxSim similarity), but it incurs significantly higher memory (~2.5 MB per document), roughly 10× slower query latency, and roughly 10 NDCG points lower accuracy on cross-lingual retrieval.
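A minimal single-vector retrieval sketch is shown below, using hnswlib as one possible HNSW backend (the source does not name a specific ANN library) and randomly generated stand-in embeddings truncated to 768 dimensions.

```python
import hnswlib
import numpy as np

DIM = 768  # Matryoshka-truncated embeddings; use 1536 or 2560 for higher accuracy

# doc_embs: (num_docs, 2560) float32 document-image embeddings (random stand-ins here),
# truncated to the leading DIM coordinates and L2-normalized.
doc_embs = np.random.randn(10_000, 2560).astype(np.float32)[:, :DIM]
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

# Build an HNSW index over cosine similarity (hnswlib distance = 1 - cosine).
index = hnswlib.Index(space="cosine", dim=DIM)
index.init_index(max_elements=doc_embs.shape[0], ef_construction=200, M=16)
index.add_items(doc_embs, np.arange(doc_embs.shape[0]))
index.set_ef(64)  # query-time search breadth vs. recall trade-off

# Query with a truncated, normalized query embedding.
q = np.random.randn(1, 2560).astype(np.float32)[:, :DIM]
q /= np.linalg.norm(q, axis=1, keepdims=True)
labels, distances = index.knn_query(q, k=10)
scores = 1.0 - distances  # convert back to cosine similarity
```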
5. Empirical Results, Ablations, and Language Generalization
NetraEmbed is evaluated on the Nayana-IR benchmark (23 datasets, ∼28,700 images, 5,400 queries) in both cross-lingual and monolingual settings, as well as on ViDoRe v2 (English):
- Cross-lingual NDCG@5: 0.716 (Recall@10 = 0.871, MAP@10 = 0.703, MRR@10 = 0.775)
- Monolingual NDCG@5: 0.738 (Recall@10 = 0.844, MAP@10 = 0.709, MRR@10 = 0.751)
- ViDoRe v2 (English) NDCG@5: 0.554 (Recall@10 = 0.637, MAP@10 = 0.437, MRR@10 = 0.647)
Baseline comparison: ColPali-v1.3 achieves 0.284 (cross-lingual) and 0.410 (monolingual) NDCG@5; Jina-embeddings-v4 peaks at 0.435. NetraEmbed thus delivers a 152% relative improvement in cross-lingual settings and 80% in monolingual over the best prior result.
Key ablations:
- Matryoshka truncation: Truncating to 768 dimensions retains 95% of full performance (NDCG@5 = 0.680); 1536 dimensions retains 98.6% (NDCG@5 = 0.706); see the storage sketch after this list.
- Pooling: Last-token pooling outperforms mean pooling by 13 NDCG points on ViDoRe.
- Language scaling: Increasing language coverage from 6 to 22 improves cross-lingual NDCG@5 from ~0.604 to 0.716 (a relative gain of roughly 18%).
- Loss selection: In-batch InfoNCE loss is both more stable and more effective on multilingual retrieval tasks than hybrid pairwise objectives.
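To make the storage side of the Matryoshka trade-off concrete, the short calculation below derives per-vector and billion-document raw storage for float32 embeddings at each truncation level; the byte figures are simple arithmetic, not numbers reported in the source, and they ignore HNSW graph overhead.

```python
# Illustrative storage arithmetic for the Matryoshka efficiency-accuracy trade-off.
BYTES_PER_FLOAT32 = 4
NUM_DOCS = 1_000_000_000  # billion-document index scale discussed above

for dim, ndcg_retained in [(768, "95%"), (1536, "98.6%"), (2560, "100%")]:
    per_doc = dim * BYTES_PER_FLOAT32           # bytes per document vector
    total_tb = per_doc * NUM_DOCS / 1e12        # raw vector storage in TB
    print(f"{dim:>4}-D: {per_doc:>6} B/doc, ~{total_tb:.1f} TB raw vectors, "
          f"{ndcg_retained} of full NDCG@5 retained")
```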
6. Language–Script Coverage and Semantic Alignment
NetraEmbed demonstrates consistent performance (0.65–0.78 NDCG@5) across Latin, Devanagari, Dravidian, CJK, Arabic, and other scripts, with baseline methods frequently falling below 0.30. PCA projections of model embeddings show convergence of initially language-specific clusters into a unified semantic space by step 5000, confirming effective cross-modal and script-agnostic alignment. This cross-lingual robustness differentiates NetraEmbed from prior vision–language baselines, which exhibit substantial performance degradation on non-English scripts.
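The reported cluster-convergence analysis can be reproduced in spirit with a PCA projection of embeddings grouped by language; the snippet below uses scikit-learn with synthetic stand-in vectors, since actual model outputs are not included in this summary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in embeddings: (num_samples, 2560) model outputs for parallel queries,
# with one language label per row (random data here, for illustration only).
rng = np.random.default_rng(0)
languages = ["en", "hi", "ta", "zh", "ar"]
embs = rng.standard_normal((500, 2560)).astype(np.float32)
labels = rng.choice(languages, size=500)

# Project to 2-D and inspect whether language-specific clusters have merged.
proj = PCA(n_components=2).fit_transform(embs)
for lang in languages:
    centroid = proj[labels == lang].mean(axis=0)
    print(lang, centroid)  # well-aligned training yields near-coincident centroids
```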
7. Significance and Engineering Properties
NetraEmbed’s streamlined architecture—single shared transformer, unified loss, large-scale synthetic data, and Matryoshka representation learning—combines high accuracy and computational efficiency. Its design enables real-world deployment at scale (billion-document indices, high GPU throughput) with tunable memory–accuracy trade-offs. Compared with multi-vector paradigms, it delivers superior accuracy and interpretability (via single-vector embedding) while maintaining competitive English performance. The ~150% relative gain in cross-lingual retrieval underscores the practical impact of script-diverse, fully unified embedding and retrieval pipelines (Kolavi et al., 3 Dec 2025).
Plausible implications include improved applicability in global, multilingual document retrieval scenarios, especially for institutions, archives, and search systems confronting diverse scripts and document formats. Furthermore, NetraEmbed’s methodologically clear ablations and reproducible training dataset synthesis provide robust baselines for future work in universal multimodal document understanding.