ColNetraEmbed: Multilingual Multimodal Retrieval

Updated 8 December 2025
  • ColNetraEmbed is a multilingual and multimodal document retrieval model that uses a ColBERT-style multi-vector late-interaction mechanism to preserve token-level details.
  • It leverages a vision–language backbone with LoRA adapters and contrastive training on a large synthetic corpus spanning 22 languages, achieving state-of-the-art results on multilingual retrieval benchmarks.
  • The model facilitates fine-grained cross-lingual and cross-script semantic alignments, enabling robust retrieval across diverse languages and visual document formats.

ColNetraEmbed is a multilingual, multimodal document retrieval model introduced in the M3DR framework, specifically designed for robust semantic search across diverse languages and scripts. Unlike models that reduce queries and documents to single dense vectors, ColNetraEmbed adopts a ColBERT-style multi-vector paradigm, preserving token-level granularity for both text and document image modalities. This late-interaction mechanism sustains fine-grained, cross-script alignments necessary for retrieval tasks in global, multilingual environments. ColNetraEmbed leverages a vision–language backbone (Gemma 3 4B-IT) augmented with LoRA adapters, and is trained using contrastive objectives over large-scale synthetic multilingual corpora, yielding state-of-the-art performance in cross-lingual and monolingual settings across twenty-two languages (Kolavi et al., 3 Dec 2025).

1. Model Architecture

ColNetraEmbed utilizes a multi-vector encoding scheme, employing the Gemma 3 4B-IT vision–language backbone for both textual and visual inputs. The query, a sequence of $n_q$ tokens, is mapped to $Q \in \mathbb{R}^{n_q \times h}$, and the document image, processed into a fixed grid of 256 visual patch tokens, to $D \in \mathbb{R}^{256 \times h}$; in both cases, hidden states are L2-normalized along the embedding dimension ($h = 128$).
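
A minimal PyTorch sketch of this encoding step is shown below; the `MultiVectorHead` module and its name are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative projection head (an assumption, not from the paper's code):
# maps backbone hidden states of width d_model down to the h = 128 retrieval
# dimension, then L2-normalizes each token embedding.
class MultiVectorHead(torch.nn.Module):
    def __init__(self, d_model: int, h: int = 128):
        super().__init__()
        self.proj = torch.nn.Linear(d_model, h, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, n_tokens, d_model) from the VLM backbone
        x = self.proj(hidden_states)        # (batch, n_tokens, 128)
        return F.normalize(x, p=2, dim=-1)  # unit-norm rows, as sim(q_i, d_j) requires
```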

Distinctively, ColNetraEmbed eschews pooling operations: each query token embedding $q_i$ and document token embedding $d_j$ is retained for late interaction. Token-level similarity is the dot product of the unit-normalized embeddings:

$$\text{sim}(q_i, d_j) = q_i^{\top} d_j, \qquad \|q_i\| = \|d_j\| = 1$$

The aggregate document–query score sums, over query tokens, the maximum similarity attained across all document tokens (MaxSim):

$$S(Q, D) = \sum_{i=1}^{n_q} \max_{1 \leq j \leq n_d} \text{sim}(q_i, d_j)$$

with optional normalization by query length:

$$S_{\text{norm}}(Q, D) = \frac{S(Q, D)}{n_q}$$

This late-interaction mechanism maintains token-level and patch-level information, crucial for cross-lingual and cross-modal alignment.
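
As a concrete reference, the scoring above can be written in a few lines of PyTorch; the function name and single-pair signature are assumptions for illustration.

```python
import torch

def late_interaction_score(Q: torch.Tensor, D: torch.Tensor,
                           normalize: bool = False) -> torch.Tensor:
    """ColBERT-style MaxSim score between one query and one document.

    Q: (n_q, h) L2-normalized query token embeddings.
    D: (n_d, h) L2-normalized document patch embeddings (n_d = 256 here).
    """
    sim = Q @ D.T                        # (n_q, n_d) pairwise dot products
    per_token = sim.max(dim=1).values    # best-matching patch per query token
    score = per_token.sum()              # S(Q, D)
    return score / Q.shape[0] if normalize else score  # optional S_norm(Q, D)
```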

2. Training Procedure

ColNetraEmbed is optimized via the ColBERT InfoNCE loss, capitalizing on in-batch negatives. For a training batch of size $B$ with positive pairs $(Q_i, D_i^+)$, the objective is

$$L = -\frac{1}{B} \sum_{i=1}^{B} \log \left[ \frac{\exp\left(S(Q_i, D_i^+)/\tau\right)}{\sum_{j=1}^{B} \exp\left(S(Q_i, D_j)/\tau\right)} \right]$$

where $\tau = 0.02$ is the temperature parameter. In this regime, all non-matching documents in the batch serve as negatives; no explicit hard negative mining is performed.
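
A minimal sketch of this in-batch objective, assuming PyTorch and equal-length (unpadded) token sequences for brevity:

```python
import torch
import torch.nn.functional as F

def colbert_infonce_loss(Q: torch.Tensor, D: torch.Tensor,
                         tau: float = 0.02) -> torch.Tensor:
    """In-batch InfoNCE over late-interaction scores.

    Q: (B, n_q, h) query token embeddings, L2-normalized.
    D: (B, n_d, h) document patch embeddings, L2-normalized.
    """
    # All-pairs token similarities: entry [b, c] pairs query b with document c.
    sim = torch.einsum("bqh,cdh->bcqd", Q, D)    # (B, B, n_q, n_d)
    scores = sim.max(dim=-1).values.sum(dim=-1)  # MaxSim then sum: (B, B)
    # Diagonal entries are positives; every other document is a negative.
    targets = torch.arange(Q.shape[0], device=Q.device)
    return F.cross_entropy(scores / tau, targets)
```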

Training utilizes AdamW (applied only to LoRA parameters), with a learning rate of $2 \times 10^{-4}$; LoRA adapters are configured at rank 32, $\alpha = 32$, and dropout $0.1$. Training is distributed over four A100 80GB GPUs with DDP, running 2 epochs over $\approx 250{,}000$ pairs derived from the synthetic corpus, requiring approximately 6–8 hours with mixed precision.
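
A hedged configuration sketch of these settings using the Hugging Face peft library; the checkpoint id, auto class, and target module names are assumptions, not details confirmed by the paper.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

# Assumed checkpoint id and loading class for the Gemma 3 4B-IT backbone.
backbone = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it")

lora_config = LoraConfig(
    r=32,                 # adapter rank, as reported
    lora_alpha=32,        # alpha = 32, as reported
    lora_dropout=0.1,     # dropout = 0.1, as reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
)
model = get_peft_model(backbone, lora_config)  # base weights stay frozen

# AdamW over the trainable (LoRA) parameters only, lr = 2e-4 as reported.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4,
)
```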

3. Multilingual and Multimodal Alignment

To address cross-lingual and cross-script challenges, ColNetraEmbed is trained on the Nayana IR synthetic parallel corpus, comprising $\approx 1$ million document–query pairs spanning 22 languages across scripts including Latin, Devanagari, Dravidian scripts, CJK, and Arabic. This corpus is generated from 50,000 English document images by (a) layout region detection, (b) layout-aware translation via NLLB-200 and language-specific machine translation (MT) models, (c) rendering using authentic Noto Sans fonts with script-specific typographic rules, and (d) synthesizing five query archetypes per image (factual, long-answer, multiple-choice, cross-paragraph, keyword search) via Llama 3.1 Vision and Llama 4 Scout.

By exposing the model to parallel documents and queries in many scripts and languages, ColNetraEmbed learns to ground script-specific tokens (e.g., Hindi, English, Arabic) to shared visual regions and semantic clusters, facilitating robust retrieval even when queries and documents are expressed in disparate linguistic forms.

4. Evaluation Protocol and Comparative Metrics

Evaluation was conducted on both the Nayana-IR benchmark (22 languages) and the English-centric ViDoRe v2 benchmark. Performance metrics are tabulated below.

| Metric | Cross-lingual | Monolingual | English (ViDoRe v2) |
|---|---|---|---|
| NDCG@5 | 0.637 | 0.670 | 0.551 |
| Recall@10 | 0.700 | 0.764 | – |
| MAP@10 | 0.610 | 0.645 | – |
| MRR@10 | 0.610 | 0.686 | – |

On cross-lingual Nayana-IR retrieval, ColNetraEmbed demonstrates a $+124\%$ relative improvement in NDCG@5 compared to ColPali-v1.3 (from $0.284$ to $0.637$), and a $+63\%$ gain in monolingual NDCG@5 (from $0.410$ to $0.670$). Performance remains competitive (NDCG@5 $= 0.551$) on ViDoRe v2, matching English-centric baselines. Per-language breakdowns indicate uniform retrieval quality across all 22 languages and scripts, whereas English-only models fail on non-Latin content.
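
For reference, the headline NDCG@k metric can be computed as in the sketch below (the standard definition, not code from the paper); with binary relevance and a single relevant document per query it reduces to $1/\log_2(\text{rank}+1)$.

```python
import math

def ndcg_at_k(ranked_relevances: list[float], k: int = 5) -> float:
    """NDCG@k for one query; `ranked_relevances` lists relevance grades
    in the order the system ranked the documents."""
    dcg = sum(rel / math.log2(i + 2)
              for i, rel in enumerate(ranked_relevances[:k]))
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2)
               for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the only relevant document lands at rank 2 -> NDCG@5 = 1/log2(3) ≈ 0.631
print(ndcg_at_k([0, 1, 0, 0, 0], k=5))
```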

5. Significance of Late-Interaction Paradigm

The late-interaction architecture, wherein query and document tokens are retained for fine-grained aggregation rather than pooled, enables ColNetraEmbed to preserve semantic and visual associations at the token/patch level. This design is essential for document retrieval in multilingual settings where queries and documents may differ not only lexically but also script-wise. The approach allows accurate grounding of a query token (e.g., Hindi script) to its corresponding region in a document, irrespective of translation or transcription, resulting in script-agnostic retrieval capability.

This suggests that token-level multi-vector interaction provides greater flexibility and accuracy than single-vector models in heterogeneous retrieval corpora. A plausible implication is that future multilingual document retrieval systems should adopt similar architectures to maintain high retrieval fidelity across languages and scripts.

6. Context Within Multimodal Retrieval Research

ColNetraEmbed operationalizes and extends the ColBERT retrieval paradigm to multimodal, multilingual scenarios via vision–language backbones and synthetic corpora. Its deployment demonstrates that universal document retrieval is achievable with carefully designed data synthesis, model adaptation, and interaction schemes. The Nayana IR corpus and evaluation benchmarks further exemplify best practices for testing multilingual and multimodal retrieval systems.

The results support the broader claim that vision-based, ColBERT-style strategies enable state-of-the-art performance on multilingual cross-modal document retrieval tasks, securing robust and accurate matches between queries and documents without language or script restrictions (Kolavi et al., 3 Dec 2025).
