Nemotron ColEmbed V2: Top-Performing Late Interaction embedding models for Visual Document Retrieval

Published 3 Feb 2026 in cs.IR | (2602.03992v1)

Abstract: Retrieval-Augmented Generation (RAG) systems have been popular for generative applications, powering LLMs by injecting external knowledge. Companies have been trying to leverage their large catalog of documents (e.g. PDFs, presentation slides) in such RAG pipelines, whose first step is the retrieval component. Dense retrieval has been a popular approach, where embedding models are used to generate a dense representation of the user query that is closer to relevant content embeddings. More recently, VLM-based embedding models have become popular for visual document retrieval, as they preserve visual information and simplify the indexing pipeline compared to OCR text extraction. Motivated by the growing demand for visual document retrieval, we introduce Nemotron ColEmbed V2, a family of models that achieve state-of-the-art performance on the ViDoRe benchmarks. We release three variants - with 3B, 4B, and 8B parameters - based on pre-trained VLMs: NVIDIA Eagle 2 with Llama 3.2 3B backbone, Qwen3-VL-4B-Instruct and Qwen3-VL-8B-Instruct, respectively. The 8B model ranks first on the ViDoRe V3 leaderboard as of February 03, 2026, achieving an average NDCG@10 of 63.42. We describe the main techniques used across data processing, training, and post-training - such as cluster-based sampling, hard-negative mining, bidirectional attention, late interaction, and model merging - that helped us build our top-performing models. We also discuss compute and storage engineering challenges posed by the late interaction mechanism and present experiments on how to balance accuracy and storage with lower dimension embeddings.


Summary

The paper introduces a late-interaction VLM embedding architecture that encodes document pages as sequences of visual tokens for fine-grained matching.
It employs contrastive learning with hard-negative mining, cluster-based sampling, and model merging to enhance retrieval accuracy.
Empirical results on ViDoRe and MIRACL benchmarks demonstrate significant performance gains, despite increased storage and inference complexity.

Nemotron ColEmbed V2: Late Interaction VLM Embeddings for Visual Document Retrieval

Introduction and Motivation

Nemotron ColEmbed V2 advances visual document retrieval (VDR) by combining late-interaction embedding architectures with vision-language models (VLMs), achieving state-of-the-art performance on the ViDoRe benchmarks. The work addresses the limitations of both OCR-centric pipelines and single-vector VLM retrieval by using multi-vector late-interaction designs, which enable fine-grained matching between queries and document page images at the cost of higher storage and scoring compute than single-vector retrieval.

Architectural Overview

The Nemotron ColEmbed V2 family comprises three variants, differentiated by parameter count (3B, 4B, 8B) and VLM backbone: NVIDIA Eagle 2 with a Llama 3.2 3B backbone for the 3B model, and Qwen3-VL-4B-Instruct and Qwen3-VL-8B-Instruct for the 4B and 8B models. All leverage SigLIP 2 image encoders and LLM backbones reconfigured to use bidirectional attention for improved global context encoding.

The canonical bi-encoder design, used in many prior approaches, pools the representation of an entire query or document into a single vector, simplifying retrieval but dramatically compressing multimodal and layout information (Figure 1).

Figure 1: Bi-encoder architecture with pooling, using independent projection for query and document to global vectors.

The late interaction approach adopted in Nemotron ColEmbed V2 encodes each document page image as a sequence of visual token embeddings (via dynamic image tiling) and each query as a sequence of query token embeddings. During retrieval, similarities are computed between every query token and every document token; the MaxSim operator then takes, for each query token, the maximum similarity over the document tokens and sums these maxima into the final relevance score, giving a more expressive signal than a single pooled vector.

Figure 2: llama-nemotron-colembed-vl-3b-v2 architecture, highlighting dynamic image tiling and late interaction scoring.

Core to this approach is the decoupling of image-tiled encoding and late interaction scoring, supporting scalable pre-indexing of document embeddings while enabling rich, token-level matching at inference.
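The MaxSim scoring described in this section can be illustrated with a few lines of NumPy. This is a minimal sketch, not the released models' implementation: the function names, the toy shapes, and the choice to L2-normalize token embeddings before scoring are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction score between one query and one page.

    query_tokens: (num_query_tokens, dim)
    doc_tokens:   (num_doc_tokens, dim)
    """
    # Similarity between every query token and every document token.
    sim = query_tokens @ doc_tokens.T  # (num_query_tokens, num_doc_tokens)
    # For each query token keep its best-matching document token, then sum.
    return float(sim.max(axis=1).sum())


# Toy data (hypothetical sizes): 4 query tokens, 6 page tokens, 8 dimensions.
rng = np.random.default_rng(0)
query = l2_normalize(rng.normal(size=(4, 8)))
page_a = l2_normalize(rng.normal(size=(6, 8)))
# page_b contains exact copies of the query tokens, so every query token
# finds a match with cosine similarity 1.0.
page_b = np.vstack([page_a[:3], query])

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
```

Because page embeddings do not depend on the query, they can be computed once at indexing time; only the similarity matrix and the max/sum reduction run per query.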

Training Pipeline and Key Techniques

Nemotron ColEmbed V2 employs contrastive learning with InfoNCE loss, targeting maximized similarity for positive query-document pairs and minimized similarity for hard negatives. Several design and training strategies underpin the model's empirical success:

Bidirectional Attention in LLMs: Conversion from standard causal to bidirectional attention in the backbone unlocks access to full token context, yielding improved retrieval accuracy in embedding use cases.
Hard-Negative Mining: Utilizing a retrieval teacher for mining top-k hard negatives, filtered by positive-score margins, ensures high-quality challenging negatives and reduces label noise.
Cluster-Based Data Sampling: Training corpora are balanced by clustering positive contexts (K-Means on PCA-reduced embeddings, with the number of clusters chosen via the gap statistic) and then sampling uniformly across clusters to mitigate domain bias.
Cross-Lingual Translation: Augmentation of queries into multiple languages via Qwen3-235B enables robust cross-lingual performance, supporting diverse ViDoRe and MIRACL-Vision tasks.
Model Merging ("Souping"): The weights of several models, each trained on a different training blend or hyperparameter setting, are combined by weighted averaging after training, which yields a more robust aggregate of the learned representations and consistently improves NDCG metrics.
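The contrastive objective and the margin-based negative filter from the list above can be sketched as follows. This is a hedged illustration only: the summary does not give the exact filtering rule, so the "keep candidates scoring below a fraction of the positive score" criterion, the `margin` and `top_k` values, and the `temperature` are assumptions, and both function names are hypothetical.

```python
import numpy as np


def filter_hard_negatives(pos_score, cand_scores, margin=0.95, top_k=2):
    """Pick the top-k teacher-ranked candidates that score below
    margin * pos_score (assumed rule). Candidates too close to the positive
    are dropped because they may be unlabeled positives."""
    cand_scores = np.asarray(cand_scores, dtype=float)
    order = np.argsort(-cand_scores)  # highest teacher score first
    kept = [int(i) for i in order if cand_scores[i] < margin * pos_score]
    return kept[:top_k]


def info_nce(pos_sim, neg_sims, temperature=0.02):
    """InfoNCE loss for one query: negative log-softmax of the positive."""
    logits = np.concatenate([[pos_sim], neg_sims]) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_pos)


# Teacher scores for one query: the positive and five mined candidates.
teacher_pos = 0.80
teacher_cands = [0.79, 0.70, 0.40, 0.78, 0.65]
kept = filter_hard_negatives(teacher_pos, teacher_cands)

# The loss is small when the positive is well separated from the negatives
# and larger when the gap shrinks.
loss_easy = info_nce(0.9, [0.2, 0.1])
loss_hard = info_nce(0.3, [0.2, 0.1])
```

In this toy case the two candidates scoring 0.79 and 0.78 are discarded as too close to the positive, and the next two highest-scoring candidates are kept as hard negatives.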

The training paradigm is two-stage for the 3B model (text-only, then image), and single-stage (image corpus) for larger models, leveraging pre-trained VLMs for efficient cross-modal alignment.
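The model-merging ("souping") step listed among the techniques above amounts to a weighted average over parameter tensors. The sketch below assumes checkpoints are plain dictionaries of NumPy arrays with identical keys and shapes; `merge_checkpoints`, the parameter name, and the merge weights are hypothetical and not taken from the paper.

```python
import numpy as np


def merge_checkpoints(state_dicts, weights):
    """Weighted average of parameters across checkpoints.

    state_dicts: list of {param_name: ndarray} with matching keys and shapes
    weights:     one non-negative weight per checkpoint (normalized here)
    """
    weights = np.asarray(weights, dtype=float)
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    weights = weights / weights.sum()

    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints must share parameter names")

    # Average each parameter tensor independently.
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in keys
    }


# Two toy checkpoints, e.g. trained on different data blends.
ckpt_a = {"proj.weight": np.array([[1.0, 2.0], [3.0, 4.0]])}
ckpt_b = {"proj.weight": np.array([[3.0, 4.0], [5.0, 6.0]])}

# Weights 1:3 normalize to 0.25 and 0.75.
merged = merge_checkpoints([ckpt_a, ckpt_b], weights=[1.0, 3.0])
```

Averaging in weight space produces a single model, so inference cost is unchanged, unlike ensembling the outputs of several models.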

Empirical Evaluation

Nemotron ColEmbed V2 models establish superior performance on industry-standard benchmarks:

ViDoRe V3: The 8B model attains NDCG@10 of 63.42, a +3% margin over the next closest competitor, ranking first on this large-scale, cross-lingual, multi-domain benchmark. The 4B and 3B variants are also top-ranked for their size class.
ViDoRe V1/V2: The 8B model secures the second position with NDCG@5 of 84.80, closely trailing the leading model and surpassing near-peers of comparable scale.
MIRACL-Vision: The Nemotron ColEmbed V2 suite consistently outperforms alternatives across languages, especially in under-resourced languages, highlighting the efficacy of cross-lingual and cluster-balanced training blends.

In all benchmarks, the models maintain high accuracy with minimal loss when reducing embedding dimension to 512 or 128, highlighting their robustness to storage-accuracy tradeoffs.

Deployment Implications: Storage, Latency, and Efficiency

While late interaction architectures yield the best accuracy, this comes at considerable cost in storage (one vector per token rather than one per page) and inference complexity (MaxSim over every query-token × doc-token pair). For example, nemotron-colembed-vl-8b-v2, at its default 4096-dim embedding and 773 tokens/image, incurs 5.9 TB of storage per million pages (fp16), compared to less than 4 GB for a bi-encoder. Storage can be reduced with projection layers that lower the embedding dimension and with quantization (float16 or int8), but it remains orders of magnitude higher than for single-vector alternatives.
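The storage figure above follows from simple arithmetic on the quoted numbers (773 tokens per page, 4096 dimensions, 2 bytes per fp16 value). The sketch below reproduces it; the function name is hypothetical, and the unit convention is an assumption, since the result of roughly 5,900 GiB matches the quoted 5.9 TB only if 1 TB is read as 1,000 GiB.

```python
def multi_vector_bytes_per_page(tokens_per_page: int, dim: int, bytes_per_value: int) -> int:
    """Raw embedding size of one page: one dim-sized vector per token."""
    return tokens_per_page * dim * bytes_per_value


BYTES_PER_GIB = 1024 ** 3
NUM_PAGES = 1_000_000

# 8B model at its default settings, stored as fp16 (2 bytes per value).
per_page = multi_vector_bytes_per_page(tokens_per_page=773, dim=4096, bytes_per_value=2)
corpus_gib = per_page * NUM_PAGES / BYTES_PER_GIB

print(f"{per_page / 1e6:.2f} MB per page, {corpus_gib:,.0f} GiB per million pages")
```

The estimate covers raw vectors only; index structures and metadata in a vector database would add to it.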

Serving latency also scales superlinearly with corpus/document size, demanding specialized vector DB support. The authors discuss alternative pipelines (smaller bi-encoder + cross-encoder reranking) as better trade-offs for some production workloads.

Ablation and Model Compression

A detailed ablation shows that projecting late interaction embeddings to 512- or even 128-dim maintains 96% of the original NDCG@10 while reducing storage by 87–97%. However, even these compressed models incur substantial storage overhead, necessitating further research in quantization and hybrid late/early interaction strategies (e.g., late-pooling, MUVERA FDEs).
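The 87–97% range quoted above is what the dimension ratios alone predict, as the short calculation below shows. It assumes the token count per page (773) and fp16 storage stay fixed while only the embedding dimension changes; the helper names are hypothetical.

```python
FULL_DIM = 4096
TOKENS_PER_PAGE = 773
BYTES_PER_VALUE = 2  # fp16
NUM_PAGES = 1_000_000
BYTES_PER_GIB = 1024 ** 3


def storage_reduction(full_dim: int, reduced_dim: int) -> float:
    """Fraction of storage saved when only the embedding dimension shrinks."""
    return 1 - reduced_dim / full_dim


def corpus_gib(dim: int) -> float:
    """Raw multi-vector storage for the whole corpus, in GiB."""
    return TOKENS_PER_PAGE * dim * BYTES_PER_VALUE * NUM_PAGES / BYTES_PER_GIB


reductions = {dim: storage_reduction(FULL_DIM, dim) for dim in (512, 128)}
footprints = {dim: corpus_gib(dim) for dim in (512, 128)}

for dim in (512, 128):
    print(f"{dim}-dim: {reductions[dim]:.1%} smaller, {footprints[dim]:,.0f} GiB per million pages")
```

Even at 128 dimensions the footprint stays in the hundreds of GiB per million pages under these assumptions, which is why the compressed models are still described as storage-heavy relative to a single-vector index.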

Theoretical and Practical Implications

Nemotron ColEmbed V2 provides evidence that late-interaction, VLM-based embeddings can exploit the visual modality and layout of documents, significantly outperforming methods limited to text or global representations. The fine-grained multimodal matching, combined with cluster-balanced data sampling and cross-lingual training, generalizes well to real-world, multilingual, multi-domain corpora.

From a systems and algorithmic perspective, the paper provides a comprehensive analysis of the practical challenges—especially model size, storage, and latency tradeoffs—offering pathways to balance accuracy and deployment costs via data-efficient training, model merging, and embedding compression.

Future research is likely to focus on augmenting late-interaction designs with further advances in quantization, dynamic token reduction (late-pooling), and adaptive routing to minimize overhead without sacrificing retrieval fidelity.

Conclusion

Nemotron ColEmbed V2 provides a family of late-interaction VLM embedding models for visual document retrieval, with the 8B variant ranking first on ViDoRe V3 and the family showing strong results in multilingual and cross-domain scenarios. The methodology combines architectural changes, training techniques, and system-level tradeoff analysis, and it also makes clear the storage and compute challenges that remain for scalable VDR in enterprise and open-domain applications.
