
Visual Document Retrieval: Concepts & Advances

Updated 5 October 2025
  • Visual Document Retrieval is defined as retrieving visually-rich documents by leveraging both visual and textual cues, including layout, tables, and figures.
  • VDR systems employ advanced image encoders like Swin Transformers with patch-based tokenization and late interaction techniques to enhance semantic matching.
  • Key challenges include optimizing storage efficiency and multilingual robustness, prompting innovations such as adaptive pruning and semantic clustering.

Visual Document Retrieval (VDR) is the task of retrieving visually-rich document pages or segments in response to queries—typically textual, but occasionally visual or multimodal—by leveraging models that operate directly on the visual appearance of document images, often augmented or fused with textual semantics. Modern VDR aims to capture both textual and non-textual cues such as layout, tables, figures, and other visual structures, overcoming the limitations of OCR-based systems that often discard layout and visual information essential for reliable evidence localization, question answering, or downstream reasoning in retrieval-augmented generation (RAG) settings.

1. Conceptual Foundations and Motivations

VDR arose from the observation that traditional text-centric pipelines—relying purely on OCR-extracted text for indexing—are insufficient for visually complex documents where semantic meaning is co-determined by layout, font, tables, figures, and spatial relationships. These pipelines exhibit substantial bottlenecks in cases involving noisy OCR, multi-lingual documents, or where information is encoded visually (such as in forms, invoices, scientific articles, or diagrams) (Faysse et al., 27 Jun 2024, Dong et al., 20 May 2025, Osmulski et al., 16 May 2025, Bakkali et al., 2023).

Pioneering VDR frameworks (e.g., ColPali, ColQwen2, GlobalDoc) introduce models that learn to map document page images—and, when available, queries or supporting text—into embedding spaces that reflect global and local image cues, facilitating efficient semantic retrieval at page-, layout-, or even patch-level granularity (Faysse et al., 27 Jun 2024, Qiao et al., 12 May 2025, Bakkali et al., 2023).

2. Model Architectures and Core Mechanisms

Contemporary VDR systems universally employ vision-language models (VLMs) or multimodal LLMs (MLLMs), with architectural innovations along several axes:

  • Image Encoder Backbone: High-capacity backbones such as Swin Transformers or Vision Transformers (ViT) are responsible for extracting patch-based representations while preserving spatial relationships (Liu et al., 10 Apr 2024, Bakkali et al., 2023). Hierarchical and shifted window attention is leveraged for scale handling and local/global context capture.
  • Tokenization Strategy: Document images are partitioned into fixed-size patches, each yielding an embedding; in multi-vector (late interaction) designs, these embeddings are retained individually rather than pooled (Faysse et al., 27 Jun 2024, Qiao et al., 12 May 2025, Günther et al., 23 Jun 2025).
  • Late Interaction Mechanism: Query embeddings (typically per-token textual embeddings) are scored against all patch embeddings of each document: for each query token, the maximum similarity over patches is taken, and these per-token maxima are summed:

s_{Q,D} = \sum_{i=1}^{|Q|} \max_{j=1}^{|D|} \mathrm{sim}\left(E_{q_i}, E_{d_j}\right)

This approach, popularized by ColBERT and adapted in ColPali, ColQwen2, and jina-embeddings-v4, is central to recent VDR advances (Faysse et al., 27 Jun 2024, Qiao et al., 12 May 2025, Günther et al., 23 Jun 2025).

  • Dual/Hybrid Encoders: Many systems include both visual and textual encoders, with either cross-modal fusion (as in GlobalDoc’s cross-attention transformer (Bakkali et al., 2023)) or parallel unimodal pipelines subsequently fused under constraint (as in VisDoMRAG (Suri et al., 14 Dec 2024)).
  • Content and Instruction Filtering: Filtering mechanisms remove content-agnostic or instruction-irrelevant tokens at intermediate layers, allowing high-resolution input while constraining the token set for computation and memory efficiency (e.g., HRVDA’s content and instruction filter (Liu et al., 10 Apr 2024)).
  • Storage and Efficiency Considerations: With late interaction, memory consumption becomes a dominant concern. Solutions include adaptive pruning (DocPruner (Yan et al., 28 Sep 2025)), clustering/merging reductions (Light-ColPali (Ma et al., 5 Jun 2025)), and compression schemes.
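The late interaction (MaxSim) score defined above can be sketched in a few lines of NumPy. This is an illustrative implementation of the general scoring rule, not code from any of the cited systems:

```python
import numpy as np

def late_interaction_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token embedding, take the
    best-matching document patch (cosine similarity), then sum over
    query tokens.  Shapes: query_emb (|Q|, d), doc_emb (|D|, d)."""
    # L2-normalize rows so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                        # (|Q|, |D|) similarity matrix
    return float(sim.max(axis=1).sum())  # max over patches, sum over tokens
```

In a real index this score is computed for every candidate document and the top-k documents are returned, which is why multi-vector storage costs dominate at scale.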

3. Objectives, Training, and Benchmarks

Self-supervised and Retrieval-Oriented Pretraining

  • Contrastive and Clustering Losses: Objectives such as Learning-to-Mine (L2M), Learning-to-Unify (L2U), and Learning-to-Reorganize (L2R) jointly enforce intra- and inter-modality alignment, semantic clustering, and transferable pretraining, enabling robust retrieval in the absence of dense supervision (Bakkali et al., 2023).
  • Representation Compression: VDocRAG introduces compression pretraining to distill visual cues into dense, semantically-aligned tokens (Tanaka et al., 14 Apr 2025).
  • Downstream Tasks: Benchmarks include few-shot document image classification, content-based image retrieval, and visual RAG; metrics span Recall@K, nDCG@M, ANLS, F1, and word-overlap-based QA scores (Bakkali et al., 2023, Faysse et al., 27 Jun 2024, Dong et al., 20 May 2025).
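Two of the retrieval metrics listed above, Recall@K and nDCG, can be computed as follows; this is a generic sketch of the standard definitions, assuming binary or graded relevance labels per document ID:

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k with graded relevance (relevance maps doc id -> gain)."""
    gains = [relevance.get(doc, 0.0) for doc in ranked_ids[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```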


4. Efficiency, Scalability, and Storage

The late interaction (multi-vector) paradigm yields substantial gains in retrieval precision and recall over single-vector methods but incurs orders-of-magnitude increases in memory and computational overhead, especially at scale (Faysse et al., 27 Jun 2024, Qiao et al., 12 May 2025, Yan et al., 28 Sep 2025).

| Method | Memory Usage | Retrieval Quality (NDCG@5) | Key Findings |
|---|---|---|---|
| Vanilla ColPali | O(N_patches) | Baseline (highest) | Best quality, largest footprint |
| Random Pruning | ~X% reduction | Severe drop if >20% pruned | Not query-adaptive |
| Semantic Clustering | Order-of-magnitude reduction | 95–98% retained | Fine-tuning helps |
| DocPruner | 50–60% reduction | ≤0.4% drop | Adaptive per document/type |

Token merging (semantic clustering) applied late in the encoder pipeline, followed by targeted fine-tuning, achieves optimal storage-reduction/quality trade-offs, outperforming naive or even attention-based pruning. DocPruner employs document-adaptive thresholds (per-document, per-patch attention) to prune as aggressively as possible without appreciable quality loss (Yan et al., 28 Sep 2025, Ma et al., 5 Jun 2025).
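A minimal sketch of token merging via clustering: the page's N patch embeddings are replaced by a much smaller set of centroids. This uses plain k-means as a stand-in for the semantic clustering in the cited storage-reduction methods, whose exact algorithms differ:

```python
import numpy as np

def merge_patch_tokens(patch_emb: np.ndarray, n_keep: int) -> np.ndarray:
    """Reduce a page's multi-vector representation from N patch embeddings
    (shape (N, d)) to n_keep centroids (shape (n_keep, d)) via k-means.
    Illustrative only; real systems cluster semantically similar tokens."""
    rng = np.random.default_rng(0)
    centroids = patch_emb[rng.choice(len(patch_emb), n_keep, replace=False)]
    for _ in range(10):  # a few Lloyd iterations suffice for a sketch
        # Assign each patch to its nearest centroid (squared Euclidean).
        dists = ((patch_emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned patches.
        for c in range(n_keep):
            members = patch_emb[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids
```

The merged centroids are then scored with the same MaxSim rule, trading a small drop in retrieval quality for an order-of-magnitude storage reduction.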

Single-vector representations (mean-pooled) remain essential for massive scale and latency-constrained deployments; hybrid systems leveraging truncatable embeddings (Matryoshka Representation Learning (Günther et al., 23 Jun 2025)) support flexible trade-offs.
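Truncatable (Matryoshka-style) embeddings allow a single stored vector to be cut to a smaller prefix at query time. A minimal sketch of the truncation step, assuming the model was trained so that embedding prefixes remain meaningful:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dim` coordinates and
    re-normalize, trading retrieval quality for storage and latency."""
    head = vec[:dim]
    return head / np.linalg.norm(head)
```

A deployment can then index, say, 128-dimensional prefixes for cheap first-stage retrieval and re-score shortlisted candidates with the full vectors.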

5. Multimodality, Generalization, and Multilinguality

VDR’s evolution is tightly linked to its ability to generalize across languages, domains, and modalities:

  • Multimodality: Dual-pipeline systems (e.g., VisDoMRAG) perform both visual and textual retrieval, fusing evidence chains under explicit consistency constraints (Suri et al., 14 Dec 2024). Zero-shot methods (e.g., SERVAL) utilize a generate-and-encode pipeline in which advanced VLMs produce detailed document descriptions, embedded via standard text encoders to facilitate retrieval across languages and layouts (Nguyen et al., 18 Sep 2025).
  • Multilinguality: Multilingual VLMs and benchmarks (MIRACL-VISION, VisR-Bench) demonstrate that visual models currently lag behind text-based retrieval pipelines in non-English and low-resource language scenarios, with performance gaps as high as 59.7% NDCG@10 (Osmulski et al., 16 May 2025, Chen et al., 10 Aug 2025).
  • Cross-Domain Robustness: Models pre-trained or fine-tuned on datasets with diverse layouts, genres, and instruction modalities show the strongest performance and generalization in retrieval and RAG settings (Liu et al., 10 Apr 2024, Tanaka et al., 14 Apr 2025, Suri et al., 14 Dec 2024).

6. Security, Adversarial Robustness, and Practical Deployment

  • Security Risks: VDR-based M-RAG systems are susceptible to poisoning attacks via single adversarial images that can maliciously boost their own retrieval score for arbitrary or targeted queries, potentially causing denial-of-service or misinformation outputs (Shereen et al., 2 Apr 2025). This vulnerability exists because M-VLMs can be “fooled” into retrieving the adversarial document, and generation objectives can reinforce the attack through end-to-end adversarial optimization.
  • Defenses: Filtering of over-retrieved pages, re-ranking, usage of detection-based judging LLMs, and monitoring high retrieval frequencies offer partial mitigation. There is a need for adversarial training and embedding-level robustness in next-generation VDR systems.
  • Efficiency in Practice: Advanced storage-reduction and post-processing pipelines (adaptive pruning, semantic merging, truncatable representations) are directly applicable to enterprise search, legal and financial document archiving, regulatory compliance, and digital library systems (Ma et al., 5 Jun 2025, Yan et al., 28 Sep 2025).
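One of the partial mitigations above, monitoring retrieval frequencies, can be sketched as a simple counter over query results. This is an illustrative heuristic, not a defense from the cited work; the threshold and window would need tuning per corpus:

```python
from collections import Counter

class RetrievalFrequencyMonitor:
    """Flags documents retrieved suspiciously often across distinct queries,
    a heuristic signal for retrieval-poisoning: a single adversarial page
    that boosts its own score tends to dominate results for many queries."""

    def __init__(self, max_share: float = 0.1, min_queries: int = 10):
        self.max_share = max_share      # tolerated fraction of queries
        self.min_queries = min_queries  # warm-up before flagging
        self.counts = Counter()
        self.total = 0

    def record(self, retrieved_doc_ids):
        """Call once per query with the IDs of the retrieved documents."""
        self.counts.update(retrieved_doc_ids)
        self.total += 1

    def flagged(self):
        """Return doc IDs whose retrieval share exceeds the threshold."""
        if self.total < self.min_queries:
            return []
        return [d for d, c in self.counts.items()
                if c / self.total > self.max_share]
```

Flagged documents would then be routed to re-ranking or a judging LLM rather than dropped outright, since legitimately popular pages can also be retrieved often.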

7. Emerging Directions and Open Challenges

The central open problem in VDR research remains robust, scalable, and language-agnostic fine-grained retrieval that combines minimal storage overhead with strong adversarial defenses, across the breadth of visually and semantically heterogeneous document corpora.


VDR is now at the core of document-centric RAG, multimodal QA, and enterprise search pipelines. Its continued evolution—anchored in scalable architectures, storage efficiency, deep semantic alignment, and multilinguality—will shape the next generation of intelligent document understanding systems.
