ViDoRe Benchmarks for Visual Retrieval
- ViDoRe Benchmarks are a comprehensive suite of datasets and protocols for visual document and video retrieval, supporting rigorous evaluation of RAG systems.
- They evolved through successive releases (V1–V3) to address limitations in document diversity, query complexity, and multilingual challenges.
- The benchmarks anchor state-of-the-art model development in metrics such as nDCG@k, recall, mAP, and visual-grounding scores, while exposing trade-offs in storage and latency.
ViDoRe Benchmarks define a standard suite of datasets and evaluation protocols for vision-centric document and video retrieval, supporting the assessment of retrieval-augmented generation (RAG) systems and embedding models over complex, multimodal, and multilingual corpora. Evolving through three major releases (V1–V3 for documents, plus a recent video retrieval adaptation), ViDoRe benchmarks have become a central resource for measuring and advancing first-stage visual retrieval and end-to-end RAG pipelines in real-world, high-complexity scenarios.
1. Benchmark Evolution, Scope, and Design
The ViDoRe ("Visual Document Retrieval") suite was created to address the shortcomings of prior retrieval benchmarks—namely, their limited coverage of visual document types, saturation on synthetic/extractive queries, and neglect of multilingual and multi-hop scenarios. The suite has expanded in scale and rigor through three key releases:
- ViDoRe V1: Technical PDFs and presentation slides across ten sub-tasks, queries as free-form questions, in-domain train/val/test splits, graded human relevance labels. Focus: scientific and industry domains. Evaluation: nDCG@5 (Macé et al., 22 May 2025, Xu et al., 7 Jul 2025).
- ViDoRe V2: Reuse of V1 documents but zero-shot non-overlapping sub-tasks (multilingual scientific/industry reports, synthetic multilingual ESG data), no in-domain training data, multilingual query support, and more realistic query complexities. Evaluation: nDCG@5 (Macé et al., 22 May 2025, Xu et al., 7 Jul 2025).
- ViDoRe V3: Ten professional domains (~26,000 page images; queries in 6 languages), designed for rigorous end-to-end RAG evaluation, including retrieval, visual grounding (bounding box localization), and answer generation with human-reviewed annotations. Query types span open-ended, multi-hop, compare/contrast. Evaluation: nDCG@10, recall@K, mAP, IoU, F1, and answer correctness (Loison et al., 13 Jan 2026, Moreira et al., 3 Feb 2026).
Editor's term: "Document ViDoRe" refers specifically to these three sequential releases focusing on document image retrieval. A video adaptation is discussed below.
Benchmark Design Principles:
- Document pages are evaluated as images; most tasks avoid reliance on OCR.
- Human-graded multi-level relevance (e.g., 0/1/2), with verification for both retrieval and answer grounding.
- Queries are generated through a hybrid human-in-the-loop and synthetic process (especially in V2–V3), enhancing realism and difficulty.
2. Retrieval Task, Domains, and Query Structures
The ViDoRe retrieval task is: given a free-form natural language query, rank a corpus of document (or video) page images so that relevance is maximized among the top-k results (k = 5 for V1–V2; k = 10 for V3). A minimal sketch of this protocol follows the list below. Key aspects include:
- Document Domains: Scientific publications, charts, forms, tables, industry reports, regulatory documents, financial/HR/energy/pharma/maintenance records, and ESG reports. Each major release increases the number and diversity of domains (Macé et al., 22 May 2025, Loison et al., 13 Jan 2026, Moreira et al., 3 Feb 2026).
- Multilinguality: Queries and corpora in English, French, and up to six languages for V3. Training data typically monolingual; evaluation multilingual (Moreira et al., 3 Feb 2026).
- Queries: Open-ended (“Summarize the conclusion”), extractive, numerical, multi-hop, compare/contrast, boolean, and enumerative. V2 and V3 introduce blind contextual queries, long queries, and cross-document queries that require reasoning across multiple pages and documents (Macé et al., 22 May 2025, Loison et al., 13 Jan 2026).
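As a concrete reading of the protocol above, the following Python sketch shows the first-stage ranking contract. The `embed_query` callable and the precomputed `page_embeddings` are hypothetical stand-ins for any embedding model; single-vector dot-product scoring is assumed here for brevity (late-interaction scoring is sketched in Section 4).

```python
import numpy as np

def retrieve_top_k(query: str, page_embeddings: np.ndarray,
                   embed_query, k: int = 5) -> list[int]:
    """Rank a corpus of page embeddings against one query and return
    the indices of the k highest-scoring pages (k=5 for V1-V2, 10 for V3)."""
    q = embed_query(query)                    # hypothetical encoder -> (dim,)
    scores = page_embeddings @ q              # dot-product scores, shape (n_pages,)
    return np.argsort(-scores)[:k].tolist()   # page indices, best first
```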
Table: ViDoRe Dataset Evolution
| Version | Task Focus | Doc/Page Scale | Multilingual | k | Query Realism | Annotation Type |
|---|---|---|---|---|---|---|
| V1 | In-domain, extractive | ~10K pages | EN | 5 | Moderate | Human relevance only (graded 0/1/2) |
| V2 | Zero-shot, synthetic & multilingual | ~10K pages | EN/FR/ES/DE | 5 | High | Human, multi-step |
| V3 | RAG, multi-type, multilingual | ~26K pages | 6 languages | 10 | Maximum | Human relevance, bbox, answer |
3. Evaluation Metrics and Protocols
ViDoRe's retrieval evaluation centers on graded-relevance metrics designed for ranked retrieval, with grounding and generation metrics added in V3:
- nDCG@k (Normalized Discounted Cumulative Gain at k):

  $$\text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}, \qquad \text{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$$

  where $rel_i$ is the graded relevance of the item at position $i$ and $\text{IDCG@}k$ is the maximal possible DCG@k, i.e., the DCG of an ideal ordering. A computation sketch follows this list.
- Recall@k and mean Average Precision (mAP): Introduced in V3 to capture both coverage and ranking quality.
- Visual Grounding: IoU and Dice/F1 score between predicted and annotated bounding boxes on relevant pages (in V3) (Loison et al., 13 Jan 2026).
- Generation Quality: Exact Match, token-level F1, correctness of answer conditioned on retrieved context (in RAG pipelines) (Loison et al., 13 Jan 2026).
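For concreteness, here is a minimal Python computation of nDCG@k matching the linear-gain formulation above. The toy labels are illustrative; note that a faithful IDCG must be computed over all graded labels for the query, not only the retrieved items, which the optional `all_relevances` argument allows.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with linear gain: sum of rel_i / log2(i + 1) over ranks i = 1..k."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked, k, all_relevances=None):
    """nDCG@k: DCG of the produced ranking normalized by the ideal DCG.
    Pass every graded label for the query via `all_relevances` so the
    ideal ranking covers the full pool, not just the retrieved items."""
    pool = all_relevances if all_relevances is not None else ranked
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0

# Toy example: graded labels (0/1/2) of the top-5 retrieved pages, best first.
print(round(ndcg_at_k([2, 0, 1, 0, 0], k=5), 3))  # 0.95
```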
4. Modeling Techniques and State-of-the-Art Results
Success on ViDoRe hinges on a family of architectural, training, and data-centric advances:
- Late-Interaction Models (ColBERT-style): Store token-level embeddings for both query and document, computing relevance via MaxSim (or a TopKSim variant); see the scoring sketch after this list. This approach achieves fine-grained matching, particularly for visual elements (e.g., individual table cells) (Xu et al., 7 Jul 2025, Masry et al., 2 Nov 2025, Moreira et al., 3 Feb 2026).
- Bidirectional Attention: Replacing causal self-attention with bidirectional attention lets all tokens attend to one another, improving semantic alignment (Xu et al., 7 Jul 2025, Moreira et al., 3 Feb 2026).
- Two-Stage and Self-Supervised Training: Initial text-only pretraining (InfoNCE, Masked Contrastive Learning), followed by multimodal alignment on text–image pairs with hard negative mining (Xu et al., 7 Jul 2025, Masry et al., 2 Nov 2025).
- Domain and Data Balance: Uniform sampling across K-means clusters to mitigate domain bias (Moreira et al., 3 Feb 2026).
- Multilingual Augmentation: Automated translation of queries to maximize coverage on multilingual benchmarks (Moreira et al., 3 Feb 2026).
- Model Merging ("model soup"): Weight-averaging multiple fine-tuned checkpoints boosts performance by 0.8–1.5% over the best single checkpoint (Moreira et al., 3 Feb 2026); a minimal sketch also follows this list.
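As referenced above, here is a minimal PyTorch sketch of late-interaction scoring, assuming L2-normalized token embeddings so that dot products are cosine similarities. The TopKSim branch is my reading of the variant named above (averaging the k best document-token matches per query token), not a confirmed specification.

```python
import torch

def late_interaction_score(q_emb: torch.Tensor, d_emb: torch.Tensor,
                           top_k: int | None = None) -> torch.Tensor:
    """ColBERT-style scoring between token-level embeddings.

    q_emb: (n_query_tokens, dim), d_emb: (n_doc_tokens, dim), both
    assumed L2-normalized. top_k=None gives MaxSim; an integer gives
    a TopKSim-style variant."""
    sim = q_emb @ d_emb.T                     # (n_q, n_d) token-pair similarities
    if top_k is None:
        per_token = sim.max(dim=1).values     # MaxSim: best doc token per query token
    else:
        per_token = sim.topk(top_k, dim=1).values.mean(dim=1)  # TopKSim variant
    return per_token.sum()                    # sum over query tokens
```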
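And the model-soup sketch, in its simplest uniform-averaging form; the cited work may instead use greedy checkpoint selection or non-uniform weights.

```python
import torch

def model_soup(state_dicts: list[dict]) -> dict:
    """Uniform 'model soup': average the parameters of fine-tuned
    checkpoints that share an architecture, key by key."""
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0].keys()}
```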
State-of-the-Art Model Results:
| Model (Size) | V1 nDCG@5 | V2 nDCG@5 | V3 nDCG@10 |
|---|---|---|---|
| Nemotron ColEmbed 8B (Moreira et al., 3 Feb 2026) | 84.80 | 84.80 | 63.42 |
| Llama-Nemoretriever 3B (Xu et al., 7 Jul 2025) | 91.0 | 63.5 | — |
| ColMate-Pali-3B (Masry et al., 2 Nov 2025) | — | 57.61 | — |
Editor's term: "Late-interaction" overwhelmingly dominates ViDoRe V2–V3; bi-encoder alternatives require substantial reranking to remain competitive, trading a notable accuracy penalty for significant efficiency gains.
5. Storage, Efficiency, and Practical Trade-offs
Late-interaction mechanisms, though state-of-the-art in accuracy, come at a pronounced storage and latency cost due to the need to store and compare large token-level embedding matrices per document:
- Storage: For 1M pages, a full-dimension (4096D) late-interaction index for the 8B model requires ~5.9 TB, while a single-vector (2048D) bi-encoder index needs only ~3.8 GB (Moreira et al., 3 Feb 2026); the arithmetic is sketched after this list.
- Accuracy-Storage Trade-off: Reducing embedding dimensionality (e.g., to 512D or 128D) yields storage savings of up to 97% at the cost of a 4–5% relative drop in retrieval quality (Moreira et al., 3 Feb 2026, Xu et al., 7 Jul 2025).
- Latency: MaxSim or TopKSim computations require specialized vector DB support; approximate nearest neighbor search or quantization (e.g., fp16, int8, late pooling) can modestly reduce inference cost (Masry et al., 2 Nov 2025, Moreira et al., 3 Feb 2026).
- Pipeline Optimization: A hybrid pipeline (bi-encoder retrieval with late-interaction reranking) or compact cross-encoders can recover up to ~91% of full performance at orders-of-magnitude lower storage, at minor latency cost (+2.4 s/query when reranking 25 candidates) (Xu et al., 7 Jul 2025).
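The storage figures above can be sanity-checked with back-of-the-envelope arithmetic. The fp16 precision and ~790 visual tokens per page are assumptions chosen to reproduce the reported numbers, not values taken from the cited papers.

```python
# Storage math under assumed fp16 precision (2 bytes/value) and an
# assumed ~790 visual tokens per page (chosen to match ~5.9 TB).
PAGES = 1_000_000
BYTES = 2  # fp16

late_interaction = PAGES * 790 * 4096 * BYTES   # token-level index, 4096D
single_vector    = PAGES * 1   * 2048 * BYTES   # one 2048D vector per page
reduced          = PAGES * 790 * 128  * BYTES   # late interaction at 128D

print(f"late interaction 4096D: {late_interaction / 2**40:.1f} TiB")        # ~5.9 TiB
print(f"single vector 2048D:    {single_vector / 2**30:.1f} GiB")           # ~3.8 GiB
print(f"late interaction 128D:  {reduced / late_interaction:.1%} of full")  # ~3% (97% saved)
```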
6. Video Retrieval Adaptation: VideoEval and ViDoRe
The VideoEval benchmark introduces a "ViDoRe" framework for the video domain, emphasizing efficient, broad-spectrum evaluation of Video Foundation Models (VFMs):
- VidTAB: Few-shot adaptation protocols for 8 classification tasks across 5 domains (action, behavior, moderation, quality, emotion).
- VidEB: Frozen embedding evaluations for fine-grained video retrieval (scene copy, complementary/incident scenes, copy detection).
- Metrics: mAP, micro-AP, recall@K, TA-score (average adaptation accuracy), all consistent with ViDoRe’s methodology (Li et al., 2024).
- Cost Efficiency: ~5,000 training samples and ~20 GPU-hours per model (vs. >100 GPU-hours and hundreds of thousands of samples for prior video benchmarks).
The protocol recycles core ViDoRe principles: uniform evaluation across diverse domains, strict few-shot and zero-shot regimes, standardized metrics, and rigorous, low-cost sampling and modeling recommendations. The "ViDoRe" label is considered synonymous with video domain representation benchmarking in the context of VideoEval (Li et al., 2024).
7. Insights, Limitations, and Future Directions
ViDoRe benchmarks have driven significant advances in both visual document and video retrieval, but current leaderboards highlight substantial open challenges:
- Room for Improvement: Even top late-interaction models score below 65 nDCG@10 on ViDoRe V3, with particularly low scores on multi-hop, open-ended, and non-textual queries (Loison et al., 13 Jan 2026, Moreira et al., 3 Feb 2026).
- Generalization Gaps: Models that saturate on V1 drop sharply on V2/V3, underscoring persistent limitations in real-world, multilingual, and cross-domain generalization (Macé et al., 22 May 2025, Moreira et al., 3 Feb 2026).
- Modality and Query Complexity: Visual grounding and table/chart-heavy queries remain notably difficult (up to 10 percentage points worse than comparable text-only queries) (Loison et al., 13 Jan 2026).
- Community-driven Expansion: ViDoRe is positioned as a “living benchmark,” with open infrastructure for community extension (new datasets/tasks/metrics/modalities) and collaborative leaderboard maintenance (Macé et al., 22 May 2025).
- RAG Pipeline Integration: V3 enables, for the first time, end-to-end evaluation of multimodal retrieval-augmented generation, bridging retrieval, localization, and answer generation in unified benchmarks (Loison et al., 13 Jan 2026).
A plausible implication is that future state-of-the-art systems will require advances in both modeling (improved multimodal/cross-lingual fusion, enhanced negative mining, and transfer) and system engineering (storage/latency optimization for late-interaction) to meet the real-world expectations embodied in ViDoRe V2/V3 and the VideoEval "ViDoRe" adaptation. The continued progression of ViDoRe benchmarks is expected to define the practical frontier for multimodal retrieval and RAG systems.