VDR-Bench: Visual Document Retrieval Benchmark

Updated 26 March 2026

VDR-Bench is a comprehensive suite of benchmarks for multimodal AI, assessing visual document retrieval, question answering, and domain-specific tasks like vessel draft reading.
It employs advanced techniques such as iterative cropped search, knowledge-graph expansion, and cross-modal embedding to enforce robust multi-hop reasoning.
Rigorous evaluation metrics, including Recall@K, mAP, and MADDE, ensure reproducible performance standards across varied real-world scenarios.

VDR-Bench encompasses a set of visual and multimodal benchmarks targeting advanced evaluation of artificial intelligence systems in visual document retrieval, vision-based question answering, and domain-specific perception applications. Prominent VDR-Bench variants include (1) VDR-Bench for complex multimodal search and reasoning in large-scale natural images, (2) the SDS KoPub VDR benchmark for visually rich Korean public documents, and (3) the VDR-Bench for vessel draft reading in maritime surveillance. Each instantiation establishes rigorous standards for multimodal reasoning, retrieval accuracy, dataset design, and task evaluation across diverse contexts (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025, Qu et al., 2023).

1. Definitions and Scope

VDR-Bench (Visual Document/DeepResearch/Vessel Draft Reading Benchmark, context-dependent) refers to benchmark datasets and evaluation protocols designed to measure AI system performance on tasks that require integrated visual understanding, search, and reasoning. Core scenarios include:

Retrieval of document pages with complex visual structure (tables, charts, multi-column layouts), often with cross-modal queries.
Vision-DeepResearch tasks involving iterative, entity-centric visual search and multi-hop question answering in web-scale image and text domains.
Specialized industrial perception tasks, such as vessel draft reading, demanding robust detection, segmentation, recognition, and measurement under challenging environmental conditions.

VDR-Bench datasets are constructed to eliminate shortcuts (language priors, perfect image matching), enforce non-trivial visual and cross-modal reasoning, and set reproducible protocols for empirical research (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025, Qu et al., 2023).

2. Dataset Construction and Curation Protocols

2.1 Vision-DeepResearch (VDR-Bench) (Zeng et al., 2 Feb 2026)

This 2,000-instance VQA benchmark consists of questions requiring multi-stage, entity-focused visual search. Key curation steps include:

Multi-domain pre-filtering: Images spanning domains such as Game (15.65%), Sports (15.35%), Science & Tech (12.10%), Art & Music (10.95%), etc.
Manual cropping: Human annotators select visually salient entities; bounding boxes ("crops") drive downstream search.
Visual/entity verification: Candidate entities extracted from retrieved results, confirmed by both MLLMs (Qwen3-VL) and human validation, ruling out shortcut solutions via full-image search.
Knowledge-graph expansion: Each entity is programmatically linked to knowledge-graph nodes for multi-hop reasoning beyond direct recognition.
Solvability/gold-answer verification: Only instances solvable by explicit search/reasoning chains are retained.
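
The knowledge-graph expansion step can be illustrated with a small traversal. This is only a sketch under stated assumptions: the adjacency-list graph format, the function name `expand_entity`, and the placeholder entities and relations are all hypothetical and are not taken from the benchmark's actual pipeline. It enumerates relation paths from a verified entity so that paths with two or more hops can seed multi-hop questions.

```python
from collections import deque


def expand_entity(graph, start, max_hops=2):
    """Enumerate relation paths of length 1..max_hops that begin at `start`.

    `graph` maps an entity to a list of (relation, target) pairs. This format
    is an assumption made for illustration only.
    """
    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # do not extend paths beyond the hop budget
        for relation, target in graph.get(node, []):
            # Skip targets already on the path to avoid cycles.
            if target == start or any(target == t for _, t in path):
                continue
            new_path = path + [(relation, target)]
            paths.append(new_path)
            queue.append((target, new_path))
    return paths


# Placeholder graph with hypothetical entities and relations.
toy_graph = {
    "entity_A": [("located_in", "entity_B")],
    "entity_B": [("capital_of", "entity_C")],
    "entity_C": [],
}
all_paths = expand_entity(toy_graph, "entity_A", max_hops=2)
multi_hop_paths = [p for p in all_paths if len(p) >= 2]
```

Keeping only `multi_hop_paths` mirrors the stated goal of reasoning beyond direct recognition: the answer entity sits at the end of a chain rather than being the entity visible in the crop.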

2.2 Visual Document Retrieval (SDS KoPub VDR) (Lee et al., 7 Nov 2025)

361 public Korean documents (40,781 pages) selected from government/legal data, balanced across six public domains: society, environment, education, industry, diplomacy, and finance.
600 high-quality query–page–answer triples are generated from evidence on individual page images using multimodal LLMs and then refined/verified by experts.
Queries annotated for modality: text-only (17%), visual-based (27%), or cross-modal (56%), enabling granular evaluation by reasoning type.
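
Because every query carries a modality tag, retrieval scores can be broken down by reasoning type. The following is a minimal sketch, assuming each evaluated query is recorded as a `(modality, rank)` pair in which `rank` is the 1-based position of the gold page, or `None` if it was not retrieved. The record format, the function name, and the toy values are illustrative assumptions, not benchmark data.

```python
from collections import defaultdict


def recall_at_k_by_modality(records, k):
    """Compute Recall@k separately for each modality tag."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for modality, rank in records:
        totals[modality] += 1
        if rank is not None and rank <= k:
            hits[modality] += 1
    return {m: hits[m] / totals[m] for m in totals}


# Toy records (made up for illustration).
toy_records = [
    ("text", 1), ("text", 4),
    ("visual", 2), ("visual", None),
    ("cross-modal", 3), ("cross-modal", 1),
]
per_modality = recall_at_k_by_modality(toy_records, k=3)
```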

2.3 Vessel Draft Reading VDR-Bench (Qu et al., 2023)

Lock-mounted camera images of river vessels; draft-mark detection (2,226 images), draft-scale recognition & segmentation (1,198 patches), held-out final depth estimation (45 images).
Exhaustive manual annotation of bounding boxes (draft marks/characters), pixel-level segmentation (water/background), and scalar ground-truth depths.
Dataset split into train/val/test partitions and augmented for illumination and appearance diversity.

3. Task Formulations and Evaluation Metrics

3.1 Vision-DeepResearch Search and Reasoning

Multi-round cropped search: Iteratively select regions of interest, submit as search queries, aggregate entities and context, and compose multi-hop answers. Pseudocode and mathematical formulations define the process, emphasizing entity recall at each round:

$B_t = \underset{B \subset I}{\mathrm{argmax}}\, f(B; \text{RetrievedEntities})$

with stopping via an LLM-based judge when required entities are retrieved (Zeng et al., 2 Feb 2026).
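
The loop described above can be sketched as a greedy procedure. This is a hedged illustration, not the authors' implementation: the crop format, the `search`, `score`, and `is_sufficient` callables, and the `max_rounds` cap are all assumptions. Here `score` plays the role of $f$ and `is_sufficient` stands in for the LLM-based judge.

```python
from typing import Callable, Iterable, List, Set, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical crop format


def multi_round_cropped_search(
    candidate_boxes: Iterable[Box],
    search: Callable[[Box], Set[str]],
    score: Callable[[Box, Set[str]], float],
    is_sufficient: Callable[[Set[str]], bool],
    max_rounds: int = 5,
) -> Tuple[Set[str], List[Box]]:
    """Greedily pick the best-scoring crop each round until the judge stops."""
    remaining = list(candidate_boxes)
    retrieved: Set[str] = set()
    chosen: List[Box] = []
    for _ in range(max_rounds):
        if not remaining or is_sufficient(retrieved):
            break
        # B_t = argmax_B f(B; RetrievedEntities) over the remaining crops.
        best = max(remaining, key=lambda b: score(b, retrieved))
        remaining.remove(best)
        chosen.append(best)
        retrieved |= search(best)  # aggregate entities across rounds
    return retrieved, chosen


# Toy setup: a dictionary stands in for the search engine, and the scorer
# peeks at it (an oracle used only to keep the example self-contained).
toy_index = {
    (0, 0, 10, 10): {"entity_A"},
    (10, 0, 20, 10): {"entity_B"},
    (0, 10, 10, 20): set(),
}
gold = {"entity_A", "entity_B"}
entities, crops = multi_round_cropped_search(
    toy_index.keys(),
    search=lambda b: toy_index[b],
    score=lambda b, seen: len(toy_index[b] - seen),
    is_sufficient=lambda seen: gold <= seen,
)
```

In a real system the scorer cannot see search results in advance, so it would have to be a model-driven estimate of which region is most likely to surface a new entity.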

Evaluation metrics:
- Answer accuracy: LLM-judged match to gold answers (Tongyi DeepResearch judge).
- Entity recall (ER): Proportion of VQA instances where system’s retrieved entities, aggregated per round, achieve full coverage against gold entity sets.
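
As a rough illustration of the ER definition, the sketch below counts an instance as covered only when the union of entities retrieved over all rounds contains every gold entity. Case-insensitive exact matching is an assumption used here as a simple stand-in for the LLM-based matching; the function name and the toy data are hypothetical.

```python
def entity_recall(retrieved_per_round, gold_entities):
    """Fraction of instances whose aggregated retrieved entities cover the gold set.

    retrieved_per_round: one list of per-round entity sets per instance.
    gold_entities: one gold entity set per instance.
    """
    def normalize(entities):
        return {e.strip().casefold() for e in entities}

    covered = 0
    for rounds, gold in zip(retrieved_per_round, gold_entities):
        aggregated = normalize(set().union(*rounds))
        if normalize(gold) <= aggregated:
            covered += 1
    return covered / len(gold_entities)


# Toy data: the first instance is fully covered, the second is not.
toy_retrieved = [
    [{"Entity_A"}, {"entity_b"}],
    [{"entity_c"}],
]
toy_gold = [
    {"entity_a", "entity_b"},
    {"entity_c", "entity_d"},
]
er = entity_recall(toy_retrieved, toy_gold)
```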

3.2 Visual Document Retrieval

Text-only retrieval: Embeddings from PDF-parsed text, with retrieval via nearest-neighbor search in the textual embedding space.
Multimodal retrieval: Joint visual+textual embeddings via vision-language encoders, with similarity computed in the joint space.
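
Both retrieval modes reduce to ranking pages by similarity to the query in an embedding space. The sketch below uses cosine similarity over toy vectors. The similarity function, the function names, and the vectors are assumptions for illustration; the source does not specify the encoder outputs or the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_pages(query_vec, page_vecs, k=3):
    """Return the ids of the k pages most similar to the query."""
    ranked = sorted(
        page_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [page_id for page_id, _ in ranked[:k]]


# Toy embeddings (made up); real embeddings would come from a text or
# vision-language encoder.
toy_pages = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.6, 0.8, 0.0],
    "p3": [0.0, 0.0, 1.0],
}
top_two = top_k_pages([0.8, 0.6, 0.0], toy_pages, k=2)
```
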
Evaluation metrics:
- Recall@$K$: $\mathrm{Recall}@K = \frac{1}{N}\sum_{i=1}^N \mathbb{I}\{r_i \leq K\}$, where $r_i$ is the rank of the gold page for query $i$ and $N$ is the number of queries.
- Mean Reciprocal Rank (MRR) and mean Average Precision (mAP).
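
These metrics can be computed directly from the rank of the gold page for each query. The sketch below assumes 1-based ranks, with `None` marking a query whose gold page was not retrieved; the function names and toy ranks are illustrative.

```python
def recall_at_k(ranks, k):
    """Share of queries whose gold page appears within the top k results."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank, counting unretrieved gold pages as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


toy_ranks = [1, 3, 2, 7, None]  # made-up ranks for five queries
r_at_3 = recall_at_k(toy_ranks, 3)
mrr = mean_reciprocal_rank(toy_ranks)
```

When each query has exactly one relevant page (an assumption here, suggested by the query–page–answer triple format), the average precision of a query equals its reciprocal rank, so mAP and MRR coincide.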

3.3 Vessel Draft Reading

End-to-end depth estimation: Detection, recognition, segmentation, and geometric computation of draft depth from images.
Evaluation metrics:
- Detection: average precision at a fixed IoU threshold.
- Depth estimation: mean absolute draft-depth error (MADDE).
- Waterline detection: mean absolute vertical distance of waterline (MAVD).
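
The two error metrics can be sketched as follows. This is an interpretation under stated assumptions rather than the paper's exact definition: MADDE is taken as the mean absolute difference between predicted and ground-truth depths in metres, reported with a standard deviation, and MAVD is taken as the mean absolute difference between predicted and ground-truth waterline y-coordinates in pixels, sampled at matching image columns. All values below are made up.

```python
from statistics import mean, pstdev


def madde(pred_depths_m, true_depths_m):
    """Mean and population standard deviation of absolute depth errors (metres)."""
    errors = [abs(p - t) for p, t in zip(pred_depths_m, true_depths_m)]
    return mean(errors), pstdev(errors)


def mavd(pred_waterline_y, true_waterline_y):
    """Mean absolute vertical distance between waterlines (pixels)."""
    return mean(abs(p - t) for p, t in zip(pred_waterline_y, true_waterline_y))


# Toy values for illustration only.
madde_mean, madde_std = madde([2.10, 3.45, 1.80], [2.00, 3.50, 1.80])
mavd_px = mavd([100, 102, 105, 103], [101, 101, 104, 106])
```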

4. Baseline Performance and Empirical Findings

4.1 Vision-DeepResearch (Zeng et al., 2 Feb 2026)

Answer accuracy improves consistently with the multi-round cropped-search workflow (MVF), with increments of 8–15 absolute percentage points compared to direct answer or CIS+TS (crop-image search plus text search). ER rises in parallel (e.g., Gemini-2.5-Pro: 13%→32%).
All leading MLLMs (Gemini-2.5-Pro, GPT-5, Qwen3-VL variants) show comparable improvement upon integrating MVF with systematic cropping and search interleaving.

4.2 SDS KoPub VDR (Lee et al., 7 Nov 2025)

Text-only retrieval: Best baseline Recall@3 is 0.77 (SDS-Multimodal-Embedding-7B), surpassing BGE-M3 by +13 percentage points.
Multimodal retrieval: Best Recall@5 is 0.90, with Recall@10 reaching 0.95 (SDS-Multimodal-Embedding-7B), marking an 8.4% lead over text-only.
Visual/cross-modal queries: Largest absolute performance gap when moving to multimodal retrieval, with Recall@3 for Visual queries jumping by 28 points.
Domain sensitivity: Largest improvements observed for finance, diplomacy, and environment, especially where table/chart interpretation is critical.

4.3 Vessel Draft Reading (Qu et al., 2023)

MTL-VDR (multi-task shared backbone): lowest MADDE (0.074 ± 0.076 m), highest FPS (> 60), and smallest waterline detection error (MAVD = 3.32 ± 5.68 px).
MTL-VDR outperforms U-Net, PSPNet, HRNet, DeepLabV3+, and other segmentation baselines. It remains robust under severe stains, with only a minor increase in error.

5. Key Methodological Insights

Eliminating shortcut solutions: VDR-Bench datasets are explicitly curated to prevent shortcutting via simple text cues or direct image matching, enforcing true multimodal reasoning (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025).
Multi-hop reasoning via knowledge-graph expansion ensures that VQA tasks extend beyond simple classification/recognition, demanding longer reasoning chains.
Cross-modal embedding and indexing (for documents) are essential for handling rich page structure and achieving state-of-the-art multimodal retrieval.
Multi-task feature sharing (as in vessel draft reading) yields dramatic gains in accuracy and efficiency by exploiting shared spatial representations.
LLM-as-judge approaches in entity recall and answer accuracy are critical for evaluation beyond string match.

6. Challenges and Emerging Directions

Complex layout and visual structure: Highly structured documents with non-standard layouts continue to challenge both OCR and vision-LLMs, especially for table joins, small legends, and nested columns (Lee et al., 7 Nov 2025).
Iterative search planning: Lazy search and over-reliance on model priors necessitate explicit workflow designs, such as the MVF cropping strategy, to reveal models’ actual retrieval and reasoning capacity (Zeng et al., 2 Feb 2026).
Segmentation and recognition under adverse conditions: Marine applications expose robustness gaps, specifically for heavily stained marks and variable lighting (Qu et al., 2023).
Scaling to cross-page/multi-document retrieval: Real-world policy analysis and research use-cases call for expansion to multi-hop, multi-document reasoning and diversified user queries (Lee et al., 7 Nov 2025).
Leaderboard standardization and protocol openness: Public platforms with consistent metrics (Recall@K, MRR, mAP, ER) are recommended to facilitate community benchmarking and progress (Lee et al., 7 Nov 2025, Zeng et al., 2 Feb 2026).

7. Impact and Research Implications

VDR-Bench, in its multiple instantiations, advances multimodal search, perception, and reasoning by setting high empirical standards for both dataset curation and evaluation. Its design principles (entity-centric search, multi-hop reasoning, modality-specific performance disaggregation, and explicit adversarial curation) serve as reference points for academic and applied multimodal AI research. The systematic separation of text-only, visual-only, and cross-modal reasoning axes enables detailed model diagnostics, while the realistic test conditions are intended to support transfer to real-world applications (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025, Qu et al., 2023). A plausible implication is that as VDR-Bench expands (especially toward multi-hop, multi-document, and authentic user queries), it will continue to shape both architectures and evaluation paradigms for multimodal retrieval and reasoning systems.


References (3)

Zeng et al. (2 Feb 2026). Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models.

Lee et al. (7 Nov 2025). SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents.

Qu et al. (2023). Multi-Task Learning-Enabled Automatic Vessel Draft Reading for Intelligent Maritime Surveillance.
