VDR-Bench: Visual Document Retrieval Benchmark
- VDR-Bench is a comprehensive suite of benchmarks for multimodal AI, assessing visual document retrieval, question answering, and domain-specific tasks like vessel draft reading.
- It employs advanced techniques such as iterative cropped search, knowledge-graph expansion, and cross-modal embedding to enforce robust multi-hop reasoning.
- Rigorous evaluation metrics, including Recall@K, mAP, and MADDE, ensure reproducible performance standards across varied real-world scenarios.
VDR-Bench encompasses a set of visual and multimodal benchmarks targeting advanced evaluation of artificial intelligence systems in visual document retrieval, vision-based question answering, and domain-specific perception applications. Prominent VDR-Bench variants include (1) VDR-Bench for complex multimodal search and reasoning in large-scale natural images, (2) the SDS KoPub VDR benchmark for visually rich Korean public documents, and (3) the VDR-Bench for vessel draft reading in maritime surveillance. Each instantiation establishes rigorous standards for multimodal reasoning, retrieval accuracy, dataset design, and task evaluation across diverse contexts (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025, Qu et al., 2023).
1. Definitions and Scope
VDR-Bench (expanded, depending on context, as Visual Document Retrieval, Vision-DeepResearch, or Vessel Draft Reading Benchmark) refers to benchmark datasets and evaluation protocols designed to measure AI system performance on tasks that require integrated visual understanding, search, and reasoning. Core scenarios include:
- Retrieval of document pages with complex visual structure (tables, charts, multi-column layouts), often with cross-modal queries.
- Vision-DeepResearch tasks involving iterative, entity-centric visual search and multi-hop question answering in web-scale image and text domains.
- Specialized industrial perception tasks, such as vessel draft reading, demanding robust detection, segmentation, recognition, and measurement under challenging environmental conditions.
VDR-Bench datasets are constructed to eliminate shortcuts (language priors, perfect image matching), enforce non-trivial visual and cross-modal reasoning, and set reproducible protocols for empirical research (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025, Qu et al., 2023).
2. Dataset Construction and Curation Protocols
2.1 Vision-DeepResearch (VDR-Bench) (Zeng et al., 2 Feb 2026)
This 2,000-instance VQA benchmark consists of questions requiring multi-stage, entity-focused visual search. Key curation steps include:
- Multi-domain pre-filtering: Images spanning domains such as Game (15.65%), Sports (15.35%), Science & Tech (12.10%), Art & Music (10.95%), etc.
- Manual cropping: Human annotators select visually salient entities; bounding boxes ("crops") drive downstream search.
- Visual/entity verification: Candidate entities extracted from retrieved results, confirmed by both MLLMs (Qwen3-VL) and human validation, ruling out shortcut solutions via full-image search.
- Knowledge-graph expansion: Each entity is programmatically linked to knowledge-graph nodes for multi-hop reasoning beyond direct recognition.
- Solvability/gold-answer verification: Only instances solvable by explicit search/reasoning chains are retained.
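The knowledge-graph expansion step can be illustrated with a toy sketch. The graph contents, the relation names, and the function `expand_paths` below are hypothetical and are not taken from the benchmark's pipeline; the code only shows how a verified entity might be extended into multi-hop relation paths from which questions could then be written.

```python
from typing import Dict, List, Tuple

# Toy knowledge graph: entity -> list of (relation, neighbor).
# All entities and relations here are invented for illustration.
Graph = Dict[str, List[Tuple[str, str]]]


def expand_paths(graph: Graph, start: str, hops: int) -> List[List[str]]:
    """Return every path of exactly `hops` edges that starts at `start`.

    A path is a flat list: [entity, relation, entity, relation, entity, ...].
    Entities already on a path are not revisited, which avoids cycles.
    """
    paths = [[start]]
    for _ in range(hops):
        extended = []
        for path in paths:
            visited = set(path[::2])  # entities sit at even positions
            for relation, neighbor in graph.get(path[-1], []):
                if neighbor not in visited:
                    extended.append(path + [relation, neighbor])
        paths = extended
    return paths


toy_graph: Graph = {
    "StadiumX": [("home_of", "ClubY")],
    "ClubY": [("founded_in", "1902"), ("coached_by", "CoachZ")],
}
two_hop = expand_paths(toy_graph, "StadiumX", 2)
```

Here a two-hop path such as StadiumX → ClubY → 1902 would support a question whose answer cannot be read off the recognized entity alone, which is the property the curation step aims for.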
2.2 Visual Document Retrieval (SDS KoPub VDR) (Lee et al., 7 Nov 2025)
- 361 public Korean documents (40,781 pages) selected from government/legal data, balanced across six public domains: society, environment, education, industry, diplomacy, and finance.
- 600 high-quality query–page–answer triples are generated from evidence on individual page images using multimodal LLMs and then refined/verified by experts.
- Queries annotated for modality: text-only (17%), visual-based (27%), or cross-modal (56%), enabling granular evaluation by reasoning type.
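To make the modality annotation concrete, the following sketch shows one way a query–page–answer triple could be represented and grouped by reasoning type. The `Triple` fields, the modality labels, and the helper names are assumptions for illustration and do not reflect the released dataset schema. At 600 triples, the reported shares correspond to roughly 102 text-only, 162 visual-based, and 336 cross-modal queries, assuming the percentages are rounded.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Triple:
    """One benchmark item; field names are illustrative, not the dataset schema."""
    query: str
    page_id: str
    answer: str
    modality: str  # "text", "visual", or "cross"


def modality_shares(triples: List[Triple]) -> Dict[str, float]:
    """Fraction of triples carrying each modality label."""
    counts = Counter(t.modality for t in triples)
    return {m: c / len(triples) for m, c in counts.items()}


def split_by_modality(triples: List[Triple]) -> Dict[str, List[Triple]]:
    """Group triples so that any metric can be reported per modality."""
    groups: Dict[str, List[Triple]] = defaultdict(list)
    for t in triples:
        groups[t.modality].append(t)
    return dict(groups)


def make(n: int, modality: str) -> List[Triple]:
    """Build n synthetic triples with the given modality label."""
    return [Triple(f"{modality}-q{i}", f"{modality}-p{i}", "n/a", modality)
            for i in range(n)]


# Synthetic set of 100 items mirroring the reported 17/27/56 split.
triples = make(17, "text") + make(27, "visual") + make(56, "cross")
shares = modality_shares(triples)
groups = split_by_modality(triples)
```

Reporting a retrieval metric separately over each group is what allows the per-modality comparisons described in the results.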
2.3 Vessel Draft Reading VDR-Bench (Qu et al., 2023)
- Lock-mounted camera images of river vessels; draft-mark detection (2,226 images), draft-scale recognition & segmentation (1,198 patches), held-out final depth estimation (45 images).
- Exhaustive manual annotation of bounding boxes (draft marks/characters), pixel-level segmentation (water/background), and scalar ground-truth depths.
- Dataset split into train/val/test partitions and augmented for illumination and appearance diversity.
3. Task Formulations and Evaluation Metrics
3.1 Vision-DeepResearch Search and Reasoning
- Multi-round cropped search: The system iteratively selects regions of interest, submits them as search queries, aggregates the retrieved entities and context, and composes a multi-hop answer. The source paper formalizes the process with pseudocode and equations that emphasize entity recall at each round; the loop stops when an LLM-based judge determines that the required entities have been retrieved (Zeng et al., 2 Feb 2026).
- Evaluation metrics:
  - Answer accuracy: LLM-judged match to gold answers (Tongyi DeepResearch judge).
  - Entity recall (ER): Proportion of VQA instances in which the system's retrieved entities, aggregated across rounds, fully cover the gold entity set.
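The search loop and the ER metric described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: crops are represented by plain strings, `search` is a stand-in for an image search backend, and `is_sufficient` is a stand-in for the LLM-based judge (the stub used in the example simply checks against a known gold set, which a real judge would not have access to). All names are hypothetical.

```python
from typing import Callable, Iterable, List, Set

Crop = str  # stand-in for a cropped image region


def multi_round_cropped_search(
    crops: Iterable[Crop],
    search: Callable[[Crop], Set[str]],
    is_sufficient: Callable[[Set[str]], bool],
    max_rounds: int = 5,
) -> Set[str]:
    """Search one crop per round, accumulating entities until the judge is satisfied."""
    found: Set[str] = set()
    for round_idx, crop in enumerate(crops):
        if round_idx >= max_rounds:
            break
        found |= search(crop)      # aggregate entities across rounds
        if is_sufficient(found):   # placeholder for the LLM-based stopping judge
            break
    return found


def entity_recall(retrieved: List[Set[str]], gold: List[Set[str]]) -> float:
    """Share of instances whose gold entity set is fully covered by the retrieved set."""
    covered = sum(1 for r, g in zip(retrieved, gold) if g <= r)
    return covered / len(gold)


# Toy search backend: each crop maps to the entities it would surface.
fake_index = {
    "crop_logo": {"ClubY"},
    "crop_player": {"PlayerQ"},
    "crop_banner": {"SponsorR"},
}
gold_entities = {"ClubY", "PlayerQ"}
found = multi_round_cropped_search(
    ["crop_logo", "crop_player", "crop_banner"],
    search=lambda c: fake_index.get(c, set()),
    is_sufficient=lambda ents: gold_entities <= ents,
)
er = entity_recall([found, {"A"}], [gold_entities, {"A", "B"}])
```

In this toy run the loop stops after the second crop, and ER is 0.5 because only the first of the two instances achieves full coverage of its gold set.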
3.2 Visual Document Retrieval
- Text-only retrieval: Embeddings from PDF-parsed text, with retrieval via nearest-neighbor search in the textual embedding space.
- Multimodal retrieval: Joint visual+textual embeddings via vision-language encoders, with similarity computed in the joint space.
- Evaluation metrics:
  - Recall@K: the fraction of queries whose ground-truth page appears among the top K retrieved pages.
  - Mean Reciprocal Rank (MRR) and mean Average Precision (mAP).
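The retrieval setup and metrics above can be sketched in a few lines. The example assumes cosine similarity as the nearest-neighbor criterion and exactly one gold page per query; neither detail is confirmed by the source, and the embedding vectors are invented. Under the single-gold-page assumption, average precision for a query reduces to its reciprocal rank, so mAP and MRR coincide and only MRR is computed here.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_pages(query_vec: Sequence[float],
               page_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return page ids sorted from most to least similar to the query."""
    return sorted(page_vecs,
                  key=lambda pid: cosine(query_vec, page_vecs[pid]),
                  reverse=True)


def recall_at_k(rankings: List[List[str]], gold: List[str], k: int) -> float:
    """Fraction of queries whose gold page is within the top k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(rankings: List[List[str]], gold: List[str]) -> float:
    """Average of 1/rank of the gold page; a missing gold page contributes 0."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Invented 2-D embeddings for three pages and two queries.
pages = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
queries = [[0.9, 0.1], [0.1, 0.9]]
gold_pages = ["p1", "p3"]

rankings = [rank_pages(q, pages) for q in queries]
r1 = recall_at_k(rankings, gold_pages, 1)
r2 = recall_at_k(rankings, gold_pages, 2)
mrr = mean_reciprocal_rank(rankings, gold_pages)
```

The same functions apply to text-only and multimodal retrieval alike; only the source of the page and query vectors changes.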
3.3 Vessel Draft Reading
- End-to-end depth estimation: Detection, recognition, segmentation, and geometric computation of draft depth from images.
- Evaluation metrics:
  - mAP@0.5 for detection, mean absolute draft-depth error (MADDE), and mean absolute vertical distance of the waterline (MAVD).
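The two regression-style metrics can be sketched as mean absolute errors. The function names and the sample values are hypothetical, and two details are assumptions rather than facts from the source: that the waterline is compared as one vertical pixel coordinate per sampled image column, and that a reported "±" denotes a standard deviation (the population form is used here).

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def mean_abs_error(pred: Sequence[float], true: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population std) of the absolute errors between two sequences."""
    errors = [abs(p - t) for p, t in zip(pred, true)]
    return mean(errors), pstdev(errors)


def madde(pred_depths_m: Sequence[float], true_depths_m: Sequence[float]) -> Tuple[float, float]:
    """Mean absolute draft-depth error over images, in metres."""
    return mean_abs_error(pred_depths_m, true_depths_m)


def mavd(pred_waterline_px: Sequence[float], true_waterline_px: Sequence[float]) -> Tuple[float, float]:
    """Mean absolute vertical distance between waterlines, in pixels.

    Assumes each sequence holds the waterline's y-coordinate at the same
    sampled image columns.
    """
    return mean_abs_error(pred_waterline_px, true_waterline_px)


# Invented values: three images for depth, four columns for one waterline.
depth_err, depth_std = madde([2.10, 3.45, 1.80], [2.00, 3.50, 1.95])
line_err, line_std = mavd([100.0, 102.0, 101.0, 99.0], [101.0, 101.0, 101.0, 101.0])
```

With these invented inputs the depth error averages about 0.10 m and the waterline distance averages 1 px.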
4. Baseline Performance and Empirical Findings
4.1 Vision-DeepResearch (Zeng et al., 2 Feb 2026)
- Answer accuracy improves consistently with the multi-round cropped-search workflow (MVF), with increments of 8–15 absolute percentage points compared to direct answer or CIS+TS (crop-image search plus text search). ER rises in parallel (e.g., Gemini-2.5-Pro: 13%→32%).
- All leading MLLMs (Gemini-2.5-Pro, GPT-5, Qwen3-VL variants) show comparable improvement upon integrating MVF with systematic cropping and search interleaving.
4.2 SDS KoPub VDR (Lee et al., 7 Nov 2025)
- Text-only retrieval: Best baseline Recall@3 is 0.77 (SDS-Multimodal-Embedding-7B), surpassing BGE-M3 by +13 percentage points.
- Multimodal retrieval: Best Recall@5 is 0.90, with Recall@10 reaching 0.95 (SDS-Multimodal-Embedding-7B), marking an 8.4% lead over text-only.
- Visual/cross-modal queries: Largest absolute performance gap when moving to multimodal retrieval, with Recall@3 for Visual queries jumping by 28 points.
- Domain sensitivity: Largest improvements observed for finance, diplomacy, and environment, especially where table/chart interpretation is critical.
4.3 Vessel Draft Reading (Qu et al., 2023)
- MTL-VDR (multi-task shared backbone): lowest MADDE (0.074 ± 0.076 m), highest FPS (> 60), and smallest waterline detection error (MAVD = 3.32 ± 5.68 px).
- Outperforms U-Net, PSPNet, HRNet, DeepLabV3+, and other segmentation baselines. Robust even under severe stains, with minor degradation in error.
5. Key Methodological Insights
- Eliminating shortcut solutions: VDR-Bench datasets are explicitly curated to prevent shortcutting via simple text cues or direct image matching, enforcing true multimodal reasoning (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025).
- Multi-hop reasoning via knowledge-graph expansion ensures that VQA tasks extend beyond simple classification/recognition, demanding longer reasoning chains.
- Cross-modal embedding and indexing (for documents) are essential for handling rich page structure and achieving state-of-the-art multimodal retrieval.
- Multi-task feature sharing (as in vessel draft reading) improves both accuracy and efficiency by exploiting shared spatial representations: MTL-VDR reports the lowest MADDE and the highest frame rate (> 60 FPS) among the compared methods.
- LLM-as-judge approaches in entity recall and answer accuracy are critical for evaluation beyond string match.
6. Challenges and Emerging Directions
- Complex layout and visual structure: Highly structured documents with non-standard layouts continue to challenge both OCR and vision-LLMs, especially for table joins, small legends, and nested columns (Lee et al., 7 Nov 2025).
- Iterative search planning: Lazy search and over-reliance on model priors necessitate explicit workflow designs, such as the MVF cropping strategy, to reveal models’ actual retrieval and reasoning capacity (Zeng et al., 2 Feb 2026).
- Segmentation and recognition under adverse conditions: Marine applications expose robustness gaps, specifically for heavily stained marks and variable lighting (Qu et al., 2023).
- Scaling to cross-page/multi-document retrieval: Real-world policy analysis and research use-cases call for expansion to multi-hop, multi-document reasoning and diversified user queries (Lee et al., 7 Nov 2025).
- Leaderboard standardization and protocol openness: Public platforms with consistent metrics (Recall@K, MRR, mAP, ER) are recommended to facilitate community benchmarking and progress (Lee et al., 7 Nov 2025, Zeng et al., 2 Feb 2026).
7. Impact and Research Implications
VDR-Bench, in its multiple instantiations, supports progress in multimodal search, perception, and reasoning by setting high empirical standards for both dataset curation and evaluation. Its design principles (entity-centric search, multi-hop reasoning, modality-specific performance disaggregation, and explicit adversarial curation) serve as reference points for academic and applied multimodal AI research. The systematic separation of text-only, visual-based, and cross-modal reasoning axes enables detailed model diagnostics, while the realistic test conditions are intended to support transfer to real-world applications (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025, Qu et al., 2023). A plausible implication is that as VDR-Bench expands (especially with multi-hop, multi-document, and authentic user queries), it will continue to shape both architectures and evaluation paradigms for multimodal retrieval and reasoning systems.