
VDR-Bench: Visual Document Retrieval Benchmark

Updated 26 March 2026
  • VDR-Bench is a comprehensive suite of benchmarks for multimodal AI, assessing visual document retrieval, question answering, and domain-specific tasks like vessel draft reading.
  • It employs advanced techniques such as iterative cropped search, knowledge-graph expansion, and cross-modal embedding to enforce robust multi-hop reasoning.
  • Rigorous evaluation metrics, including Recall@K, mAP, and MADDE, ensure reproducible performance standards across varied real-world scenarios.

VDR-Bench encompasses a set of visual and multimodal benchmarks targeting advanced evaluation of artificial intelligence systems in visual document retrieval, vision-based question answering, and domain-specific perception applications. Prominent VDR-Bench variants include (1) VDR-Bench for complex multimodal search and reasoning in large-scale natural images, (2) the SDS KoPub VDR benchmark for visually rich Korean public documents, and (3) the VDR-Bench for vessel draft reading in maritime surveillance. Each instantiation establishes rigorous standards for multimodal reasoning, retrieval accuracy, dataset design, and task evaluation across diverse contexts (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025, Qu et al., 2023).

1. Definitions and Scope

VDR-Bench (Visual Document/DeepResearch/Vessel Draft Reading Benchmark, context-dependent) refers to benchmark datasets and evaluation protocols designed to measure AI system performance on tasks that require integrated visual understanding, search, and reasoning. Core scenarios include:

  • Retrieval of document pages with complex visual structure (tables, charts, multi-column layouts), often with cross-modal queries.
  • Vision-DeepResearch tasks involving iterative, entity-centric visual search and multi-hop question answering in web-scale image and text domains.
  • Specialized industrial perception tasks, such as vessel draft reading, demanding robust detection, segmentation, recognition, and measurement under challenging environmental conditions.

VDR-Bench datasets are constructed to eliminate shortcuts (language priors, perfect image matching), enforce non-trivial visual and cross-modal reasoning, and set reproducible protocols for empirical research (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025, Qu et al., 2023).

2. Dataset Construction and Curation Protocols

2.1 Vision-DeepResearch VQA Benchmark

This 2,000-instance VQA benchmark consists of questions requiring multi-stage, entity-focused visual search (Zeng et al., 2 Feb 2026). Key curation steps include:

  • Multi-domain pre-filtering: Images span domains including Game (15.65%), Sports (15.35%), Science & Tech (12.10%), and Art & Music (10.95%), among others.
  • Manual cropping: Human annotators select visually salient entities; the resulting bounding boxes ("crops") drive downstream search.
  • Visual/entity verification: Candidate entities are extracted from retrieved results and confirmed by both MLLMs (Qwen3-VL) and human validation, ruling out shortcut solutions via full-image search.
  • Knowledge-graph expansion: Each entity is programmatically linked to knowledge-graph nodes for multi-hop reasoning beyond direct recognition.
  • Solvability/gold-answer verification: Only instances solvable by explicit search/reasoning chains are retained.

2.2 SDS KoPub VDR

The Korean public-document benchmark is built as follows (Lee et al., 7 Nov 2025):

  • 361 public Korean documents (40,781 pages) are selected from government/legal data and balanced across six public domains: society, environment, education, industry, diplomacy, and finance.
  • 600 high-quality query–page–answer triples are generated from evidence on individual page images using multimodal LLMs and then refined and verified by experts.
  • Queries are annotated for modality: text-only (17%), visual-based (27%), or cross-modal (56%), enabling granular evaluation by reasoning type.

2.3 Vessel Draft Reading

The maritime benchmark is built as follows (Qu et al., 2023):

  • Images of river vessels from lock-mounted cameras are organized into three subsets: draft-mark detection (2,226 images), draft-scale recognition and segmentation (1,198 patches), and held-out final depth estimation (45 images).
  • Bounding boxes (draft marks/characters), pixel-level segmentation (water/background), and scalar ground-truth depths are annotated manually and exhaustively.
  • The dataset is split into train/val/test partitions and augmented for illumination and appearance diversity.
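To make the reported proportions concrete, the following sketch converts the percentages quoted above into approximate instance counts. It assumes the percentages apply exactly to the stated totals (2,000 VQA instances, 600 Korean-document queries); the variable names are illustrative and the rounded counts are derived here, not reported in the source.

```python
# Approximate counts implied by the percentages quoted in this section.
# Assumption: each percentage applies exactly to the stated total.

vqa_total = 2000
domain_share_pct = {
    "Game": 15.65,
    "Sports": 15.35,
    "Science & Tech": 12.10,
    "Art & Music": 10.95,
}
domain_counts = {
    name: round(vqa_total * pct / 100) for name, pct in domain_share_pct.items()
}

kopub_queries = 600
modality_share_pct = {"text-only": 17, "visual-based": 27, "cross-modal": 56}
modality_counts = {
    name: round(kopub_queries * pct / 100) for name, pct in modality_share_pct.items()
}

# Average document length in the Korean public-document corpus.
pages_per_document = 40781 / 361

print(domain_counts)
print(modality_counts)
print(round(pages_per_document, 1))
```

The four named domains cover a little over half of the VQA instances; the remainder falls into domains the summary does not list.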

3. Task Formulations and Evaluation Metrics

3.1 Vision-DeepResearch Search and Reasoning

  • Multi-round cropped search: The system iteratively selects a region of interest, submits it as a search query, aggregates the retrieved entities and context, and composes a multi-hop answer. At round t, the crop B_t is the region of image I that maximizes a scoring function f given the entities retrieved so far:

$$B_t = \underset{B \subset I}{\mathrm{argmax}}\, f(B; \text{RetrievedEntities})$$

The search stops when an LLM-based judge determines that the required entities have been retrieved (Zeng et al., 2 Feb 2026).

  • Evaluation metrics:
    • Answer accuracy: LLM-judged match to gold answers (Tongyi DeepResearch judge).
    • Entity recall (ER): Proportion of VQA instances where system’s retrieved entities, aggregated per round, achieve full coverage against gold entity sets.
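The loop and the entity-recall metric described above can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: `score`, `search`, and `is_done` are hypothetical callables standing in for the scoring function f, the image search backend, and the LLM-based judge, and the toy data replaces the judge with a simple set-coverage check.

```python
from typing import Callable, Iterable

Box = tuple[int, int, int, int]  # (x0, y0, x1, y1) crop inside the image


def multi_round_cropped_search(
    candidate_boxes: Iterable[Box],
    score: Callable[[Box, set[str]], float],  # stands in for f(B; RetrievedEntities)
    search: Callable[[Box], set[str]],        # hypothetical: crop -> entity names
    is_done: Callable[[set[str]], bool],      # stands in for the LLM-based judge
    max_rounds: int = 5,
) -> tuple[list[Box], set[str]]:
    """Greedily pick the highest-scoring crop each round until the judge stops."""
    remaining = list(candidate_boxes)
    retrieved: set[str] = set()
    chosen: list[Box] = []
    for _ in range(max_rounds):
        if not remaining or is_done(retrieved):
            break
        best = max(remaining, key=lambda box: score(box, retrieved))  # argmax over crops
        remaining.remove(best)
        chosen.append(best)
        retrieved |= search(best)  # aggregate entities across rounds
    return chosen, retrieved


def entity_recall(retrieved_sets: list[set[str]], gold_sets: list[set[str]]) -> float:
    """Share of instances whose aggregated entities fully cover the gold set."""
    hits = sum(gold <= got for got, gold in zip(retrieved_sets, gold_sets))
    return hits / len(gold_sets)


# Toy data: a fake search index and an oracle-style score, for illustration only.
toy_index: dict[Box, set[str]] = {
    (0, 0, 50, 50): {"Eiffel Tower"},
    (50, 0, 100, 50): {"Gustave Eiffel"},
    (0, 50, 100, 100): set(),
}
gold = {"Eiffel Tower", "Gustave Eiffel"}

chosen, got = multi_round_cropped_search(
    candidate_boxes=toy_index,
    score=lambda box, seen: len(toy_index[box] - seen),
    search=lambda box: toy_index[box],
    is_done=lambda seen: gold <= seen,
)
er = entity_recall([got, {"a"}], [gold, {"a", "b"}])
```

In the toy run, two crops suffice to cover the gold entities, and the second instance passed to `entity_recall` counts as a miss because full coverage is required.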

3.2 Visual Document Retrieval

  • Text-only retrieval: Embeddings from PDF-parsed text, with retrieval via nearest-neighbor search in the textual embedding space.
  • Multimodal retrieval: Joint visual+textual embeddings via vision-language encoders, with similarity computed in the joint space.
  • Evaluation metrics:
    • Recall@K, where N is the number of queries and r_i is the rank of the gold page for query i:

    $$\mathrm{Recall}@K = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\{r_i \leq K\}$$

    • Mean Reciprocal Rank (MRR) and mean Average Precision (mAP).
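As a concrete reading of these metrics, the sketch below ranks pages by cosine similarity in an embedding space and computes Recall@K and MRR from the rank of the gold page. The embeddings and page identifiers are toy values, and the sketch assumes exactly one gold page per query; under that assumption average precision for a query equals its reciprocal rank, so mAP coincides with MRR.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_of_gold(query: list[float], pages: dict[str, list[float]], gold: str) -> int:
    """1-based rank r_i of the gold page under nearest-neighbor retrieval."""
    ranked = sorted(pages, key=lambda pid: cosine(query, pages[pid]), reverse=True)
    return ranked.index(gold) + 1


def recall_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose gold page is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks: list[int]) -> float:
    """Mean of 1 / r_i over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy page embeddings and (query embedding, gold page) pairs.
pages = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
queries = [([1.0, 0.1], "p1"), ([0.2, 1.0], "p3"), ([1.0, 0.0], "p2")]

ranks = [rank_of_gold(q, pages, gold) for q, gold in queries]
print(ranks, recall_at_k(ranks, 2), mean_reciprocal_rank(ranks))
```

The same functions apply to text-only and multimodal retrieval; only the encoder that produces the vectors changes.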

3.3 Vessel Draft Reading

  • End-to-end depth estimation: Detection, recognition, segmentation, and geometric computation of draft depth from images.

  • Evaluation metrics:

    • mAP@0.5 for detection, mean absolute draft-depth error (MADDE) for the final depth estimate, and mean absolute vertical distance (MAVD) of the waterline.
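The error metrics can be illustrated with a short sketch. Everything here is an assumption made for illustration: the geometric step is modeled as linear interpolation between two recognized draft marks, the waterline is reduced to a single pixel row per image, and the spread is reported as a population standard deviation. None of these choices is confirmed by the source, and the numbers are toy values.

```python
from statistics import mean, pstdev


def draft_from_marks(
    mark_a: tuple[float, float],
    mark_b: tuple[float, float],
    waterline_y: float,
) -> float:
    """Interpolate the draft (m) at the waterline row from two (pixel_y, metres) marks.

    Hypothetical geometric step: assumes depth varies linearly with pixel row.
    """
    (y_a, d_a), (y_b, d_b) = mark_a, mark_b
    metres_per_pixel = (d_b - d_a) / (y_b - y_a)
    return d_a + (waterline_y - y_a) * metres_per_pixel


def abs_error_stats(predicted: list[float], truth: list[float]) -> tuple[float, float]:
    """Mean and population standard deviation of the absolute errors."""
    errors = [abs(p - t) for p, t in zip(predicted, truth)]
    return mean(errors), pstdev(errors)


# Toy example: marks at rows 100 and 200 read 5.0 m and 4.8 m; waterline at row 250.
estimated = draft_from_marks((100.0, 5.0), (200.0, 4.8), waterline_y=250.0)

# MADDE: absolute error of draft depth, in metres.
madde, madde_std = abs_error_stats([4.70, 5.10, 3.95], [4.75, 5.00, 4.00])

# MAVD: absolute vertical offset of the predicted waterline, in pixels.
mavd, mavd_std = abs_error_stats([250.0, 310.0], [252.0, 305.0])
```

Reporting both the mean and the spread matches the "value ± value" form in which such results are usually quoted.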

4. Baseline Performance and Empirical Findings

4.1 Vision-DeepResearch Search and Reasoning

  • Answer accuracy improves consistently with the multi-round cropped-search workflow (MVF), with gains of 8–15 absolute percentage points over direct answering or CIS+TS (crop-image search plus text search). ER rises in parallel (e.g., Gemini-2.5-Pro: 13%→32%).
  • All leading MLLMs (Gemini-2.5-Pro, GPT-5, Qwen3-VL variants) show comparable improvement when MVF is integrated with systematic cropping and search interleaving.

4.2 Visual Document Retrieval

  • Text-only retrieval: Best baseline Recall@3 is 0.77 (SDS-Multimodal-Embedding-7B), surpassing BGE-M3 by +13 percentage points.
  • Multimodal retrieval: Best Recall@5 is 0.90, with Recall@10 reaching 0.95 (SDS-Multimodal-Embedding-7B), marking an 8.4% lead over text-only.
  • Visual/cross-modal queries: These show the largest absolute gain when moving to multimodal retrieval, with Recall@3 for visual queries rising by 28 points.
  • Domain sensitivity: The largest improvements are observed for finance, diplomacy, and environment, especially where table/chart interpretation is critical.

4.3 Vessel Draft Reading

  • MTL-VDR (multi-task shared backbone) achieves the lowest MADDE (0.074 ± 0.076 m), the highest FPS (> 60), and the smallest waterline detection error (MAVD = 3.32 ± 5.68 px).
  • MTL-VDR outperforms U-Net, PSPNet, HRNet, DeepLabV3+, and other segmentation baselines. It remains robust under severe staining, with only a minor increase in error.
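The figures in this section mix absolute percentage points with relative percentages, so it can help to keep the two apart when comparing systems. The helper below is a small sketch of that distinction using numbers quoted above; the BGE-M3 value is derived here from the stated +13-point gap and is not reported directly in the source.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = (after - before) * 100
    relative_pct = (after - before) / before * 100
    return points, relative_pct


# Entity recall for Gemini-2.5-Pro, 13% -> 32%, as quoted above.
er_points, er_relative = gain(0.13, 0.32)

# Recall@3 implied for BGE-M3 by "0.77, surpassing BGE-M3 by +13 percentage points".
# Derived value, not a figure stated in the source.
implied_bge_m3_recall3 = 0.77 - 0.13

print(round(er_points, 1), round(er_relative, 1), round(implied_bge_m3_recall3, 2))
```

A 19-point rise from a 13% baseline is a relative increase of well over 100%, which is why the unit should always accompany a reported gain.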

5. Key Methodological Insights

  • Eliminating shortcut solutions: VDR-Bench datasets are explicitly curated to prevent shortcutting via simple text cues or direct image matching, enforcing true multimodal reasoning (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025).
  • Multi-hop reasoning via knowledge-graph expansion ensures that VQA tasks extend beyond simple classification/recognition, demanding longer reasoning chains.
  • Cross-modal embedding and indexing (for documents) are essential for handling rich page structure and achieving state-of-the-art multimodal retrieval.
  • Multi-task feature sharing (as in vessel draft reading) yields dramatic gains in accuracy and efficiency by exploiting shared spatial representations.
  • LLM-as-judge approaches in entity recall and answer accuracy are critical for evaluation beyond string match.
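To illustrate how knowledge-graph expansion lengthens the reasoning chain, the sketch below enumerates relation paths of bounded length from a recognized entity. The graph, relation names, and `expand` function are toy assumptions for illustration, not the benchmark's actual pipeline or schema.

```python
from collections import deque

Triple = tuple[str, str, str]  # (head entity, relation, tail entity)


def expand(
    graph: dict[str, list[tuple[str, str]]],
    start: str,
    max_hops: int,
) -> list[list[Triple]]:
    """Breadth-first enumeration of relation paths with 1..max_hops edges."""
    paths: list[list[Triple]] = []
    queue: deque[tuple[str, list[Triple]]] = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # do not extend paths beyond the hop budget
        for relation, neighbour in graph.get(node, []):
            new_path = path + [(node, relation, neighbour)]
            paths.append(new_path)
            queue.append((neighbour, new_path))
    return paths


# Toy graph keyed by entity; each value lists (relation, neighbour) edges.
toy_graph = {
    "Eiffel Tower": [("named_after", "Gustave Eiffel"), ("located_in", "Paris")],
    "Gustave Eiffel": [("born_in", "Dijon")],
}

paths = expand(toy_graph, "Eiffel Tower", max_hops=2)
two_hop = [p for p in paths if len(p) == 2]
```

A two-hop path such as the one ending in "Dijon" supports a question like "Where was the person this structure is named after born?", which cannot be answered by recognizing the pictured entity alone.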

6. Challenges and Emerging Directions

  • Complex layout and visual structure: Highly structured documents with non-standard layouts continue to challenge both OCR and vision-LLMs, especially for table joins, small legends, and nested columns (Lee et al., 7 Nov 2025).
  • Iterative search planning: Lazy search and over-reliance on model priors necessitate explicit workflow designs, such as the MVF cropping strategy, to reveal models’ actual retrieval and reasoning capacity (Zeng et al., 2 Feb 2026).
  • Segmentation and recognition under adverse conditions: Marine applications expose robustness gaps, specifically for heavily stained marks and variable lighting (Qu et al., 2023).
  • Scaling to cross-page/multi-document retrieval: Real-world policy analysis and research use-cases call for expansion to multi-hop, multi-document reasoning and diversified user queries (Lee et al., 7 Nov 2025).
  • Leaderboard standardization and protocol openness: Public platforms with consistent metrics (Recall@K, MRR, mAP, ER) are recommended to facilitate community benchmarking and progress (Lee et al., 7 Nov 2025, Zeng et al., 2 Feb 2026).

7. Impact and Research Implications

VDR-Bench, in its multiple instantiations, sets high empirical standards for dataset curation and evaluation in multimodal search, perception, and reasoning. Its design principles (entity-centric search, multi-hop reasoning, modality-specific performance disaggregation, and curation that explicitly removes shortcuts) serve as a reference point for academic and applied multimodal AI research. The systematic separation of text-only, visual-based, and cross-modal queries enables detailed model diagnostics, while the realistic test conditions are intended to support transfer to real-world applications (Zeng et al., 2 Feb 2026, Lee et al., 7 Nov 2025, Qu et al., 2023). A plausible implication is that as VDR-Bench expands (especially toward multi-hop, multi-document, and authentic user queries), it will continue to shape both architectures and evaluation paradigms for multimodal retrieval and reasoning systems.
