BBox DocVQA: Spatially Grounded DocVQA
- The paper introduces BBox DocVQA, a dataset with explicit bounding box annotations to enable fine-grained spatial grounding in DocVQA.
- It employs a robust segment–judge–and–generate pipeline using SAM and GPT-5 to produce high-fidelity, spatially consistent QA pairs, complemented by human verification.
- Benchmarking reveals a notable reasoning versus localization gap, underscoring the need for modular, grounding-aware models in document AI.
BBox DocVQA (Bounding Box DocVQA) is a domain-specific dataset and evaluation framework designed to advance document visual question answering (DocVQA) through rigorous spatial grounding via explicit bounding box annotations. Developed to address the interpretability and fine-grained spatial reasoning deficiencies of prior DocVQA resources, BBox DocVQA operationalizes the alignment of vision-language model (VLM) answers to localized document evidence and enables systematic benchmarking of both reasoning and grounding capabilities (Yu et al., 19 Nov 2025).
1. Dataset Structure and Annotation Schema
BBox DocVQA comprises a large-scale training corpus and a human-curated benchmark, each explicitly encoding spatial context through bounding boxes. The training set covers 3,671 academic papers (42,380 pages), yielding 30,780 automatically generated question-answer (QA) pairs. The benchmark set consists of 80 papers (1,941 pages) and 1,623 QA pairs, with each sample manually verified for spatial and semantic fidelity.
Questions are exhaustively annotated in three formats:
- Single-Page Single-BBox (SPSBB)
- Single-Page Multi-BBox (SPMBB)
- Multi-Page Multi-BBox (MPMBB)
Region types per QA include text (49.9% of benchmark), image (36.7%), and table (13.4%). Each QA instance is linked to one or more axis-aligned bounding boxes in pixel coordinates relative to the top-left corner of the page. Multi-region QAs (SPMBB, MPMBB) annotate bounding boxes as lists of lists, supporting complex compositional evidence.
Distributional breakdowns:

| Format | Benchmark Count (% share) | Training Count (% share) |
|--------|---------------------------|--------------------------|
| SPSBB  | 749 (46.2%)               | 11,671 (37.9%)           |
| SPMBB  | 556 (34.3%)               | 7,510 (24.4%)            |
| MPMBB  | 318 (19.6%)               | 11,599 (37.7%)           |
This schema supports fine-grained spatial evaluation and enables disambiguation of answer provenance—critical for interpretable document AI.
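To make the schema concrete, the following is an illustrative Python sketch of what a multi-region QA record could look like. The field names and values are hypothetical placeholders, not the dataset's actual keys; only the coordinate convention (axis-aligned pixel boxes measured from the page's top-left corner, grouped as lists of lists) reflects the description above.

```python
# Hypothetical SPMBB-style record; keys and values are illustrative only.
# Boxes are axis-aligned (x_min, y_min, x_max, y_max) pixel rectangles
# measured from the top-left corner of each page, grouped as lists of lists.
example_record = {
    "question": "Which row of the ablation table removes the grounding loss?",
    "answer": "Row 3, which lowers accuracy by about two points.",
    "format": "SPMBB",                  # one of SPSBB | SPMBB | MPMBB
    "pages": [4],                       # page index/indices holding evidence
    "region_types": ["table", "text"],  # one type per evidence region
    "bboxes": [                         # one inner list per evidence group
        [[112, 640, 988, 912],          # table region
         [112, 930, 988, 1010]],        # caption text region
    ],
}
```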
2. Automated Construction: Segment–Judge–and–Generate Pipeline
BBox DocVQA employs a multi-stage automated pipeline to generate spatially grounded QA pairs with high fidelity:
- Region Segmentation: Utilizes the off-the-shelf Segment Anything Model (SAM, ViT-H variant, checkpoint sam_vit_h.pth) to generate candidate masks on each 300 DPI page, converting masks to rectangles and filtering by an area-ratio criterion so that only regions within the valid size range are retained. Each retained region is padded by 10 px, subject to image boundaries (see the sketch below).
- Semantic Judgment: The Qwen2.5-VL-72B model is prompted to classify regions as text, table, or image and to endorse only those meeting holistic criteria (≥30% content, no fragmented blocks, etc.). Redundant overlapping boxes (above an overlap threshold) are resolved by type-specific rules (keeping the smaller box for text, the larger for image/table).
- QA Generation: GPT-5 generates questions and answers strictly grounded in selected crops. Sampling strategy ensures balanced coverage across SPSBB/SPMBB/MPMBB and upweights table/image scenarios.
- Human Verification: Benchmark pages are manually cropped by two experts with adjudication by a third. QA pairs are further reviewed for logic, factuality, and difficulty, achieving near-perfect annotation consistency.
This pipeline provides both scalability (30K QA automatically generated) and reliability (1.6K QA human-verified) for the two main dataset splits.
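As a rough illustration of the region-segmentation step, the sketch below converts binary masks to padded, area-filtered rectangles. The area-ratio bounds are assumptions chosen for illustration (only the 10 px padding and 300 DPI rendering are stated above), and the sketch does not reproduce the pipeline's exact thresholds or overlap rules.

```python
import numpy as np

# Illustrative mask-to-rectangle filtering; MIN/MAX_AREA_RATIO are assumed
# values, not the paper's settings. Only the 10 px padding is stated above.
MIN_AREA_RATIO, MAX_AREA_RATIO = 0.005, 0.60   # assumed bounds on region/page area
PAD = 10                                       # padding in pixels (stated)

def masks_to_regions(masks, page_h, page_w):
    """Convert binary SAM-style masks to padded, area-filtered rectangles."""
    regions = []
    page_area = page_h * page_w
    for mask in masks:                         # each mask: (H, W) boolean array
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue
        x1, x2 = xs.min(), xs.max()
        y1, y2 = ys.min(), ys.max()
        area_ratio = ((x2 - x1) * (y2 - y1)) / page_area
        if not (MIN_AREA_RATIO <= area_ratio <= MAX_AREA_RATIO):
            continue                           # drop regions that are too small/large
        regions.append((
            max(0, x1 - PAD), max(0, y1 - PAD),            # pad, clipped to image bounds
            min(page_w, x2 + PAD), min(page_h, y2 + PAD),
        ))
    return regions
```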
3. Spatial Grounding and Evaluation Protocols
BBox DocVQA establishes stringent evaluation protocols for answer correctness and evidence localization:
- Bounding Box Representation: An axis-aligned rectangle $b = (x_{\min}, y_{\min}, x_{\max}, y_{\max})$ in page pixel coordinates, with $x_{\min} < x_{\max}$ and $y_{\min} < y_{\max}$.
- Intersection-over-Union (IoU): Quantifies spatial alignment between a predicted box $b_p$ and a ground-truth box $b_g$ as $\mathrm{IoU}(b_p, b_g) = \frac{\mathrm{area}(b_p \cap b_g)}{\mathrm{area}(b_p \cup b_g)}$.
For multi-box scenarios, IoU is calculated per ground-truth region and aggregated (see the sketch below).
- Reasoning Correctness: Binary scoring via DeepSeek-v3.1 LLM, rewarding semantically correct answers irrespective of exact string match.
This dual-metric paradigm allows uncoupling of answer quality and localization fidelity, supporting nuanced error analysis and progress tracking.
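A minimal sketch of the IoU metric and one plausible multi-box aggregation (best-matching prediction per ground-truth region, then the mean) follows. The aggregation rule is an assumption, since the benchmark only states that per-region IoUs are aggregated.

```python
# IoU for axis-aligned boxes plus an assumed multi-box aggregation
# (mean over ground-truth regions of the best-matching prediction).
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max) in pixels."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def multi_box_iou(predicted, ground_truth):
    """For each ground-truth region, take the best-matching prediction, then average."""
    if not ground_truth:
        return 0.0
    per_gt = [max((box_iou(p, g) for p in predicted), default=0.0) for g in ground_truth]
    return sum(per_gt) / len(per_gt)
```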
4. Model Benchmarking and Quantitative Results
BBox DocVQA benchmarks major VLMs—Qwen2.5-VL (3B–72B), Qwen3VL, InternVL3, GPT-5—across both spatial grounding and reasoning tasks. Core results (on the 1,623-sample benchmark):
Spatial Grounding (Good ratio % and mean IoU %, overall and per format)
| Model | Good % | Mean IoU % | SPSBB IoU % | SPMBB IoU % | MPMBB IoU % |
|---|---|---|---|---|---|
| Qwen2.5-3B | 51.4 | 4.7 | 3.8 | 5.6 | 4.9 |
| Qwen2.5-72B | 99.0 | 35.2 | 40.1 | 33.2 | 27.2 |
| Qwen3VL-32B | 94.5 | 20.4 | 22.6 | 17.3 | 21.0 |
| InternVL3-2B | 80.9 | 0.1 | 0.0 | 0.2 | 0.1 |
| GPT-5 | 99.9 | 0.9 | 0.1 | 1.6 | 1.2 |
Unified Answering Accuracy (% correct, pages + bbox → answer)
| Model | Mean Acc. | SPSBB | SPMBB | MPMBB |
|---|---|---|---|---|
| Qwen2.5-3B | 30.9 | 31.4 | 33.3 | 25.5 |
| Qwen2.5-72B | 68.6 | 71.0 | 71.6 | 57.9 |
| Qwen3VL-32B | 77.1 | 81.0 | 84.4 | 55.4 |
| InternVL3-8B | 50.5 | 53.1 | 53.1 | 39.6 |
| GPT-5 | 81.5 | 82.6 | 83.6 | 74.8 |
Providing ground-truth regions yields up to +25 pp accuracy gains for smaller models, establishing the impact of precise evidence cropping on reasoning.
5. Error Analysis, Architectural Insights, and Future Directions
Persistent challenges highlighted by benchmarking include:
- Spatial Grounding Deficits: State-of-the-art VLMs achieve <40% mean IoU; many models, including GPT-5, exhibit coordinate insensitivity, likely due to internal resizing and lack of explicit spatial supervision.
- Reasoning vs. Localization Gap: High answer accuracy from full-page context often occurs without correct evidence grounding, indicating a disconnect between reasoning and visual alignment.
- Error Sources: Misgrounding amplifies in multi-column layouts, overlapping regions, and multi-page composition (MPMBB), with models frequently confusing dense layouts or failing to associate answer regions across pages.
- Suggested Solutions: Two-stage pipelines (region proposal followed by reasoning; sketched below), supervised grounding modules trained on BBox DocVQA, prompt engineering for coordinate consistency, and explainable multi-step reasoning (“where” and “why” tracing) are proposed directions.
A plausible implication is that substantial structural innovations—such as modular agentic frameworks, region-aware tokenization, or direct spatial alignment supervision—may be required to bridge the reasoning-localization gap.
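As an illustration of the two-stage direction mentioned above, the following outlines a region-proposal-then-reasoning loop. Here propose_regions and vlm_answer are hypothetical stand-ins for a grounding module and a VLM call; they are not components released with BBox DocVQA.

```python
# Illustrative-only two-stage "propose regions, then reason" loop; the callables
# passed in are hypothetical placeholders, not part of the BBox DocVQA release.
def two_stage_docvqa(pages, question, propose_regions, vlm_answer):
    """Stage 1: predict evidence boxes per page. Stage 2: answer from crops only."""
    evidence = []
    for page_idx, page_image in enumerate(pages):
        for box in propose_regions(page_image, question):      # (x1, y1, x2, y2)
            x1, y1, x2, y2 = box
            evidence.append({
                "page": page_idx,
                "bbox": box,
                "crop": page_image.crop((x1, y1, x2, y2)),      # PIL-style crop
            })
    answer = vlm_answer(question, [e["crop"] for e in evidence])
    return answer, [(e["page"], e["bbox"]) for e in evidence]   # answer + provenance
```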
6. Impact and Relation to Broader DocVQA Landscape
BBox DocVQA sets a new reference standard for interpretable DocVQA with explicit spatial grounding, enabling both granular evaluation and robust training for evidence localization. The dataset’s scale and verification rigor facilitate benchmarking and development of grounding-aware architectures. By uncovering persistent deficiencies in model localization, BBox DocVQA motivates new research in agentic reasoning (e.g., ARIAL (Mohammadshirazi et al., 22 Nov 2025)), spatially grounded explanations (EaGERS (Lagos et al., 15 Jul 2025)), unified spatial QA frameworks (BoundingDocs (Giovannini et al., 6 Jan 2025)), and adversarial robustness in OCR-driven document AI (Tien et al., 19 Jun 2025).
By making all data and code public, BBox DocVQA is positioned to catalyze progress in fine-grained, trustworthy multimodal document understanding.