Drawing-Grounded Document QA

Updated 31 March 2026

Drawing-grounded Document QA is a multimodal approach that integrates spatial and semantic reasoning to extract answers from diagrams, schematics, and technical drawings.
State-of-the-art systems combine vision-language models, region extraction, and modular indexing to localize responses with interpretable bounding box evidence.
Empirical evaluations on benchmarks like BoundingDocs and BBox DocVQA reveal high text accuracy yet underscore spatial grounding as a critical challenge for future advancements.

Drawing-grounded Document Question Answering (QA) refers to machine question answering on documents where the information required to answer a query can be derived only by direct visual inspection of figures, schematics, or drawings, which often requires spatially grounding the answer in the image. Unlike standard document QA, the drawing-grounded setting emphasizes localizing evidence in non-textual modalities, integrating spatial and semantic understanding, and providing interpretable (often bounding box-level) justification. This paradigm has emerged due to the prevalence of visually dense engineering documents, multi-modal forms, and scientific materials where facts are encoded in diagrams, schematics, or tables rather than running text. State-of-the-art pipelines combine vision-language models (VLMs), specialized indexing/retrieval, and multi-granular region grounding techniques, with benchmarks now systematically annotated for answer localization.

1. Task Formulation and Dataset Landscape

Drawing-grounded Document QA tasks extend standard document VQA by requiring models to identify not only the textual answer to a question about a document image $I$, but also its spatial location, typically as bounding boxes, within $I$ (Giovannini et al., 6 Jan 2025, Yu et al., 19 Nov 2025). A typical annotation thus comprises the question $q$, the answer $v$, and $b^*$, a set of bounding boxes corresponding to the evidence region(s). The complexity arises when questions reference device numbers, circuit paths, chart legends, or features in engineering drawings not accessible to plain OCR or text-only processing.
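As a minimal sketch of such an annotation, the record below holds $q$, $v$, and $b^*$ together. The class name `GroundedQA`, the `page` field, the helper `to_pixels`, and the choice of normalized `(x_min, y_min, x_max, y_max)` coordinates are illustrative assumptions and do not reproduce the schema of any specific benchmark.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max), each normalized to [0, 1]
# relative to the page width and height.
BBox = Tuple[float, float, float, float]


@dataclass
class GroundedQA:
    """One grounded QA annotation: question q, answer v, evidence boxes b*."""
    question: str
    answer: str
    boxes: List[BBox] = field(default_factory=list)
    page: int = 0  # hypothetical field for multi-page documents

    def validate(self) -> None:
        """Raise ValueError if any box is degenerate or outside the page."""
        for x0, y0, x1, y1 in self.boxes:
            if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
                raise ValueError(f"invalid normalized box: {(x0, y0, x1, y1)}")


def to_pixels(box: BBox, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized box to integer pixel coordinates."""
    x0, y0, x1, y1 = box
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


sample = GroundedQA(
    question="What is the rating of fuse F3?",
    answer="10 A",
    boxes=[(0.25, 0.5, 0.5, 0.75)],
)
sample.validate()
pixel_box = to_pixels(sample.boxes[0], width=1000, height=800)
```

Storing normalized coordinates keeps the annotation independent of the rendering resolution, at the cost of one conversion step whenever a crop is taken from the page image.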

Four major dataset initiatives underpin evaluation and model development:

BoundingDocs: Unifies 10 public Document-AI datasets under a QA-style paradigm with token-level OCR and normalized bounding-box localization, yielding 48,151 documents and 249,016 QA pairs annotated for both textual and spatial grounding (Giovannini et al., 6 Jan 2025).
JDocQA: Targets Japanese documents with 11,600 QA pairs drawn from pamphlets, reports, and slides, requiring both textual and visual reasoning; it is notable for its explicit bounding-box clues and its inclusion of unanswerable questions (Onami et al., 2024).
BBox DocVQA: Focuses on fine-grained, region-grounded QA for scientific and technical PDFs, encompassing multi-box, multi-page scenarios with 30,780 training and 1,623 benchmark QA pairs (Yu et al., 19 Nov 2025).
CircularsVQA/DrishtiKon: Supports benchmarks with multi-granular (block, line, word, point) ground-truth for interpretability in text-and-drawing-rich government circulars (Kasuba et al., 26 Jun 2025).

Each adopts annotation formats designed to link answer strings to pixel/coordinate evidence, essential for spatial reasoning and hallucination mitigation.

2. Model Architectures and Decomposition Strategies

Early approaches deployed monolithic VLMs to embed entire pages, with answer extraction either by text synthesis or off-the-shelf spatial decoders. Recent systems advance in two main directions:

Demand-Side Ingestion and Modularization: The Deferred Visual Ingestion (DVI) framework (Xu, 15 Feb 2026) forgoes expensive supply-side (pre-ingestion) VLM runs by indexing only basic metadata (drawing type, section, BOM, topology) and performing VLM-based image reasoning only on the pages a user query demands. A classifier first routes queries by type and required modality (text vs. drawing); page localization then precedes submission of the image and query to a VLM. This approach minimizes up-front cost, scales to massive engineering document sets, and supports interactive refinement and caching.
Decoupling Answer Generation and Grounding: DocExplainerV0 introduces a plug-and-play module that, given a black-box VLM answer, predicts bounding boxes from joint vision-text embeddings (Chen et al., 12 Sep 2025). This decouples semantic QA from localization, enabling retrofitting of grounding capability to proprietary closed-weight VLMs. The bounding box predictor is trained with a Smooth L1 loss on normalized coordinates, substantially improving mean IoU while preserving answer accuracy.
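To make the Smooth L1 objective mentioned above concrete, the following is a minimal pure-Python sketch of the standard Smooth L1 (Huber-style) loss averaged over the four normalized box coordinates. The function name `smooth_l1` and the transition point `beta=1.0` are assumptions; the source does not state which value the bounding box predictor uses.

```python
from typing import Sequence


def smooth_l1(pred: Sequence[float], target: Sequence[float],
              beta: float = 1.0) -> float:
    """Mean Smooth L1 loss between two coordinate vectors.

    Quadratic for absolute errors below `beta`, linear above it, which
    makes it less sensitive to outlier boxes than a plain squared error.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)


# Normalized (x_min, y_min, x_max, y_max) boxes; values are illustrative.
predicted = [0.20, 0.30, 0.55, 0.62]
ground_truth = [0.25, 0.30, 0.50, 0.60]
loss = smooth_l1(predicted, ground_truth)
```

Because normalized coordinates lie in [0, 1], every coordinate error is below `beta=1.0`, so with this assumed setting the loss stays in its quadratic regime; a smaller `beta` would make large misses contribute linearly.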

Methods further benefit from multi-granular region matching (e.g., DrishtiKon’s block-to-point hierarchy), supervised contrastive learning for answer-region alignment, and spatially-aware prompting, all tailored to the format of the underlying document (engineering drawing, chart, or form).

3. Visual Grounding Mechanisms

Precise linking of answer text to document regions distinguishes drawing-grounded QA from text-only VQA.

Region Extraction: Pipelines employ high-precision segmenters (e.g., Segment Anything Model), robust multilingual OCR (e.g., DocTR + Surya-OCR), and bounding-box post-processing, yielding candidate regions from which semantic filters select logical units (figure, table, text block) (Yu et al., 19 Nov 2025, Kasuba et al., 26 Jun 2025).
Region Matching Algorithms: Drawing-grounded QA increasingly employs composite matching scores integrating:
- Fuzzy string overlap
- Token-wise intersection
- Length normalization
- Contextual similarity using embedding cosine similarity
- Penalties for size and irrelevance
Top-K candidates are sorted by these composite scores at multiple granularities (block, line, word, point), maximizing alignment with answer text (Kasuba et al., 26 Jun 2025).
Prompt Engineering: Injection of spatial coordinates (token-wise bboxes), structured OCR tuples, or region-restricted crops into model prompts guides generative models toward faithful and localized predictions, with automatic regex post-processing to enhance output robustness (Giovannini et al., 6 Jan 2025).
Human-in-the-Loop Verification: Critical benchmarks (e.g., BBox DocVQA) employ manual verification of region assignments, ensuring the spatial evidence aligns strictly with QA content (Yu et al., 19 Nov 2025).
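The composite region-matching idea described above can be sketched as follows. This is an illustrative simplification, not the scoring function of any cited system: the weights, the function names `composite_score` and `top_k`, and the use of `difflib.SequenceMatcher` for fuzzy overlap are assumptions, and the embedding-based contextual similarity term is omitted to keep the example dependency-free.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def composite_score(answer: str, region_text: str, region_area: float,
                    weights: Tuple[float, float, float] = (0.5, 0.4, 0.1)) -> float:
    """Score how well an OCR region matches an answer string.

    region_area is the region's share of the page area, in [0, 1];
    larger regions are penalized because they localize less precisely.
    """
    a, r = answer.lower().strip(), region_text.lower().strip()
    if not a or not r:
        return 0.0
    # Fuzzy string overlap, normalized by the combined length of both strings.
    fuzzy = SequenceMatcher(None, a, r).ratio()
    # Token-wise intersection, normalized by the number of answer tokens.
    a_tokens, r_tokens = set(a.split()), set(r.split())
    token_overlap = len(a_tokens & r_tokens) / len(a_tokens)
    w_fuzzy, w_token, w_size = weights  # assumed weights, not tuned
    return w_fuzzy * fuzzy + w_token * token_overlap - w_size * region_area


def top_k(answer: str, regions: List[Tuple[str, float]],
          k: int = 3) -> List[Tuple[float, str]]:
    """Return the k highest-scoring (score, text) candidates."""
    scored = [(composite_score(answer, text, area), text)
              for text, area in regions]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Candidate regions as (OCR text, area fraction); values are illustrative.
candidates = [
    ("Rated voltage: 24 V DC", 0.02),
    ("Drawing title block revision history", 0.05),
    ("24 V DC", 0.01),
]
ranked = top_k("24 V DC", candidates, k=2)
```

Running the same scoring at block, line, and word level would give one ranked list per granularity, which is one way to realize the multi-granular selection described above.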

4. Evaluation Metrics and Groundedness

Beyond classic Exact Match or BLEU metrics on answer text, spatial grounding demands evaluation along multimodal axes:

Intersection over Union (IoU): Used to compare predicted bounding boxes $b_{\text{pred}}$ with ground-truth $b^*$: $\text{IoU}(b_{\text{pred}}, b^*) = \frac{|b_{\text{pred}} \cap b^*|}{|b_{\text{pred}} \cup b^*|}$, with thresholds (e.g., IoU $\geq 0.5$) for statistical reporting (Onami et al., 2024, Yu et al., 19 Nov 2025).
SMuDGE: A composite metric weighting both semantic (type-aware normalized string similarity) and spatial (normalized centroid distance with exponential decay) groundedness, parameterized by $\alpha$ for user trade-off. SMuDGE robustly penalizes ungrounded hallucinations, type mismatches, and spatial errors, yielding scores aligned with human judgment and model calibration (Nourbakhsh et al., 24 Mar 2025).
Precision, Recall, F1: Applied at each granularity (block, line, word), calculated as the proportion of predicted regions matching ground-truth at a given IoU, supporting ablations of localization and multi-region handling (Kasuba et al., 26 Jun 2025).
LLM-based Answer Correctness: For ambiguous textual answers, LLM judges (e.g., DeepSeek-v3.1) assess semantic correctness rather than strict string match (Yu et al., 19 Nov 2025).
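The IoU formula and the thresholded precision, recall, and F1 listed above can be computed as in the sketch below. The greedy one-to-one matching between predicted and ground-truth boxes is an assumption made for illustration; individual benchmarks may define matching differently.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(pred: List[Box], gt: List[Box],
                        threshold: float = 0.5) -> Tuple[float, float, float]:
    """Greedy one-to-one matching: a prediction is a true positive if its
    best still-unmatched ground-truth box reaches the IoU threshold."""
    unmatched = list(gt)
    true_pos = 0
    for p in pred:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= threshold:
            true_pos += 1
            unmatched.remove(best)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gt) if gt else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
metrics = precision_recall_f1(
    pred=[(0, 0, 1, 1), (5, 5, 6, 6)],
    gt=[(0, 0, 1, 1)],
)
```

In the usage example, one of the two predictions matches the single ground-truth box, so recall is perfect while precision is halved by the spurious box.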

Evaluation thus discourages "hallucinated" answers unguided by evidence, emphasizing models' capacity for both semantic fidelity and visual traceability.
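To illustrate how a single score can reward both semantic fidelity and visual traceability, the sketch below blends a string-similarity term with an exponentially decaying centroid-distance term. It is only loosely inspired by the description of SMuDGE above and is not the published definition: the function name `blended_groundedness`, the direction of the `alpha` weighting, the `decay` constant, and the use of `difflib` instead of a type-aware similarity are all assumptions.

```python
import math
from difflib import SequenceMatcher
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # normalized (x_min, y_min, x_max, y_max)


def centroid(box: Box) -> Tuple[float, float]:
    """Center point of an axis-aligned box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def blended_groundedness(pred_text: str, gt_text: str,
                         pred_box: Optional[Box], gt_box: Box,
                         alpha: float = 0.5, decay: float = 5.0) -> float:
    """Blend semantic and spatial agreement into one score in [0, 1].

    alpha weights the semantic term (assumed direction); a missing
    predicted box receives a spatial score of 0, so an answer with no
    evidence region is penalized even when its text is correct.
    """
    semantic = SequenceMatcher(None, pred_text.lower(), gt_text.lower()).ratio()
    if pred_box is None:
        spatial = 0.0
    else:
        (px, py), (gx, gy) = centroid(pred_box), centroid(gt_box)
        # Normalize by the diagonal of the unit page so distance is in [0, 1].
        distance = math.hypot(px - gx, py - gy) / math.sqrt(2)
        spatial = math.exp(-decay * distance)
    return alpha * semantic + (1 - alpha) * spatial


truth_box = (0.2, 0.2, 0.4, 0.3)
grounded = blended_groundedness("10 A", "10 A", truth_box, truth_box)
ungrounded = blended_groundedness("10 A", "10 A", None, truth_box)
far_away = blended_groundedness("10 A", "10 A", (0.7, 0.8, 0.9, 0.9), truth_box)
```

With these assumed settings, a textually correct answer scores highest when its box coincides with the ground truth, lower when the box is far away, and lowest when no box is given at all.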

5. Empirical Performance and Systematic Findings

Quantitative results across benchmarks confirm that these challenges persist and indicate where improvement is most needed:

DVI Framework: Achieves comparable overall accuracy to supply-side approaches (46.7% vs. 48.9%), with 50% effectiveness for visually necessary queries (vs. 0% for pre-ingest), 100% page localization, and order-of-magnitude lower VLM consumption (Xu, 15 Feb 2026).
BoundingDocs: Adding bounding box metadata yields the highest ANLS* (91.6); spatially-aware prompting and answer extraction enhance robustness and discourage model drift (Giovannini et al., 6 Jan 2025).
BBox DocVQA: State-of-the-art models such as Qwen2.5VL-72B reach 68.6% answer accuracy and 35.2% mean IoU in zero-shot; perfect bbox crop injection lifts answer rates by 10–18.5 percentage points, exposing a critical grounding-reasoning gap. Multi-box/multi-page questions degrade both metrics, underlining key areas for future improvement (Yu et al., 19 Nov 2025).
DrishtiKon: Line-level granularity achieves the best F1 (69.10%); ablation studies show the best results when two blocks or five lines are used to recover all relevant visual evidence without introducing excess noise (Kasuba et al., 26 Jun 2025).

Table: Selected Drawing-Grounded QA Benchmark Results

| System | Task/Setting | Answer Acc. | Grounding (IoU/Recall/F1) | Notes |
|---|---|---|---|---|
| DVI (Xu, 15 Feb 2026) | Engineering Drawings | 46.7% | 100% page localization | Zero VLM at ingest; 50% visual queries |
| BBox DocVQA (Yu et al., 19 Nov 2025) | Scientific PDFs (GT crop) | 81.5–99.9% | 0.9–35.2% mean IoU | 10–18.5pp gain from GT crop vs. whole page |
| DrishtiKon (Kasuba et al., 26 Jun 2025) | CircularsVQA (line lvl) | – | 73.68% recall, F1 69.10% | Line-level best trade-off |

These findings indicate that the principal performance bottleneck is spatial grounding rather than language generation.

6. System Design Patterns and Trade-offs

Drawing-grounded QA system design exposes several architectural and methodological trade-offs:

Indexing Cost vs. On-Demand Reasoning: Pre-ingestion VLM pipelines incur high up-front compute and possible completeness/retrievability failures if facts are missed at indexing. Deferred Ingestion systems avoid such costs, focusing on post-hoc localization and targeted visual reasoning, thus reducing wastage and enabling interactive refinement (Xu, 15 Feb 2026).
Prompt and Input Engineering: Injecting precise region cues or structured OCR into prompt templates regularizes attention and reduces answer drift, but may expose the system to parsing errors or overfitting to specific input formats (Giovannini et al., 6 Jan 2025).
Granularity of Evidence: Multi-granular matching (block, line, word, point) offers a precision-recall trade-off; fine-level precision can fall off due to OCR or token misalignment, while coarser-grained approaches aid recall for multi-part evidentiary answers (Kasuba et al., 26 Jun 2025).
Plug-and-Play vs. End-to-End: Modular grounding modules (e.g., DocExplainerV0) can be integrated post hoc into black-box VLMs, preserving answer quality while enhancing traceability; however, fully end-to-end joint training could, in principle, close the performance gap at the expense of generality and system flexibility (Chen et al., 12 Sep 2025).

Caching, progressive refinement, and strategic query routing further amortize cost and improve user-interactivity.
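The interplay of query routing, page localization, on-demand VLM calls, and caching can be sketched as below. This is a toy illustration under stated assumptions, not the DVI implementation: the keyword list `VISUAL_HINTS`, the class `DeferredPipeline`, the word-overlap page locator, and the `vlm` callable (standing in for any vision-language model client) are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical keyword list; a real router would use a trained classifier.
VISUAL_HINTS = ("connected", "wired", "terminal", "diagram", "shown", "symbol")


def route(query: str) -> str:
    """Route a query to the 'drawing' or 'text' modality."""
    q = query.lower()
    return "drawing" if any(hint in q for hint in VISUAL_HINTS) else "text"


class DeferredPipeline:
    """Index only lightweight metadata; call the VLM on demand and cache."""

    def __init__(self, index: Dict[int, str],
                 vlm: Callable[[int, str], str]) -> None:
        self.index = index          # page number -> metadata text
        self.vlm = vlm              # stand-in for a VLM client (assumption)
        self.vlm_calls = 0
        self._cache: Dict[Tuple[int, str], str] = {}

    def locate_page(self, query: str) -> int:
        """Pick the page whose metadata shares the most words with the query."""
        words = set(query.lower().split())
        return max(self.index,
                   key=lambda p: len(words & set(self.index[p].lower().split())))

    def answer(self, query: str) -> str:
        page = self.locate_page(query)
        if route(query) == "text":
            return self.index[page]  # answered from metadata, no VLM call
        key = (page, query)
        if key not in self._cache:   # only pay for the VLM once per (page, query)
            self.vlm_calls += 1
            self._cache[key] = self.vlm(page, query)
        return self._cache[key]


def fake_vlm(page: int, query: str) -> str:
    """Placeholder for a real VLM call on the page image."""
    return f"vlm-answer-p{page}"


pipeline = DeferredPipeline(
    index={1: "power distribution single line diagram",
           2: "bill of materials fuse list"},
    vlm=fake_vlm,
)
first = pipeline.answer("which terminal is wired to the power distribution bus")
second = pipeline.answer("which terminal is wired to the power distribution bus")
text_only = pipeline.answer("list the bill of materials")
```

Only the first visual query triggers the stand-in VLM; the repeated query is served from the cache and the text query from the metadata index, which is the cost-amortization pattern the paragraph above describes.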

7. Future Directions and Open Challenges

The field continues to evolve across several axes:

Layout- and Region-Aware Transformers: Integration of layout modeling (e.g., LayoutLMv3) and region-selective attention may further improve multi-modal fusion and region specificity (Onami et al., 2024).
Automated and Scalable Benchmarking: Automated pipelines (e.g., Segment→Judge→Generate) now generate large-scale, region-annotated QA datasets but must maintain consistency, coverage, and human-in-the-loop verification for critical ground-truth (Yu et al., 19 Nov 2025).
Evaluation Standardization: Broader adoption of groundedness metrics (e.g., SMuDGE) to penalize ungrounded or hallucinated responses will better reflect reasoning ability, calibration, and real-world usability (Nourbakhsh et al., 24 Mar 2025).
Multi-Page and Multi-Region Reasoning: Open problems include robust cross-page linking and aggregation of multi-box evidence, with accuracy and grounding degrading in these scenarios (Yu et al., 19 Nov 2025).
Multilinguality and Document Diversity: Multilingual OCR, diverse layout processing, and adaptation to various scientific, administrative, and industrial document types remain key frontiers (Onami et al., 2024, Kasuba et al., 26 Jun 2025).

A plausible implication is that drawing-grounded Document QA will increasingly require hybrid retrieval-generation architectures, extensive multi-granular annotation, and unified sense-making across both spatial and semantic modalities to approach human-level interpretability and reliability.