
Drawing-Grounded Document QA

Updated 15 January 2026
  • Drawing-grounded Document QA is a framework that links text and graphics by explicitly grounding answers in spatial regions, enhancing interpretability.
  • It employs multimodal architectures—from OCR-first LLMs to neuro-symbolic hybrids—using bounding boxes and chain-of-thought strategies for precise localization.
  • Research addresses challenges in symbol ambiguity and diagrammatic reasoning through advanced datasets, spatial metrics like IoU, and validator-distilled feedback.

Drawing-grounded document question answering (QA) is the task of extracting and localizing answers from documents that contain both textual and graphical elements, where evidence for the answer is visually grounded—often as bounding boxes or regions within document images, diagrams, charts, or engineering drawings. This paradigm mandates explicit spatial linkage between an answer and the underlying visual context, supporting both answer interpretability and rigorous spatial reasoning essential for document intelligence systems and visually rich document understanding (VRDU). Current research spans dataset curation, model architectures, spatial annotation schemas, and evaluation protocols, with ongoing challenges in symbol-centric and diagrammatic reasoning.

1. Datasets and Spatial Annotation Frameworks

Drawing-grounded document QA requires large-scale, spatially annotated benchmarks. Notable resources include:

  • BoundingDocs (Giovannini et al., 6 Jan 2025): Integrates ten public collections, encompassing 48,151 documents and 237,437 pages with 249,016 QA pairs. Each answer is linked to word-level bounding boxes normalized to a 0–1000 grid. Example annotation:

{
  "value": "$576,405.00",
  "location": [ [90, 11, 364, 768] ],
  "page": 7
}
Normalization is via

x_\text{norm} = \lfloor x_\text{raw} \cdot 1000 / W_\text{raw}\rfloor

(a minimal sketch of this mapping appears after this list).

  • BBox-DocVQA (Yu et al., 19 Nov 2025): Uses a “Segment–Judge–Generate” pipeline (SAM segmentation, VLM-based semantic filtering, GPT-5 for QA generation, human verification), yielding 3,751 documents and 32k QA pairs, each with bounding boxes over text, tables, or images.
  • RefChartQA (Vogel et al., 29 Mar 2025): Provides 73,702 chart QA pairs for grounding visual answers, with each box formatted as <box> x_min y_min x_max y_max </box> and annotated for chart-level, series-level, and data-point-level locations.
  • DrishtiKon (Kasuba et al., 26 Jun 2025): Curates fine-grained, multi-granular ground-truth across block, line, word, and point levels using domain-specific OCR and hierarchical region matching.
  • AECV-Bench (Kondratenko et al., 8 Jan 2026): Compiles 192 QA pairs across floor plans and architectural drawings, each QA linked to explicit evidence regions.
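
A minimal Python sketch of the 0–1000 grid normalization used in the BoundingDocs annotations above (the helper name and the (x0, y0, x1, y1) coordinate convention are assumptions, not the dataset's official tooling):

def normalize_bbox(box, page_width, page_height):
    # Map raw pixel coordinates (x0, y0, x1, y1) onto the 0-1000 grid:
    # floor(coord * 1000 / page_dimension), per the formula above.
    x0, y0, x1, y1 = box
    return [x0 * 1000 // page_width,
            y0 * 1000 // page_height,
            x1 * 1000 // page_width,
            y1 * 1000 // page_height]

# Example: a word box on a 1700 x 2200 px page.
print(normalize_bbox((150, 200, 450, 260), 1700, 2200))  # [88, 90, 264, 118]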

Spatial grounding is evaluated using intersection-over-union (IoU) metrics:

\mathrm{IoU}(B_\text{pred}, B_\text{gt}) = \frac{|B_\text{pred}\cap B_\text{gt}|}{|B_\text{pred}\cup B_\text{gt}|}
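
A plain-Python sketch of this box IoU, assuming axis-aligned (x0, y0, x1, y1) coordinates (the helper name is an assumption):

def box_iou(pred, gt):
    # Intersection of two axis-aligned boxes (x0, y0, x1, y1).
    ix0, iy0 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix1, iy1 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    # Union = sum of areas minus the intersection.
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0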

Granularity selection varies with context, with line-level grounding optimal for text-rich images (Kasuba et al., 26 Jun 2025), and multi-bbox annotation used for charts, diagrams, and multi-region answers (Yu et al., 19 Nov 2025, Vogel et al., 29 Mar 2025).

2. Model Architectures and Spatial Reasoning Workflows

The model landscape for drawing-grounded QA includes both unimodal text-based and multimodal vision-language approaches:

  • OCR-first LLMs (Giovannini et al., 6 Jan 2025): Employ text-only input or bounding-box markup (“token|bbox”), recasting all information extraction (IE) tasks as QA. Fine-tuning and prompting with explicit spatial tokens yield improvements in ANLS* and reduce error rates; best practices include a regex fallback for JSON parsing.
  • Multimodal Retrieval-Augmented Generation (RAG) (Suri et al., 2024): VisDoMRAG leverages parallel textual and visual retrieval branches, fusing evidence with consistency-constrained modality fusion:

\mathcal{L}_{cons} = \sum_{i=1}^N \| h^t_i - h^v_i\|^2

Chain-of-thought (CoT) reasoning and evidence curation are applied across modalities; a minimal sketch of the consistency term follows this list.

  • Instruction-Tuned VLMs (Vogel et al., 29 Mar 2025): Vision-language encoder-decoders, such as TinyChart for chart QA, merge spatial tokens and text in a unified autoregressive sequence, leveraging token-merging modules to model visual correlation among chart elements.
  • Validator-Distilled Models (Mohammadshirazi et al., 27 Nov 2025): DocVAL introduces teacher-student distillation with a multi-module validator (VAL) enforcing answer correctness, bounding box consistency, and reasoning trace validity. Student VLMs learn from validated CoT traces and iterative VAL feedback.
  • Symbolic and Neuro-Symbolic Hybrid Systems (Kondratenko et al., 8 Jan 2026): For AEC drawings, hybrid encoders supplement raster with vector, and neuro-symbolic parsing decomposes line-art into explicit graphs for robust symbol reasoning and instance counting.
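
A minimal PyTorch sketch of the consistency term from the VisDoMRAG item above (tensor names, shapes, and the reduction are assumptions, not the paper's exact implementation):

import torch

def consistency_loss(h_text, h_vis):
    # h_text, h_vis: (N, d) paired textual and visual evidence embeddings.
    # Penalizes disagreement between the two modality branches.
    return ((h_text - h_vis) ** 2).sum(dim=-1).sum()

# Example: 4 evidence items, 256-dimensional embeddings.
loss = consistency_loss(torch.randn(4, 256), torch.randn(4, 256))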

Below is a summary table of representative architectures:

Approach                | Modality / Granularity              | Grounding Method
BoundingDocs LLM        | OCR, word/line, 0–1000 grid         | Per-token bbox markup
VisDoMRAG (RAG)         | Text + figures, page/figure level   | Late fusion, consistency CoT
TinyChart (RefChartQA)  | Vision-language, chart level        | Token merging + seq2seq bbox
DocVAL                  | Vision-language, pixel/region       | Validator-filtered CoT distillation
DrishtiKon              | Text + OCR, multi-granular          | Region matching (block-line-word)
AECV-Bench              | Hybrid CAD/images, evidence region  | Raster/vector + judge adjudication

3. Task Categories and Evaluation Metrics

Drawing-grounded document QA spans multiple query types:

  • Text Extraction (OCR): Locate and reproduce textual fields (high accuracy, e.g. 0.92–0.95; Kondratenko et al., 8 Jan 2026).
  • Instance Counting: Enumerate symbols (doors, windows, chart elements); remains challenging (e.g. 0.40–0.62 accuracy on doors/windows).
  • Spatial Reasoning: Infer relationships from geometry or layout.
  • Comparative Reasoning: Select elements or regions by comparative assessment (size, count, adjacency).
  • Evidence Localization: Explicitly link answers to regions (bounding boxes, points, chart elements).

Metrics include:

  • Exact Match (EM): Binary string equality.
  • F1_token: Overlap of token sets.
  • Intersection-over-Union (IoU): For bounding boxes.
  • ANLS* (Average Normalized Levenshtein Similarity), for which a short implementation sketch follows this list:

\mathrm{ANLS}(a_p, a_{gt}) = \max\bigl(0, 1 - \frac{\mathrm{Lev}(a_p,a_{gt})}{\max\{|a_{gt}|,|a_p|\}}\bigr)

  • Average Precision (AP@0.5, i.e., at an IoU threshold of 0.5): For spatial bounding-box prediction.
  • MAPE (Mean Absolute Percentage Error): For counting tasks.
  • BLEU-2, ROUGE-L, METEOR, BERTScore: For free-form explanations and semantic overlap.
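
A small Python sketch of the ANLS* similarity for a single answer pair, following the formula above (aggregation over multiple references and any score threshold are omitted):

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(pred, gold):
    # max(0, 1 - Lev(pred, gold) / max(|pred|, |gold|))
    if not pred and not gold:
        return 1.0
    return max(0.0, 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold)))

print(anls("$576,405.00", "$576,405.00"))  # 1.0
print(anls("$576,405", "$576,405.00"))     # partial credit (~0.73)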

Evaluation protocols frequently employ LLM-as-a-judge scoring and human adjudication for ambiguous or borderline cases (Kondratenko et al., 8 Jan 2026), reporting per-category and overall accuracy.

4. Prompting, Instruction Tuning, and Reasoning Strategies

Effective grounding benefits from tailored prompts and reasoning chains:

  • Bounding Box-Aware Prompting: Token-level markup (“word|bbox”) as input dramatically enhances grounding fidelity (Giovannini et al., 6 Jan 2025); a construction sketch follows this list. Example:
    [Invoice|[12,34,120,45]] [Date|[122,34,200,45]] : [30-Oct-1998|[210,34,320,45]] …
  • Instruction Tuning for Charts/Drawings: Prompts interleave directions for grounding and answer generation, e.g.,
    <image>\n Please locate chart elements that support your answer.
      Q: {question}
      A: <box> … </box> | {answer}
  • Chain-of-Thought (CoT): Stepwise reasoning is enforced across modalities, with validated reasoning traces improving both accuracy and spatial consistency (Suri et al., 2024, Mohammadshirazi et al., 27 Nov 2025).
  • Multi-Granular Matching: Matching starts at block-level and successively refines to line, word, and point, optimizing for highest F1 (Kasuba et al., 26 Jun 2025).
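
A minimal sketch of assembling a bounding-box-aware prompt in the “word|bbox” style shown above (the OCR token format, helper names, and the instruction wording are assumptions):

def format_ocr_tokens(tokens):
    # tokens: list of (word, (x0, y0, x1, y1)) pairs on the 0-1000 grid.
    return " ".join(
        f"[{word}|[{x0},{y0},{x1},{y1}]]" for word, (x0, y0, x1, y1) in tokens
    )

def build_prompt(tokens, question):
    # Interleave the spatial markup with the question, QA-style.
    return (
        "Document:\n" + format_ocr_tokens(tokens)
        + f"\n\nQuestion: {question}\nAnswer as JSON with value, location and page."
    )

tokens = [("Invoice", (12, 34, 120, 45)), ("Date", (122, 34, 200, 45)),
          ("30-Oct-1998", (210, 34, 320, 45))]
print(build_prompt(tokens, "What is the invoice date?"))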

Instruction-tuned models balance QA fluency and spatial accuracy with dual-objective losses:

\mathcal{L}(\theta) = \lambda_{qa}\,\mathcal{L}_{qa} + \lambda_{vis}\,\mathcal{L}_{vis}
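
A minimal PyTorch sketch of this dual-objective loss (the specific grounding term, here an L1 regression on normalized box coordinates, is a stand-in; actual instruction-tuned models differ):

import torch
import torch.nn.functional as F

def dual_objective_loss(answer_logits, answer_targets, pred_boxes, gt_boxes,
                        lambda_qa=1.0, lambda_vis=1.0):
    # Answer term: token-level cross-entropy over the vocabulary.
    l_qa = F.cross_entropy(answer_logits, answer_targets)
    # Grounding term: regression on normalized bounding-box coordinates.
    l_vis = F.l1_loss(pred_boxes, gt_boxes)
    return lambda_qa * l_qa + lambda_vis * l_vis

# Example with placeholder shapes: 8 answer tokens over a 32k vocab, one box.
loss = dual_objective_loss(torch.randn(8, 32000), torch.randint(0, 32000, (8,)),
                           torch.rand(1, 4), torch.rand(1, 4))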

5. Comparative Benchmarking, Failure Modes, and Analysis

Recent evaluations report:

  • Spatial grounding remains a bottleneck: Models routinely achieve high answer accuracy (up to 81.5% for GPT-5) but low IoU for region localization (≤35%) (Yu et al., 19 Nov 2025). Multi-region and multi-page scenarios are especially challenging.
  • OCR-centric tasks are “solved”; symbol-centric and diagrammatic reasoning are “unsolved”: Instance counting for complex symbols (doors, windows, chart bars) yields low accuracy (0.18–0.62), especially when no textual fallback exists (Kondratenko et al., 8 Jan 2026).
  • Vision-language models are limited in fine localization: LLaMA-3.1 + region matching achieves F1 = 69.10% at line level, whereas OCR-free Qwen2.5-VL is ineffective (<5% F1 at block level) (Kasuba et al., 26 Jun 2025).
  • Model comparison (AECV-Bench): Gemini 3 Pro, GPT-5.2, Claude Opus 4.5 lead overall (accuracy 0.72–0.85), but performance drops on spatial and counting (Kondratenko et al., 8 Jan 2026).

Failure modes include:

  • Symbol ambiguity, graphical diversity, rasterization errors: Doors miscounted as windows; double-leaf doors treated as single; dense regions skipped or hallucinated.
  • Multi-step numeric parsing: Scale-bar interpretation and dimension extraction—models often misread or fail to perform arithmetic (Doris et al., 2024).
  • Reasoning errors: Hallucination in chain-of-thought, improper region selection, lack of interpretability in complex diagram queries.

A plausible implication is that hybrid systems combining domain-specific symbol detectors, graph-based reasoning, and multimodal retrieval (e.g., a YOLO-based detector for doors/windows plus an LLM for layout reasoning) are necessary (Kondratenko et al., 8 Jan 2026).
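
A hypothetical sketch of such a hybrid pipeline, in which a dedicated symbol detector supplies grounded counts and an LLM handles question phrasing and layout reasoning (all interfaces and names here are illustrative assumptions, not a published system):

def hybrid_count_answer(question, drawing_image, symbol_detector, llm):
    # 1. Domain-specific detection (e.g., a YOLO-style model fine-tuned on
    #    door/window symbols) returns (label, box, score) triples.
    detections = symbol_detector(drawing_image)
    counts = {}
    for label, box, score in detections:
        counts[label] = counts.get(label, 0) + 1
    # 2. Pass structured counts plus the question to the LLM for reasoning.
    context = "Detected symbols: " + ", ".join(f"{k}: {v}" for k, v in counts.items())
    return llm(f"{context}\nQuestion: {question}\nAnswer:")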

6. Best Practices and Future Research Directions

Recommended strategies across recent works include bounding-box-aware prompting with explicit spatial tokens, chain-of-thought supervision validated for spatial consistency, grounding granularity matched to document type (line-level for text-rich pages, multi-box for charts and diagrams), and hybrid symbol detection for drawing-heavy domains.

Near-term research directions include integrating vectorized CAD formats, augmenting training with synthetic diagrams and technical drawings, expanding annotation granularity, and developing agents capable of fully compositional spatial reasoning. The stable capability gradient from text-centric to symbol-centric QA underscores the need to bridge current architectural and methodological gaps in drawing literacy.

7. Applications and Broader Implications

Drawing-grounded document QA underpins diverse applications, including:

  • Invoice and form automation: Spatially precise extraction in financial and regulatory documents (Giovannini et al., 6 Jan 2025).
  • Scientific diagram understanding: QA over plots, charts, and annotated figures in papers (Yu et al., 19 Nov 2025, Suri et al., 2024).
  • Engineering CAD and compliance: Document-grounded visual QA for technical design workflows, regulatory auditing, and technical standards enforcement (Doris et al., 2024).
  • Architectural plan analysis: Automated extraction, counting, and reasoning in floor plans and construction documents (Kondratenko et al., 8 Jan 2026).
  • Interactive explainer systems: Systems like MuDoC enable cross-linked textbook queries with figure navigation and visual localization (Taneja et al., 14 Feb 2025).

This field is integral to advancing trust, auditability, and automation in visually rich document workflows across domains where spatial grounding and multimodal reasoning are critical for accuracy and interpretability.
