Document Visual Question Answering
- Document VQA is a research field integrating OCR, visual feature extraction, and multimodal reasoning to interpret complex document layouts.
- It employs varied methodologies, including two-stage pipelines, end-to-end transformers, and vision-language models with document adapters for effective question answering.
- The approach supports tasks ranging from span extraction to relational and numerical reasoning, with applications in finance, insurance, and archival research.
Document Visual Question Answering (VQA) is a research area focused on developing machine learning models capable of answering natural language questions by reasoning over the visual and structural content of documents. These systems integrate components for visual feature extraction, text detection, and sophisticated multimodal reasoning to process documents that may include diverse visual layouts, dense text, tables, diagrams, or forms. In contrast to conventional VQA, which primarily deals with photographs, document VQA leverages optical character recognition (OCR), spatial layout modeling, and fusion of visual and linguistic signals to solve tasks—ranging from span extraction to relational reasoning or evidence retrieval—over both single documents and multi-page or multi-document collections (Mathew et al., 2020, Huynh et al., 7 Jan 2025, Tito et al., 2021, Ding et al., 2023, Tanaka et al., 2023).
1. Problem Formulation and Task Variants
Document VQA tasks are framed as answering questions posed about document images or image collections. The task formulation and requirements depend on task granularity and context:
- Single-Document QA: Given a document image d and a question q, extract an answer a as a text span or short answer present in d. Questions span various categories (figure, table, layout, yes/no, handwriting) and typically require reading or structural understanding of the page (Mathew et al., 2020).
- Collection-Level and Multi-Image QA: The input is a set of images D = {d_1, ..., d_N} and a question q. The system must:
  - Retrieve an evidence subset E ⊆ D containing all documents or pages necessary for reasoning.
  - Produce an answer a, which may involve aggregation, logic, or arithmetic across multiple documents (Tito et al., 2021, Tanaka et al., 2023).
- Some benchmarks (e.g., SlideVQA) explicitly annotate both evidence (slide indices) and answers, requiring multi-hop and numerical reasoning (Tanaka et al., 2023).
Question and answer types include extraction (free text, region index), binary (yes/no), counts, range queries, and relational answers (parent/child section linkages, arithmetic expressions) (Mathew et al., 2020, Ding et al., 2023, Tanaka et al., 2023).
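The input/output contract of these task variants can be summarized as a small typed schema. The field names below are illustrative only and do not correspond to the API of any cited benchmark:

```python
from dataclasses import dataclass, field

@dataclass
class SingleDocQA:
    """One page image, one question, one short/span answer."""
    image: bytes
    question: str
    answer: str = ""

@dataclass
class CollectionQA:
    """A document/slide collection plus question; the answer may
    aggregate information across the annotated evidence pages."""
    images: list[bytes]
    question: str
    evidence: list[int] = field(default_factory=list)  # indices of pages needed
    answer: str = ""
```

Benchmarks such as SlideVQA populate both `evidence` and `answer`, while single-document datasets leave the evidence list implicit (the one page).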
2. Datasets and Annotation Protocols
Document VQA has been advanced by several benchmark datasets, each characterizing distinct aspects of the task:
| Dataset | Images / Pages | Questions / QA Pairs | Special Features | Reference |
|---|---|---|---|---|
| DocVQA | 12,767 | 50,000 | Layout-diverse docs, 9 Q types | (Mathew et al., 2020) |
| DocCVQA | 14,362 | 20 | Collection-level, evidence subset, complex queries | (Tito et al., 2021) |
| PDF-VQA | 111,538 | 1,012,263 | Multi-page PDFs, element/structural/relational Q | (Ding et al., 2023) |
| SlideVQA | 52,000+ | 14,500+ | Slide decks, multi-hop, arithmetic, annotations | (Tanaka et al., 2023) |
Annotation protocols leverage OCR outputs with manual correction for ground-truth fidelity. Task-specific annotation includes linking answers to evidence slides/pages, constructing explicit arithmetic expressions for numerical questions, and QA template filling for structural relations. A key trend is the move toward large-scale, multi-page, and semantically-rich annotation, enabling robust multi-document reasoning evaluation (Tito et al., 2021, Ding et al., 2023, Tanaka et al., 2023).
3. Model Architectures and Reasoning Paradigms
Document VQA models can be grouped into three architectural families (Huynh et al., 7 Jan 2025):
A. Two-Stage (OCR → Reasoner) Pipelines:
- First, OCR detects and recognizes text regions, extracting tokens and bounding boxes (e.g., via Tesseract or CRAFT).
- Next, extracted text/position features and the query are encoded by a neural reasoner (often a Transformer or BERT/QA model).
- Sample: Text-spotting coupled with BERT span-extraction, sometimes employing key–value extraction and structured queries (Tito et al., 2021).
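A schematic sketch of this two-stage flow, with both stages stubbed out (a real system would call an OCR engine such as Tesseract in stage one and a Transformer span extractor in stage two; all names here are illustrative):

```python
def run_ocr(image) -> list[dict]:
    """Stand-in for an OCR front end: returns tokens with bounding boxes.
    A real pipeline would call e.g. pytesseract.image_to_data(image)."""
    return [
        {"text": "Invoice", "box": (10, 10, 80, 30)},
        {"text": "total:", "box": (10, 40, 60, 60)},
        {"text": "$420", "box": (70, 40, 110, 60)},
    ]

def answer(question: str, tokens: list[dict]) -> str:
    """Toy reasoner: return the token following the best-matching cue word.
    A real second stage would run a BERT-style QA model over the serialized
    tokens, positions, and question."""
    q_words = set(question.lower().rstrip("?").split())
    best_idx, best_overlap = 0, -1
    for i, tok in enumerate(tokens):
        overlap = len(q_words & {tok["text"].lower().rstrip(":")})
        if overlap > best_overlap:
            best_idx, best_overlap = i, overlap
    return tokens[min(best_idx + 1, len(tokens) - 1)]["text"]
```

The separation makes OCR errors in stage one invisible to, and uncorrectable by, stage two, which is the main weakness end-to-end models address.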
B. End-to-End Document-Aware Transformers:
- Unified vision-language Transformers ingest image patches/tokens (from CNN or ViT), positional/layout encodings, OCR tokens, and question embeddings.
- Self- and cross-attention layers enable joint reasoning over spatial, textual, and visual signals.
- Examples: LayoutLMv2 (BERT with layout/visual features), DocFormer (self/cross-attention), M4C (pointer network over OCR-object tokens) (Huynh et al., 7 Jan 2025, Ding et al., 2023).
- Graph-based GCN: PDF-VQA introduces GCNs over document elements (nodes with visual+textual features, spatial/hierarchical edges) for structural and relational reasoning (Ding et al., 2023).
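A toy relational message-passing step in the spirit of the R-GCN reasoning PDF-VQA applies to document elements. The scalar per-relation weight is a stand-in for the learned per-relation weight matrices of a real R-GCN, and the feature dimensions are illustrative:

```python
def rgcn_step(feats: dict[int, list[float]],
              edges: list[tuple[int, str, int]],
              rel_weight: dict[str, float]) -> dict[int, list[float]]:
    """One propagation step: each node adds the mean of its incoming
    neighbors' features, scaled by a per-relation weight."""
    out = {n: list(f) for n, f in feats.items()}
    incoming: dict[int, list[list[float]]] = {n: [] for n in feats}
    for src, rel, dst in edges:
        w = rel_weight[rel]
        incoming[dst].append([w * x for x in feats[src]])
    for n, msgs in incoming.items():
        if msgs:
            for d in range(len(out[n])):
                out[n][d] += sum(m[d] for m in msgs) / len(msgs)
    return out
```

Distinct relation types (spatial vs. hierarchical edges) get distinct weights, which is what the ablations in Section 5 remove when measuring the contribution of each edge type.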
C. Large Vision-Language Models (LVLMs) with Document Adapters:
- LVLMs (BLIP-2, Donut, InstructBLIP) are adapted for documents by inserting OCR/layout adapters or prompt tuning.
- Donut is OCR-free: a Swin Transformer visual encoder feeds an autoregressive text decoder that predicts answer tokens directly from image inputs.
- Adapters map detected text or layout signals into LVLM embedding spaces, enabling document grounding with minimal model modification (Huynh et al., 7 Jan 2025).
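The adapter idea reduces, at its simplest, to a learned projection from detected text/layout features into the LVLM's embedding space so the projected tokens can sit alongside the visual tokens. A minimal sketch, with illustrative dimensions and a plain matrix-vector product standing in for the learned linear layer:

```python
def project(vec: list[float], weights: list[list[float]]) -> list[float]:
    """Matrix-vector product standing in for a learned linear adapter."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def adapt_tokens(ocr_feats: list[list[float]],
                 layout_feats: list[list[float]],
                 weights: list[list[float]]) -> list[list[float]]:
    """Concatenate each token's text and layout features and map them
    into the (assumed) LVLM embedding dimension."""
    return [project(t + l, weights) for t, l in zip(ocr_feats, layout_feats)]
```

Only the adapter weights are trained, which is what keeps the modification to the frozen LVLM minimal.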
Multi-document tasks have seen the emergence of unified sequence-to-sequence models: SlideVQA frames both evidence retrieval and answer extraction as a single generation problem via a transformer-based encoder–decoder sequence output “evidence indices || answer” (Tanaka et al., 2023).
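Serializing evidence retrieval and answering into one target string makes both trainable with a single generation loss. A sketch of the (illustrative) encoding and decoding of such a target, following the "evidence indices || answer" shape described above:

```python
def build_target(indices: list[int], answer: str) -> str:
    """Serialize evidence slide indices and answer into one target string."""
    return " ".join(str(i) for i in indices) + " || " + answer

def parse_target(seq: str) -> tuple[list[int], str]:
    """Split a generated sequence back into evidence indices and answer."""
    evidence_part, _, answer = seq.partition("||")
    indices = [int(tok) for tok in evidence_part.split() if tok.isdigit()]
    return indices, answer.strip()
```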
4. Evaluation Metrics and Benchmarking
Evaluation protocols are multi-faceted, commonly divided by task type:
- Span/Extractive QA: Average Normalized Levenshtein Similarity (ANLS), granting partial credit for variants and near-matches:

  $$\mathrm{ANLS} = \frac{1}{N}\sum_{i=1}^{N} \max_{j}\, s(a_{ij}, o_{q_i}), \qquad s(a, o) = \begin{cases} 1 - \mathrm{NL}(a, o) & \text{if } \mathrm{NL}(a, o) < \tau \\ 0 & \text{otherwise} \end{cases}$$

  where NL is the normalized Levenshtein distance between prediction $o_{q_i}$ and ground-truth answer $a_{ij}$, and $\tau = 0.5$ is the acceptance threshold (Mathew et al., 2020).
- Retrieval: Mean Average Precision (MAP), Precision@k, and Recall@k for evidence document selection tasks. Combined DocCVQA metrics multiply retrieval and QA ANLS or accuracy for holistic scoring (Tito et al., 2021).
- Multi-Answer List Matching: ANLSL (ANLS for lists via optimal bipartite matching), capturing order-agnostic multi-item accuracy (Tito et al., 2021).
- Sequence Models: The main SlideVQA metric jointly averages evidence-selection F1 and answer F1; Jaccard overlap is also reported for predicted evidence slide indices (Tanaka et al., 2023).
- Discrete Tasks: Exact match for binary/counting/region-class answers, F1 for multi-token spans, and accuracy for element or relation prediction (Ding et al., 2023).
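A minimal Python sketch of the main string-matching metrics above (ANLS, its list variant ANLSL, and Jaccard over evidence indices). The threshold and the ANLSL normalization follow common conventions and should be checked against each benchmark's official scorer:

```python
from itertools import permutations

def _lev(a: str, b: str) -> int:
    """Levenshtein edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def _sim(p: str, g: str, tau: float = 0.5) -> float:
    """Thresholded normalized-Levenshtein similarity for one answer pair."""
    p, g = p.strip().lower(), g.strip().lower()
    nl = _lev(p, g) / (max(len(p), len(g)) or 1)
    return 1.0 - nl if nl < tau else 0.0

def anls(preds: list[str], gold: list[list[str]]) -> float:
    """ANLS: per question, keep the best similarity against any accepted
    gold answer, then average over questions."""
    return sum(max(_sim(p, a) for a in answers)
               for p, answers in zip(preds, gold)) / len(preds)

def anlsl(pred: list[str], gold: list[str]) -> float:
    """ANLS for lists: best order-agnostic one-to-one pairing, normalized
    here by the longer list. Brute force over permutations -- fine for
    the short answer lists typical of collection QA, exponential in
    general (the official scorer uses Hungarian matching)."""
    if not pred and not gold:
        return 1.0
    short, long_ = (pred, gold) if len(pred) <= len(gold) else (gold, pred)
    best = max((sum(_sim(short[i], long_[j]) for i, j in enumerate(perm))
                for perm in permutations(range(len(long_)), len(short))),
               default=0.0)
    return best / len(long_)

def evidence_jaccard(pred: set[int], gold: set[int]) -> float:
    """Jaccard overlap between predicted and gold evidence page indices."""
    return len(pred & gold) / len(pred | gold) if (pred or gold) else 1.0
```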
Representative model performance on public benchmarks includes:
- DocVQA (Test): Top ANLS ∼0.85 (PingAn-OneConnect-Gammalab-DQA); Baseline M4C: ∼0.52; LayoutLMv2: ∼0.70; Donut: ∼0.60 (Huynh et al., 7 Jan 2025, Mathew et al., 2020).
- DocCVQA: Retrieval MAP ∼73% (Text-spotting+BERT) with ANLSL 0.45–0.71 across baselines; Database+Database achieves highest QA scores (Tito et al., 2021).
- PDF-VQA: Graph GCN model outperforms six baselines (acc. 78.9% overall, +10–12% over pointer-based M4C on reasoning tasks) (Ding et al., 2023).
- SlideVQA: Unified seq2seq models achieve ∼40 joint F1 (vs. human ∼85); numerical reasoning remains most challenging (Tanaka et al., 2023).
5. Advances in Multimodal and Multidocument Reasoning
Recent developments focus on modeling hierarchical document structure, spatial relationships, and evidence aggregation:
- Graph-based Reasoning: PDF-VQA demonstrates that representing documents as heterogeneous graphs (nodes: elements; edges: spatial/hierarchical) and applying Relational GCNs yields improved multi-hop relational understanding and layout-sensitive predictions. Ablation studies show that removing hierarchy or spatial edges drops accuracy by 3–4% (Ding et al., 2023).
- Unified Generation: SlideVQA’s seq2seq approach is empirically superior to cascaded retrieval+reader baselines for evidence selection and answer synthesis, especially for multi-slide and arithmetic questions (Tanaka et al., 2023).
- Multi-page and Cross-document Pretraining: Pretraining document Transformers to directly handle multiple pages or documents, leveraging retrieval supervision and cross-page attention, is considered a next step for improving collection-level VQA (Tito et al., 2021).
- Numeric and Symbolic Reasoning: Datasets such as SlideVQA annotate intermediate arithmetic formulas, pushing models to ground entities and reason symbolically, with embedded neural arithmetic components suggested as future directions (Tanaka et al., 2023).
6. Applications, Challenges, and Outlook
Document VQA underpins essential workflows in finance (invoice extraction, cross-form consistency checks), insurance (multi-page claim analysis), archive research (historical record aggregation), contract understanding, and compliance auditing (Tito et al., 2021, Huynh et al., 7 Jan 2025). Robust deployment requires accurate OCR, domain-adaptive layout modeling, and low-latency inference on privacy-sensitive data.
Ongoing challenges include:
- Propagation and correction of OCR errors.
- Generalization to diverse and unseen document templates or handwriting.
- Integrating structured data (tables, charts) with unstructured text for seamless downstream QA.
- Cross-document and hierarchical multi-level reasoning for paginated or collection-based corpora.
- Scarcity of cross-lingual and low-resource document QA resources.
- Explainability through evidence highlighting, supporting regulatory or user-facing transparency (Huynh et al., 7 Jan 2025).
Future research is expected to further integrate graph neural architectures, layout-aware large vision–LLMs, and domain-specific adapters, with a focus on unified retrieval–reasoning frameworks capable of interactive, chain-of-thought elaboration over visually and linguistically complex document ecosystems (Tito et al., 2021, Ding et al., 2023, Tanaka et al., 2023, Huynh et al., 7 Jan 2025).