DocVQA: Benchmark for Document VQA
- DocVQA is a large-scale dataset with 50,000 QA pairs over 12,767 document images, designed to evaluate multimodal reasoning, layout understanding, and OCR interpretation.
- It offers detailed annotations with nine reasoning categories, covering aspects like form extraction, table reading, layout analysis, and handwritten text recognition.
- Evaluation using metrics like ANLS and exact match accuracy reveals current model shortcomings, especially in handling complex layout and multi-modal information.
The DocVQA dataset is a large-scale, high-diversity benchmark for document visual question answering (VQA), designed to evaluate and drive progress in systems that read and reason over scanned or born-digital document images using natural language questions. It targets extractive and open-ended question answering, typically requiring semantic understanding, layout reasoning, OCR interpretation, and incorporation of multimodal cues beyond raw text, thus challenging both text-centric and vision-language architectures (Mathew et al., 2020, Mathew et al., 2020, Tito et al., 2021).
1. Dataset Composition and Structure
DocVQA comprises 12,767 document page images (from 6,071 unique documents) and a total of 50,000 natural language question–answer (QA) pairs (Mathew et al., 2020). The document sources are drawn from the UCSF Industry Documents Library (years: 1900–2018) and span five industry sectors (tobacco, food, drug, fossil fuel, chemical). Documents reflect a cross-section of real-world types: running-text reports, letters, forms, tables, diagrams, photographs, typewritten and handwritten entries, and born-digital layouts.
Each QA pair is associated with a single page, where the answer is explicitly present as a text span or short open-text response. The dataset is partitioned into training, validation, and test splits—train: 10,194 images / 39,463 questions; val: 1,286 images / 5,349 questions; test: 1,287 images / 5,188 questions (Mathew et al., 2020). All documents and questions are in English.
Text density is notably high compared to scene-text VQA datasets: the average page contains approximately 183 OCR tokens, and scanned image resolutions range from 600 to 1200 dpi. OCR annotations include a bounding box for each token (Mathew et al., 2020).
Table: DocVQA and Related Datasets
| Dataset | Images | QA Pairs | Avg. OCR Tokens | Domain |
|---|---|---|---|---|
| DocVQA | 12,767 | 50,000 | 182.8 | Document images |
| TextVQA | 28,000 | 45,000 | 7.9 | Scene images w/ text |
| ST-VQA | 23,000 | 31,000 | 10.4 | Scene text |
2. Annotation Protocol and QA Typology
Annotation employed a three-stage workflow:
- Stage 1: Annotators authored up to ten natural-language questions per page, requiring extractive answers directly visible on the image.
- Stage 2: Independent verification—annotators re-entered and validated answers, assigned reasoning-based question types (from nine categories), and flagged ambiguous cases.
- Stage 3: Author review for correction or removal of non-matching or low-quality items.
Each question is tagged with one or more of nine reasoning categories:
- Form field extraction (key–value pairs)
- Handwritten (answers in handwritten text)
- Layout (header/layout/structural reasoning)
- Running-text comprehension
- Table/list reading
- Figure (charts, plots)
- Photograph (entities in embedded images)
- Yes/no (binary inference)
- Other (miscellaneous open-ended) (Mathew et al., 2020)
Approximate distribution: layout (25%), table/list (23%), form (16%), figure (9%), running-text (8%), handwritten (6%), photograph (6%), yes/no (6%). Questions may have multiple tags.
Example QAs:
- “What date was this form signed?” → handwritten field
- “What is the header at the top-left?” → layout
- “What is the total amount in the second column?” → table
3. Data Formats and Access
Each record in DocVQA consists of:
- Document page image (PNG/JPEG)
- A list of OCR tokens with bounding boxes
- One or more question–answer pairs per page
Formally, data instances can be viewed as tuples $(D, q, a)$ with $D = (I, T, B)$, where $I$ is the image, $T$ the token sequence, and $B$ the bounding boxes; $q$ is the question and $a$ its answer. Multiple alternative ground-truth answer variants are collected for lexical flexibility (Mathew et al., 2020).
All splits are accessible in standard formats suitable for vision-language and text-only models; OCR spatial data supports layout-aware approaches.
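As an illustration of the record structure described in this section, the following is a minimal sketch of one possible in-memory representation. The class and field names (`OcrToken`, `QAPair`, `DocVQARecord`, `question_types`) and the sample values are assumptions made for this example; they are not the official DocVQA file schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema: names are illustrative, not the official DocVQA format.
BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class OcrToken:
    text: str
    bbox: BBox


@dataclass
class QAPair:
    question: str
    answers: List[str]  # accepted ground-truth answer variants
    question_types: List[str] = field(default_factory=list)


@dataclass
class DocVQARecord:
    image_path: str
    tokens: List[OcrToken]
    qa_pairs: List[QAPair]

    def token_sequence(self) -> List[str]:
        """Return the OCR token texts in stored order."""
        return [t.text for t in self.tokens]

    def boxes(self) -> List[BBox]:
        """Return bounding boxes aligned with token_sequence()."""
        return [t.bbox for t in self.tokens]


# Made-up sample record, for illustration only.
record = DocVQARecord(
    image_path="page_0001.png",  # placeholder path
    tokens=[
        OcrToken("Date:", (40, 30, 95, 48)),
        OcrToken("3/14/1994", (100, 30, 190, 48)),
    ],
    qa_pairs=[QAPair("What is the date?", ["3/14/1994", "3/14/94"], ["form"])],
)
```

Keeping token texts and boxes index-aligned lets a text-only model consume `token_sequence()` while a layout-aware model additionally reads `boxes()`.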
4. Evaluation Protocols and Metrics
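The primary benchmark metric is Average Normalized Levenshtein Similarity (ANLS), which softly penalizes minor OCR or prediction mismatches.
Let $\mathrm{NL}(a_{ij}, o_i)$ be the Levenshtein distance between reference $a_{ij}$ and prediction $o_i$, normalized by the length of the longer string. With $N$ questions and threshold $\tau = 0.5$:
$$\mathrm{ANLS} = \frac{1}{N} \sum_{i=1}^{N} \max_{j} \, s(a_{ij}, o_i), \qquad s(a_{ij}, o_i) = \begin{cases} 1 - \mathrm{NL}(a_{ij}, o_i) & \text{if } \mathrm{NL}(a_{ij}, o_i) < \tau \\ 0 & \text{otherwise} \end{cases}$$
Exact match accuracy is also reported:
$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\, o_i \in \{a_{ij}\}_{j} \,\right]$$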
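Both metrics can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation script: the normalization step (lowercasing and collapsing whitespace) and the default threshold of 0.5 are assumptions based on how ANLS is commonly defined, and the function names are chosen for this example.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def anls_score(prediction: str, references: Sequence[str], tau: float = 0.5) -> float:
    """Best thresholded similarity of one prediction against its references."""
    pred = normalize(prediction)
    best = 0.0
    for ref in references:
        ref_n = normalize(ref)
        longest = max(len(pred), len(ref_n))
        nl = levenshtein(pred, ref_n) / longest if longest else 0.0
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best


def anls(predictions: List[str], references: List[Sequence[str]], tau: float = 0.5) -> float:
    scores = [anls_score(p, refs, tau) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores)


def exact_match(predictions: List[str], references: List[Sequence[str]]) -> float:
    hits = [normalize(p) in {normalize(r) for r in refs}
            for p, refs in zip(predictions, references)]
    return sum(hits) / len(hits)
```

For example, a prediction of `"hello"` against the reference `"hallo"` has one edit over five characters, so it contributes a similarity of 0.8 to ANLS but counts as a miss for exact match.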
The test set’s human upper bound is 94.36% accuracy, while the top baseline model (BERT-large-squad, fine-tuned) reaches 55.77% accuracy. Other baselines include LoRRA (VQA, 7.63%), M4C (24.81%), and an OCR substring upper bound (87.0%) (Mathew et al., 2020). Retrieval tasks (for the Document Collection VQA sub-dataset) use Mean Average Precision (MAP) (Mathew et al., 2020).
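The OCR substring upper bound measures how often a ground-truth answer can be found verbatim in the OCR output, which caps what a purely extractive model can achieve. A rough sketch follows, assuming tokens are joined with single spaces and matching is case-insensitive; the function names and the sample data are made up for illustration and do not reproduce the exact procedure of the original paper.

```python
from typing import List, Sequence, Tuple


def answer_in_ocr(tokens: Sequence[str], answers: Sequence[str]) -> bool:
    """True if any non-empty answer variant occurs in the serialized OCR text."""
    text = " ".join(t.lower() for t in tokens)
    return any(a.strip().lower() in text for a in answers if a.strip())


def substring_upper_bound(samples: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Fraction of questions whose answer is recoverable as an OCR substring."""
    hits = sum(answer_in_ocr(tokens, answers) for tokens, answers in samples)
    return hits / len(samples)


# Made-up samples: (OCR tokens, accepted answers).
samples = [
    (["Total", "amount", "$42"], ["$42"]),
    (["Dear", "Sir"], ["Madam"]),
]
```

Questions that fail this check are typically those affected by OCR errors, by answers split across non-adjacent tokens, or by answers that are not extractive at all.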
Performance varies sharply by question type: accuracy is highest on table, form, and yes/no questions and lowest on layout and figure questions.
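A per-category breakdown of this kind can be computed by counting each question once under every tag it carries. The sketch below assumes results are available as `(question_types, is_correct)` pairs; that input format and the function name are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def accuracy_by_type(results: Iterable[Tuple[Sequence[str], bool]]) -> Dict[str, float]:
    """Accuracy per question type; multi-tagged questions count under each tag."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for types, is_correct in results:
        for question_type in types:
            totals[question_type] += 1
            correct[question_type] += int(is_correct)
    return {t: correct[t] / totals[t] for t in totals}


# Made-up results for illustration.
results = [
    (["table"], True),
    (["table", "layout"], False),
    (["layout"], False),
]
breakdown = accuracy_by_type(results)
```

Because questions may carry several tags, the per-type counts can sum to more than the number of questions, so these figures should not be averaged to recover overall accuracy.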
5. Challenges, Insights, and Benchmarks
Key challenges include:
- Robust layout and structure modeling—flat token concatenation is inadequate for fields, tables, and multi-region cues.
- Handwriting and OCR variability—handwritten fields reduce accuracy by ~20 points; OCR noise propagates errors.
- Multi-modal content—figures, embedded images, charts, and photographs currently evade most models.
- Open-vocabulary, copy-based answer space—QA models cannot restrict to a closed answer set (Mathew et al., 2020, Mathew et al., 2020).
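To make the first challenge concrete, the sketch below shows one very simple alternative to flat token concatenation: grouping OCR tokens into text lines by vertical position and ordering each line left to right. It is only an illustration under stated assumptions (axis-aligned boxes as `(x_min, y_min, x_max, y_max)`, a hand-picked `y_tolerance`, a hypothetical function name), not a method from the cited papers, and it does not handle multi-column pages or tables.

```python
from typing import List, Tuple

Token = Tuple[str, Tuple[int, int, int, int]]  # (text, (x_min, y_min, x_max, y_max))


def group_into_lines(tokens: List[Token], y_tolerance: float = 10) -> List[List[str]]:
    """Group tokens into lines by vertical center, then sort each line by x_min."""
    def center_y(token: Token) -> float:
        _, (_, y_min, _, y_max) = token
        return (y_min + y_max) / 2

    lines: List[Tuple[float, List[Token]]] = []
    for token in sorted(tokens, key=center_y):
        # Join the current line if the token sits close to that line's first token.
        if lines and abs(center_y(token) - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(token)
        else:
            lines.append((center_y(token), [token]))
    return [
        [text for text, _ in sorted(members, key=lambda t: t[1][0])]
        for _, members in lines
    ]


# Made-up tokens in arbitrary OCR order.
tokens = [
    ("42", (200, 52, 230, 70)),
    ("Total", (40, 50, 100, 68)),
    ("Invoice", (40, 10, 120, 28)),
]
lines = group_into_lines(tokens)
```

Even this crude grouping keeps a key and its value on the same line, which a flat concatenation in arbitrary OCR order does not guarantee.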
Top-performing models leverage multimodal transformers, pointer networks, and spatial-region reasoning (e.g., M4C, TILT, LayoutLMv2). Nonetheless, large model–human gaps persist, especially for layout, figure, and multi-span reasoning (Tito et al., 2021).
6. Extensions, Related Corpora, and Impact
DocVQA established a common benchmark that led directly to several derivatives and extensions:
- ICDAR 2021 Challenge: Introduced Infographics VQA (layout-rich, arithmetic-heavy) (Tito et al., 2021).
- iDocVQA: Reformulates DocVQA with high-level natural language instructions for each QA pair, showing that instruction tuning yields statistically significant but modest gains (up to 31× over zero-shot), although absolute accuracy still lags behind human performance (Adewumi et al., 2024).
- BBox-DocVQA: Adds explicit bounding box supervision for spatial reasoning, supporting fine-grained evaluation of region grounding (Yu et al., 2025).
- Privacy-Aware DocVQA: Adopts federated and differential privacy settings using a large invoice document corpus where provider identity is considered sensitive (Tito et al., 2023).
- M3DocVQA: Shifts to open-domain, multi-document retrieval-based DocVQA over >3,000 PDFs, through multi-modal retrieval pipelines (Cho et al., 2024).
DocVQA remains widely adopted as a pretraining corpus, fine-tuning benchmark, and testbed for OCR-aware, layout-robust vision-language models. It highlights the need for explicit document structure modeling, multimodal fusion, and robust, generalizable semantic reasoning.