DocVQA: Benchmark for Document VQA

Updated 12 March 2026

DocVQA is a large-scale dataset with 50,000 QA pairs over 12,767 document images, designed to evaluate multimodal reasoning, layout understanding, and OCR interpretation.
It offers detailed annotations with nine reasoning categories, covering aspects like form extraction, table reading, layout analysis, and handwritten text recognition.
Evaluation using metrics like ANLS and exact match accuracy reveals current model shortcomings, especially in handling complex layout and multi-modal information.

The DocVQA dataset is a large-scale, high-diversity benchmark for document visual question answering (VQA), designed to evaluate and drive progress in systems that read and reason over scanned or born-digital document images using natural language questions. It targets extractive and open-ended question answering, typically requiring semantic understanding, layout reasoning, OCR interpretation, and incorporation of multimodal cues beyond raw text, thus challenging both text-centric and vision-language architectures (Mathew et al., 2020, Mathew et al., 2020, Tito et al., 2021).

1. Dataset Composition and Structure

DocVQA comprises 12,767 document page images (from 6,071 unique documents) and a total of 50,000 natural language question–answer (QA) pairs (Mathew et al., 2020). The document sources are drawn from the UCSF Industry Documents Library (years: 1900–2018) and span five industry sectors (tobacco, food, drug, fossil fuel, chemical). Documents reflect a cross-section of real-world types: running-text reports, letters, forms, tables, diagrams, photographs, typewritten and handwritten entries, and born-digital layouts.

Each QA pair is associated with a single page, where the answer is explicitly present as a text span or short open-text response. The dataset is partitioned into training, validation, and test splits—train: 10,194 images / 39,463 questions; val: 1,286 images / 5,349 questions; test: 1,287 images / 5,188 questions (Mathew et al., 2020). All documents and questions are in English.
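As a quick sanity check, the split sizes quoted above can be summed and compared with the dataset totals. The sketch below hard-codes the figures from this section; the `Split` class and the `SPLITS` dictionary are illustrative names, not part of any official loader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Size of one DocVQA partition (figures copied from the text above)."""
    images: int
    questions: int


# Illustrative container; the names are assumptions, not an official API.
SPLITS = {
    "train": Split(images=10_194, questions=39_463),
    "val": Split(images=1_286, questions=5_349),
    "test": Split(images=1_287, questions=5_188),
}

total_images = sum(s.images for s in SPLITS.values())
total_questions = sum(s.questions for s in SPLITS.values())

# Fraction of all questions that falls into each split.
question_share = {
    name: s.questions / total_questions for name, s in SPLITS.items()
}

print(total_images, total_questions)  # 12767 50000
```

The totals match the 12,767 images and 50,000 questions stated for the full dataset, with roughly four fifths of the questions in the training split.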

Text density is notably high compared to scene-text VQA datasets: the average page contains approximately 183 OCR tokens, and scanned images have resolutions between 600 and 1200 dpi. OCR annotations include a bounding box for each token (Mathew et al., 2020).

Table: DocVQA and Related Datasets

Dataset	Images	QA Pairs	Avg. OCR Tokens	Domain
DocVQA	12,767	50,000	182.8	Document images
TextVQA	28,000	45,000	7.9	Scene images w/ text
ST-VQA	23,000	31,000	10.4	Scene text

2. Annotation Protocol and QA Typology

Annotation employed a three-stage workflow:

Stage 1: Annotators authored up to ten natural-language questions per page, requiring extractive answers directly visible on the image.
Stage 2: Independent verification—annotators re-entered and validated answers, assigned reasoning-based question types (from nine categories), and flagged ambiguous cases.
Stage 3: Author review for correction or removal of non-matching or low-quality items.

Each question is tagged with one or more of nine reasoning categories:

Form field extraction (key–value pairs)
Handwritten (answers in hand-written text)
Layout (header/layout/structural reasoning)
Running-text comprehension
Table/list reading
Figure (charts, plots)
Photograph (entities in embedded images)
Yes/no (binary inference)
Other (miscellaneous open-ended) (Mathew et al., 2020)

Approximate distribution: layout (25%), table/list (23%), form (16%), figure (9%), running-text (8%), handwritten (6%), photograph (6%), yes/no (6%). Questions may have multiple tags.

Example QAs:

“What date was this form signed?” → handwritten field
“What is the header at the top-left?” → layout
“What is the total amount in the second column?” → table

3. Data Formats and Access

Each record in DocVQA consists of:

Document page image (PNG/JPEG)
A list of OCR tokens with bounding boxes
One or more question–answer pairs per page

Formally, data instances can be viewed as tuples $(d_i, q_i, a_i)$ with $d_i = (I_i, T_i, B_i)$, where $I_i$ is the image, $T_i$ the token sequence, and $B_i$ the corresponding bounding boxes. Multiple alternative ground-truth answer variants are collected per question to allow for lexical variation (Mathew et al., 2020).
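The tuple view above maps naturally onto a pair of small record types. The following is a minimal sketch, not the dataset's actual schema: the class names, field names, the `(x_min, y_min, x_max, y_max)` box convention, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class Document:
    """d_i = (I_i, T_i, B_i): image, OCR tokens, and one box per token."""
    image_path: str
    tokens: List[str]
    boxes: List[BBox]

    def __post_init__(self) -> None:
        # T_i and B_i are parallel sequences, so their lengths must agree.
        if len(self.tokens) != len(self.boxes):
            raise ValueError("each OCR token needs exactly one bounding box")


@dataclass
class Instance:
    """One (d_i, q_i, a_i) tuple; `answers` holds the accepted variants of a_i."""
    document: Document
    question: str
    answers: List[str] = field(default_factory=list)


# Hypothetical example record (file name and values are made up).
doc = Document(
    image_path="page_0001.png",
    tokens=["Date:", "12/03/1998"],
    boxes=[(40, 52, 95, 70), (102, 52, 210, 70)],
)
sample = Instance(doc, "What is the date?", ["12/03/1998", "12/03/98"])
```

Keeping tokens and boxes as parallel lists lets text-only models read `tokens` directly, while layout-aware models can consume both sequences together.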

All splits are accessible in standard formats suitable for vision-language and text-only models; OCR spatial data supports layout-aware approaches.

4. Evaluation Protocols and Metrics

The primary benchmark metric is Average Normalized Levenshtein Similarity (ANLS), softly penalizing minor OCR or prediction mismatches:

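Let $d(a,\,\hat{a})$ be the Levenshtein distance between a reference answer $a$ and a prediction $\hat{a}$, and let $A_q$ be the set of accepted reference variants for question $q$. With a threshold $\tau$ (0.5 in the standard ANLS definition), below which a prediction still earns partial credit:

$\text{NED}(a,\,\hat{a}) = \frac{d(a,\,\hat{a})}{\max(|a|,\,|\hat{a}|)}, \qquad S(a,\,\hat{a}) = \begin{cases} 1 - \text{NED}(a,\,\hat{a}) & \text{if } \text{NED}(a,\,\hat{a}) < \tau \\ 0 & \text{otherwise} \end{cases}$

$\boxed{\text{ANLS} = \frac{1}{|Q|} \sum_{q \in Q} \max_{a \in A_q} S(a,\,\hat{a}_q)}$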
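The metric can be sketched in a few lines of pure Python. This version assumes the threshold $\tau = 0.5$ commonly used with ANLS, scores each prediction against its best-matching reference variant, and lowercases and trims both strings before comparison; the normalization and the function names are assumptions for illustration, not the official evaluation script. Setting `tau` above 1 disables the threshold and yields the plain similarity $1 - \text{NED}$.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance d(a, b) via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(reference: str, prediction: str, tau: float = 0.5) -> float:
    """S(a, a_hat): 1 - NED when NED is below tau, otherwise 0."""
    ref = reference.strip().lower()    # assumed normalization
    pred = prediction.strip().lower()
    longest = max(len(ref), len(pred))
    if longest == 0:
        return 1.0  # two empty strings are identical
    ned = levenshtein(ref, pred) / longest
    return 1.0 - ned if ned < tau else 0.0


def anls(references: Sequence[List[str]], predictions: Sequence[str],
         tau: float = 0.5) -> float:
    """Average over questions of the best similarity across answer variants."""
    if len(references) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not predictions:
        return 0.0
    scores = [
        max((similarity(ref, pred, tau) for ref in refs), default=0.0)
        for refs, pred in zip(references, predictions)
    ]
    return sum(scores) / len(scores)


refs = [["12/03/1998"], ["Philip Morris"]]
preds = ["12/03/1993", "tobacco"]
score = anls(refs, preds)
```

In this toy example the first prediction differs from its reference by one character out of ten and scores 0.9, the second is too far from its reference and scores 0, so `score` is 0.45.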

Exact match accuracy is also reported:

$\mathrm{Acc.} = \frac{\#\{\text{exactly correct answers}\}}{|Q|}\times 100\%$
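Exact match accuracy is simpler to compute than ANLS. A minimal sketch follows, assuming that a prediction counts as correct if it equals any accepted answer variant after lowercasing and whitespace normalization; that normalization and the function names are assumptions, since the exact matching rules are not specified here.

```python
from typing import List, Sequence


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_match_accuracy(references: Sequence[List[str]],
                         predictions: Sequence[str]) -> float:
    """Percentage of predictions that equal one of their reference variants."""
    if len(references) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not predictions:
        return 0.0
    correct = sum(
        any(normalize(pred) == normalize(ref) for ref in refs)
        for refs, pred in zip(references, predictions)
    )
    return 100.0 * correct / len(predictions)


refs = [["$1,200"], ["Yes"], ["R. J. Reynolds"]]
preds = ["$1,200", "yes", "RJ Reynolds"]
accuracy = exact_match_accuracy(refs, preds)  # two of three match
```

Unlike ANLS, the near miss "RJ Reynolds" receives no credit here, which is why the two metrics are usually reported side by side.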

The human upper bound on the test set is 94.36% accuracy, while the top model (BERT-large-squad, fine-tuned) reaches 55.77% accuracy ($\text{ANLS}=0.665$). Baselines include the VQA models LoRRA (7.63%) and M4C (24.81%), and the OCR substring upper bound is 87.0% (Mathew et al., 2020). Retrieval tasks (for the Document Collection VQA sub-dataset) use Mean Average Precision (MAP) (Mathew et al., 2020).
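For the retrieval setting, MAP can be sketched as follows. This is the textbook definition (precision averaged at each rank where a relevant document appears, then averaged over queries) and assumes each ranking contains no duplicate document IDs; the function names and toy IDs are illustrative and not taken from the challenge's evaluation code.

```python
from typing import Hashable, Sequence, Set


def average_precision(ranked: Sequence[Hashable],
                      relevant: Set[Hashable]) -> float:
    """Mean of precision@k over the ranks k at which a relevant item occurs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(rankings: Sequence[Sequence[Hashable]],
                           relevants: Sequence[Set[Hashable]]) -> float:
    """Average of per-query average precision."""
    if len(rankings) != len(relevants):
        raise ValueError("need one relevant set per ranking")
    if not rankings:
        return 0.0
    return sum(
        average_precision(r, rel) for r, rel in zip(rankings, relevants)
    ) / len(rankings)


rankings = [["d3", "d1", "d7", "d2"], ["d5", "d9"]]
relevants = [{"d1", "d2"}, {"d5"}]
map_score = mean_average_precision(rankings, relevants)
```

The first query finds its relevant documents at ranks 2 and 4 (average precision 0.5) and the second at rank 1 (average precision 1.0), giving a MAP of 0.75.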

Performance varies sharply by question type: models are strongest on table, form, and yes/no questions (reading-comprehension (RC) models reach 60–75% accuracy) and weakest on layout and figure questions.
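A per-type breakdown of this kind can be produced by bucketing results by tag. The sketch below assumes each evaluated question carries a list of category tags and a correctness flag, and that a multi-tag question counts once in every category it belongs to; the input format and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_type(
    results: Iterable[Tuple[List[str], bool]]
) -> Dict[str, float]:
    """Accuracy (in %) per question type; multi-tag items count in each tag."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for tags, is_correct in results:
        for tag in set(tags):  # guard against a tag listed twice
            totals[tag] += 1
            correct[tag] += int(is_correct)
    return {tag: 100.0 * correct[tag] / totals[tag] for tag in totals}


# Toy results: (tags of the question, whether the prediction was correct).
results = [
    (["table/list"], True),
    (["table/list", "layout"], False),
    (["layout"], False),
    (["form"], True),
]
breakdown = accuracy_by_type(results)
```

Because questions may carry several tags, the per-type denominators add up to more than the number of questions, so these figures are not a partition of overall accuracy.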

5. Challenges, Insights, and Benchmarks

Key challenges include:

Robust layout and structure modeling—flat token concatenation is inadequate for fields, tables, and multi-region cues.
Handwriting and OCR variability—handwritten fields reduce accuracy by ~20 points; OCR noise propagates errors.
Multi-modal content—figures, embedded images, charts, and photographs currently evade most models.
Open-vocabulary, copy-based answer space—QA models cannot restrict to a closed answer set (Mathew et al., 2020, Mathew et al., 2020).
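To make the copy-based answer space concrete, the sketch below locates an answer as a contiguous run of OCR tokens and returns the box enclosing that run, which is the kind of token-plus-position signal that flat text concatenation discards. It assumes case-insensitive whole-token matching and `(x_min, y_min, x_max, y_max)` boxes; the helper names and sample values are hypothetical, and real OCR output would usually need fuzzier matching.

```python
from typing import List, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


def find_answer_span(tokens: List[str],
                     answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices, end exclusive, of the first match."""
    target = [word.lower() for word in answer.split()]
    if not target:
        return None
    lowered = [token.lower() for token in tokens]
    width = len(target)
    for start in range(len(lowered) - width + 1):
        if lowered[start:start + width] == target:
            return start, start + width
    return None  # the answer cannot be copied from the OCR tokens


def union_box(boxes: List[BBox]) -> BBox:
    """Smallest box that encloses all the given boxes."""
    x_mins, y_mins, x_maxs, y_maxs = zip(*boxes)
    return (min(x_mins), min(y_mins), max(x_maxs), max(y_maxs))


tokens = ["Total", "Amount:", "$1,200", "USD"]
boxes = [(10, 10, 50, 20), (55, 10, 110, 20),
         (115, 10, 160, 20), (165, 10, 195, 20)]

span = find_answer_span(tokens, "$1,200 usd")
region = union_box(boxes[span[0]:span[1]]) if span else None
```

Here `span` is `(2, 4)` and `region` is `(115, 10, 195, 20)`. A `None` result signals an answer that no copy mechanism could extract from the OCR output, for example because of recognition errors in handwritten fields.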

Top-performing models leverage multimodal transformers, pointer networks, and spatial-region reasoning (e.g., M4C, TILT, LayoutLMv2). Nonetheless, large model–human gaps persist, especially for layout, figure, and multi-span reasoning (Tito et al., 2021).

DocVQA established a common benchmark leading directly to its derivatives:

ICDAR 2021 Challenge: Introduced Infographics VQA (layout-rich, arithmetic-heavy) (Tito et al., 2021).
iDocVQA: Reformulates DocVQA with high-level natural language instructions for each QA pair, showing that instruction tuning yields statistically significant but modest gains (up to 31× vs. zero-shot, though absolute accuracy still lags behind human performance) (Adewumi et al., 2024).
BBox-DocVQA: Adds explicit bounding box supervision for spatial reasoning, supporting fine-grained evaluation of region grounding (Yu et al., 2025).
Privacy-Aware DocVQA: Adopts federated and differential privacy settings using a large invoice document corpus where provider identity is considered sensitive (Tito et al., 2023).
M3DocVQA: Shifts to open-domain, multi-document retrieval-based DocVQA over >3,000 PDFs, through multi-modal retrieval pipelines (Cho et al., 2024).

DocVQA remains widely adopted as a pretraining corpus, fine-tuning benchmark, and testbed for OCR-aware, layout-robust vision–language models. It highlights the need for explicit document structure modeling, multimodal fusion, and robust, generalizable semantic reasoning.