
DocVQA: Benchmark for Document VQA

Updated 12 March 2026
  • DocVQA is a large-scale dataset with 50,000 QA pairs over 12,767 document images, designed to evaluate multimodal reasoning, layout understanding, and OCR interpretation.
  • It offers detailed annotations with nine reasoning categories, covering aspects like form extraction, table reading, layout analysis, and handwritten text recognition.
  • Evaluation using metrics like ANLS and exact match accuracy reveals current model shortcomings, especially in handling complex layout and multi-modal information.

The DocVQA dataset is a large-scale, high-diversity benchmark for document visual question answering (VQA), designed to evaluate and drive progress in systems that read and reason over scanned or born-digital document images using natural language questions. It targets extractive and open-ended question answering, typically requiring semantic understanding, layout reasoning, OCR interpretation, and incorporation of multimodal cues beyond raw text, thus challenging both text-centric and vision-language architectures (Mathew et al., 2020, Mathew et al., 2020, Tito et al., 2021).

1. Dataset Composition and Structure

DocVQA comprises 12,767 document page images (from 6,071 unique documents) and a total of 50,000 natural language question–answer (QA) pairs (Mathew et al., 2020). The document sources are drawn from the UCSF Industry Documents Library (years: 1900–2018) and span five industry sectors (tobacco, food, drug, fossil fuel, chemical). Documents reflect a cross-section of real-world types: running-text reports, letters, forms, tables, diagrams, photographs, typewritten and handwritten entries, and born-digital layouts.

Each QA pair is associated with a single page, where the answer is explicitly present as a text span or short open-text response. The dataset is partitioned into training, validation, and test splits—train: 10,194 images / 39,463 questions; val: 1,286 images / 5,349 questions; test: 1,287 images / 5,188 questions (Mathew et al., 2020). All documents and questions are in English.
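As a quick consistency check, the split sizes quoted above can be summed and compared with the dataset totals. This is a minimal sketch. The `SPLITS` dictionary and both helper names are illustrative and not part of any official DocVQA tooling.

```python
# Split sizes as reported in the text: (images, questions) per split.
SPLITS = {
    "train": (10_194, 39_463),
    "val": (1_286, 5_349),
    "test": (1_287, 5_188),
}


def split_totals(splits):
    """Return (total_images, total_questions) summed over all splits."""
    images = sum(n_img for n_img, _ in splits.values())
    questions = sum(n_q for _, n_q in splits.values())
    return images, questions


def split_fractions(splits):
    """Return each split's share of all questions as a fraction in [0, 1]."""
    _, total_q = split_totals(splits)
    return {name: n_q / total_q for name, (_, n_q) in splits.items()}


if __name__ == "__main__":
    print(split_totals(SPLITS))
    print(split_fractions(SPLITS))
```

The sums match the 12,767 images and 50,000 questions stated for the full dataset, with roughly four fifths of the questions in the training split.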

Text density is notably high compared to scene-text VQA datasets: the average page contains approximately 183 OCR tokens, and scanned image resolutions range from 600 to 1200 dpi. OCR annotations include bounding boxes per token (Mathew et al., 2020).

Table: DocVQA and Related Datasets

Dataset | Images | QA Pairs | Avg. OCR Tokens | Domain
DocVQA | 12,767 | 50,000 | 182.8 | Document images
TextVQA | 28,000 | 45,000 | 7.9 | Scene images w/ text
ST-VQA | 23,000 | 31,000 | 10.4 | Scene text

2. Annotation Protocol and QA Typology

Annotation employed a three-stage workflow:

  • Stage 1: Annotators authored up to ten natural-language questions per page, requiring extractive answers directly visible on the image.
  • Stage 2: Independent verification—annotators re-entered and validated answers, assigned reasoning-based question types (from nine categories), and flagged ambiguous cases.
  • Stage 3: Author review for correction or removal of non-matching or low-quality items.

Each question is tagged with one or more of nine reasoning categories:

  1. Form field extraction (key–value pairs)
  2. Handwritten (answers in hand-written text)
  3. Layout (header/layout/structural reasoning)
  4. Running-text comprehension
  5. Table/list reading
  6. Figure (charts, plots)
  7. Photograph (entities in embedded images)
  8. Yes/no (binary inference)
  9. Other (miscellaneous open-ended) (Mathew et al., 2020)

Approximate distribution: layout (25%), table/list (23%), form (16%), figure (9%), running-text (8%), handwritten (6%), photograph (6%), yes/no (6%). Questions may have multiple tags.

Example QAs:

  • “What date was this form signed?” → handwritten field
  • “What is the header at the top-left?” → layout
  • “What is the total amount in the second column?” → table

3. Data Formats and Access

Each record in DocVQA consists of:

  • Document page image (PNG/JPEG)
  • A list of OCR tokens with bounding boxes
  • One or more question–answer pairs per page

Formally, data instances can be viewed as tuples $(d_i, q_i, a_i)$ with $d_i = (I_i, T_i, B_i)$, where $I_i$ is the image, $T_i$ the token sequence, and $B_i$ the bounding boxes. Extensive alternative ground-truth answer variants are collected for lexical flexibility (Mathew et al., 2020).
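The tuple view above maps naturally onto a pair of small record types. The sketch below is one possible in-memory representation, not the dataset's actual file schema: the class names, field names, and the `(x_min, y_min, x_max, y_max)` box convention are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class Document:
    """d_i = (I_i, T_i, B_i): page image, OCR tokens, and token boxes."""
    image_path: str       # I_i, stored here as a path to the page image
    tokens: List[str]     # T_i, OCR token sequence
    boxes: List[BBox]     # B_i, one bounding box per token

    def __post_init__(self):
        # Tokens and boxes are parallel lists, so their lengths must agree.
        if len(self.tokens) != len(self.boxes):
            raise ValueError("each OCR token needs exactly one bounding box")


@dataclass
class QAInstance:
    """(d_i, q_i, a_i), with a_i kept as a list of accepted answer variants."""
    document: Document
    question: str
    answers: List[str]
```

Keeping `answers` as a list reflects the alternative ground-truth variants mentioned above; a single-answer dataset would simply store one element.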

All splits are accessible in standard formats suitable for vision-language and text-only models; OCR spatial data supports layout-aware approaches.

4. Evaluation Protocols and Metrics

The primary benchmark metric is Average Normalized Levenshtein Similarity (ANLS), softly penalizing minor OCR or prediction mismatches:

Let $d(a, \hat{a})$ be the Levenshtein distance between reference $a$ and prediction $\hat{a}$. Then

$$\text{NED}(a,\,\hat{a}) = \frac{d(a,\,\hat{a})}{\max(|a|,\,|\hat{a}|)}, \qquad S(a,\,\hat{a}) = \max(0, 1 - \text{NED}(a,\,\hat{a}))$$

$$\boxed{\text{ANLS} = \frac{1}{|Q|} \sum_{q \in Q} S(a_q,\,\hat{a}_q)}$$
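The metric as written above can be sketched in a few lines of plain Python. This follows the formula in this section literally, with one reference per question. Public ANLS implementations commonly also zero out scores whose normalized distance exceeds a threshold and take the best score over several reference answers; neither step is included here because the formula above does not state them. The function names are illustrative.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(ref: str, pred: str) -> float:
    """S(a, a_hat) = max(0, 1 - NED(a, a_hat))."""
    longest = max(len(ref), len(pred))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as a perfect match
    ned = levenshtein(ref, pred) / longest
    return max(0.0, 1.0 - ned)


def anls(references: Sequence[str], predictions: Sequence[str]) -> float:
    """Mean similarity over all questions, one reference per question."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must align one-to-one")
    if not references:
        return 0.0
    scores = [similarity(r, p) for r, p in zip(references, predictions)]
    return sum(scores) / len(scores)
```

For example, a prediction that differs from a four-character reference by one character scores 0.75 for that question, which is how the metric softens small OCR slips that exact match would count as plain errors.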

Exact match accuracy is also reported:

$$\mathrm{Acc.} = \frac{\#\{\text{exactly correct answers}\}}{Q_{\mathrm{total}}}\times 100\%$$
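Exact match accuracy is simpler to compute than ANLS. The sketch below assumes that answers are compared after trimming whitespace and lowercasing; that normalization is an assumption made for illustration, since the text does not specify one, and the function names are hypothetical.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Assumed normalization: trim surrounding whitespace and lowercase."""
    return text.strip().lower()


def exact_match_accuracy(references: Sequence[str],
                         predictions: Sequence[str]) -> float:
    """Percentage of predictions that equal their reference after normalization."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must align one-to-one")
    if not references:
        return 0.0
    correct = sum(normalize(r) == normalize(p)
                  for r, p in zip(references, predictions))
    return 100.0 * correct / len(references)
```

Unlike ANLS, a near miss such as a single mis-recognized character earns no credit here, which is why the two metrics are usually reported side by side.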

The test set’s human upper bound is 94.36% accuracy, with the top model (BERT-large-squad, fine-tuned) reaching 55.77% accuracy (ANLS = 0.665). Baselines include LoRRA (VQA, 7.63%) and M4C (24.81%), and the OCR substring upper bound is 87.0% (Mathew et al., 2020). Retrieval tasks (for the Document Collection VQA sub-dataset) use Mean Average Precision (MAP) (Mathew et al., 2020).
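One plausible reading of the OCR substring upper bound is the share of questions for which some accepted answer appears verbatim in the page's OCR text, which caps what a purely extractive model can achieve. The sketch below implements that reading. Joining tokens with single spaces and matching case-insensitively are assumptions, and the function names are illustrative.

```python
from typing import Iterable, List, Sequence, Tuple


def answer_in_ocr(tokens: Sequence[str], answers: Sequence[str]) -> bool:
    """True if any non-empty answer variant occurs in the joined OCR text."""
    text = " ".join(tokens).lower()  # assumed serialization of the page
    return any(ans.strip().lower() in text
               for ans in answers if ans.strip())


def substring_upper_bound(
    examples: Iterable[Tuple[Sequence[str], Sequence[str]]]
) -> float:
    """Percentage of (tokens, answers) pairs whose answer is extractable."""
    items: List[Tuple[Sequence[str], Sequence[str]]] = list(examples)
    if not items:
        return 0.0
    hits = sum(answer_in_ocr(tokens, answers) for tokens, answers in items)
    return 100.0 * hits / len(items)
```

Questions that fall outside this bound are typically those where OCR missed or garbled the answer text, or where the answer is not a contiguous span on the page.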

Performance varies sharply by question type: strongest on tables, forms, and yes/no (RC: 60–75% acc.), weakest on layout and figure questions.
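A per-type breakdown like the one described above can be computed by grouping results by question tag. Because a question may carry several tags, the sketch below counts it once under each of its tags, so the per-tag totals can exceed the number of questions. The input layout and function name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def accuracy_by_type(
    records: Iterable[Tuple[Sequence[str], bool]]
) -> Dict[str, float]:
    """Map each question tag to its accuracy in percent.

    Each record is (tags, is_correct); a multi-tag question contributes
    to every tag it carries.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for tags, is_correct in records:
        for tag in tags:
            totals[tag] += 1
            correct[tag] += int(is_correct)
    return {tag: 100.0 * correct[tag] / totals[tag] for tag in totals}


if __name__ == "__main__":
    demo = [
        (["table/list"], True),
        (["table/list", "layout"], False),
        (["layout"], False),
    ]
    print(accuracy_by_type(demo))
```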

5. Challenges, Insights, and Benchmarks

Key challenges include:

  • Robust layout and structure modeling—flat token concatenation is inadequate for fields, tables, and multi-region cues.
  • Handwriting and OCR variability—handwritten fields reduce accuracy by ~20 points; OCR noise propagates errors.
  • Multi-modal content—figures, embedded images, charts, and photographs currently evade most models.
  • Open-vocabulary, copy-based answer space—QA models cannot restrict to a closed answer set (Mathew et al., 2020, Mathew et al., 2020).

Top-performing models leverage multimodal transformers, pointer networks, and spatial-region reasoning (e.g., M4C, TILT, LayoutLMv2). Nonetheless, large model–human gaps persist, especially for layout, figure, and multi-span reasoning (Tito et al., 2021).

DocVQA established a common benchmark leading directly to its derivatives:

  • ICDAR 2021 Challenge: Introduced Infographics VQA (layout-rich, arithmetic-heavy) (Tito et al., 2021).
  • iDocVQA: Reformulates DocVQA with high-level natural language instructions for each QA pair, showing instruction-tuning yields statistically significant but modest gains (up to 31× vs. zero-shot, though absolute accuracy lags behind human) (Adewumi et al., 2024).
  • BBox-DocVQA: Adds explicit bounding box supervision for spatial reasoning, supporting fine-grained evaluation of region grounding (Yu et al., 2025).
  • Privacy-Aware DocVQA: Adopts federated and differential privacy settings using a large invoice document corpus where provider identity is considered sensitive (Tito et al., 2023).
  • M3DocVQA: Shifts to open-domain, multi-document retrieval-based DocVQA over >3,000 PDFs, through multi-modal retrieval pipelines (Cho et al., 2024).

DocVQA remains widely adopted as a pretraining corpus, fine-tuning benchmark, and testbed for OCR-aware, layout-robust vision-language models. It highlights the need for explicit document structure modeling, multimodal fusion, and robust, generalized semantic reasoning.
