DOCRcase-200K: Fine-Grained Error Analysis Dataset
- DOCRcase-200K is a comprehensive dataset featuring over 200K document parsing instances annotated for text, table, and equation errors.
- It employs VLM-based extraction and multi-step annotation, providing a detailed error taxonomy with audit trails for robust document analysis.
- The dataset underpins model training and benchmarking by enabling precise error detection and driving measurable improvements in parsing system performance.
DOCRcase-200K is a large-scale, fine-grained error analysis dataset constructed specifically for training and evaluating document parsing systems, particularly those based on vision-language models (VLMs). Originating as the foundation of the DOCR-Inspector evaluation paradigm, it enables exhaustive classification of parsing errors in structured elements such as text, tables, and equations, reflecting the complexity of real-world unstructured document images (including English, Chinese, and mixed-language cases) across a broad spectrum of technical and non-technical domains (Zhang et al., 11 Dec 2025).
1. Dataset Scope, Sources, and Structure
DOCRcase-200K consists of 212,424 parsing instances, each comprising a cropped document image, its ground-truth parse (a text string, table HTML, or LaTeX equation), and a set of labeled error types (empty for error-free cases); a minimal record sketch appears at the end of this subsection. The coverage includes:
- Text elements: 110,040 (51.8%)
- Table elements: 57,510 (27.1%)
- Equation elements: 44,874 (21.1%)
Cropped images total ≈84,000, distributed as 65% English-only and 35% mixed English/Chinese documents. The dataset spans diverse document genres sampled from:
- Full-page corpora: olmOCR-mix-1025 (5,000 pages from various domains), CDLA (Chinese document layouts)
- Augmented element pools: UniMER-1M (10,000 equations), internal repositories (10,000 tables)
- Application domains: scientific papers, technical reports, slide decks, newspapers, forms, webpages, and documents with multi-language captions (Zhang et al., 11 Dec 2025).
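The paper does not prescribe a canonical serialization for these instances; the dataclass below is a minimal sketch of the fields each instance carries, with names such as `image_path` and `error_types` chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DOCRCase:
    """Illustrative schema for one DOCRcase-200K instance; field names are hypothetical."""
    image_path: str                 # cropped element image
    element_type: str               # "text" | "table" | "equation"
    ground_truth: str               # text string, table HTML, or LaTeX equation
    error_types: List[str] = field(default_factory=list)  # empty for a "good case"

    @property
    def is_good_case(self) -> bool:
        return not self.error_types

# Example: a single-error table case
case = DOCRCase(
    image_path="crops/table_00042.png",
    element_type="table",
    ground_truth="<table><tr><td>a</td><td>b</td></tr></table>",
    error_types=["missing_table_column"],
)
assert not case.is_good_case
```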
2. Dataset Composition and Error Distribution
Each entry is annotated with a detailed error typology. Cases are distributed as follows:
- "Good Cases" (no error): 18.83%
- Text: 20,003/110,040
- Table: 10,000/57,510
- Equation: 10,000/44,874
- "Bad Cases" (single error): 65.17%
- "Bad Cases" (2–4 compound, non-interfering errors): 16.00%
The dataset reflects eleven main block types, including text blocks, text with inline formulas, titles, lists, captions, table bodies, and equations. Error-type coverage and frequencies were engineered for balanced representation; for example, "inline formula recognition error" appears in 24,279 cases (10.95%) (Zhang et al., 11 Dec 2025). The table below summarizes representative error types per element.
| Element Type | Count | Representative Error Types |
|---|---|---|
| Text | 110,040 | Text repetition, character error, missed formula |
| Table | 57,510 | Missing column/row, redundancy, cell content error |
| Equation | 44,874 | Structure, syntax, character error |
3. Data Generation, Annotation, and Error Taxonomy
3.1 Extraction and Case Construction
Element crops and ground truth are obtained via document layout detection (MinerU2.0-vlm) and then verified and refined with VLMs (e.g., Qwen2.5-VL-72B-Instruct). Additional equations and tables are incorporated directly from UniMER-1M and high-quality internal datasets.
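The construction flow can be summarized as layout detection followed by VLM verification. The sketch below captures only that control flow; `detect_layout`, `crop`, and `vlm_verify` are hypothetical stand-ins for the MinerU2.0-vlm and Qwen2.5-VL-72B-Instruct stages, not their actual APIs.

```python
def build_cases(page_image, detect_layout, crop, vlm_verify):
    """Hypothetical control flow for element extraction (Section 3.1).

    detect_layout : yields element bounding boxes, types, and raw parses
                    (stand-in for MinerU2.0-vlm)
    vlm_verify    : confirms or refines a ground-truth parse
                    (stand-in for Qwen2.5-VL-72B-Instruct)
    """
    cases = []
    for box, element_type, raw_parse in detect_layout(page_image):
        image_crop = crop(page_image, box)
        ground_truth = vlm_verify(image_crop, raw_parse)  # VLM-refined ground truth
        if ground_truth is not None:                      # drop crops the VLM rejects
            cases.append((image_crop, element_type, ground_truth))
    return cases
```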
3.2 Error Synthesis
Erroneous cases are generated via the following strategies (two rule-based perturbations are sketched after this list):
- Rule-based perturbations (formatting, span, character modifications)
- LLM-guided hallucinations and semantic/structural distortions (using Gemini 2.5 Flash)
- Real-world failure mining (complex failures from production parsers)
- Compound assembly (combining 2–4 non-interfering error types per sample)
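As an illustration of the rule-based perturbation strategy, the sketch below synthesizes two of the listed error types: a text repetition (duplicating a sentence) and a missing table column (dropping a `<td>` cell from each row). The specific rules are assumptions for illustration, not the paper's exact perturbation recipes.

```python
import random
import re

def inject_text_repetition(text: str, rng: random.Random) -> str:
    """Duplicate one sentence to synthesize a 'text repetition' error."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    i = rng.randrange(len(sentences))
    sentences.insert(i, sentences[i])
    return " ".join(sentences)

def drop_table_column(html: str, col: int) -> str:
    """Remove the col-th <td> from every row to synthesize a 'missing table column' error."""
    def drop(row_match):
        cells = re.findall(r"<td>.*?</td>", row_match.group(0))
        if col < len(cells):
            cells.pop(col)
        return "<tr>" + "".join(cells) + "</tr>"
    return re.sub(r"<tr>.*?</tr>", drop, html)

rng = random.Random(0)
print(inject_text_repetition("First point. Second point.", rng))
print(drop_table_column("<tr><td>a</td><td>b</td></tr>", 1))  # -> <tr><td>a</td></tr>
```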
3.3 Annotation
All cases are labeled according to a 28-type fixed error taxonomy, further grouped by element and error granularity. Annotation includes a "Chain-of-Checklist" (CoCL) reasoning trace for each error, auto-generated and verified for logical completeness.
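The paper specifies the CoCL idea rather than a serialization format; the record below is one plausible shape for an annotated case, with keys and step wording chosen purely for illustration.

```python
# Hypothetical shape of one annotated case with a Chain-of-Checklist (CoCL) trace.
# Keys and step wording are illustrative; the paper describes the idea, not this format.
annotation = {
    "case_id": 20,
    "element_type": "table",
    "error_types": ["missing_table_column"],
    "cocl_trace": [
        "Check 1: count <td> cells per row in prediction vs. ground truth",
        "Check 2: prediction has one fewer cell in every row -> column missing",
        "Conclusion: label 'missing_table_column'",
    ],
}
```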
3.4 Error Typology
- Text (17 types): Classification, formatting, span, character, and inline formula errors
- Table (6 types): Integrity, structure, and content errors
- Equation (5 types): Type, integrity, and content errors
Example Error Types (a compact encoding of the taxonomy follows this list)
- Text repetition
- Paragraph format error
- Table cell recognition lost
- Missing table column
- Partial displayed formula missing
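For programmatic multi-label annotation, the taxonomy can be encoded as a mapping from element type to error labels. The sketch below includes only the types named in this article; the label strings are illustrative renderings, and the omitted types are marked rather than invented.

```python
# Partial encoding of the 28-type taxonomy (17 text + 6 table + 5 equation).
# Only types named in this article are listed; comments mark the omitted remainder.
ERROR_TAXONOMY = {
    "text": [
        "text_repetition", "paragraph_format_error",
        "inline_formula_recognition_error",
        # ... 14 more text types (classification, formatting, span, character)
    ],
    "table": [
        "missing_table_column", "table_cell_recognition_lost",
        # ... 4 more table types (integrity, structure, content)
    ],
    "equation": [
        "partial_displayed_formula_missing",
        # ... 4 more equation types (type, integrity, content)
    ],
}
```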
4. Statistical Profiles and Illustrative Examples
For a dataset of total size $N = 212{,}424$, the empirical frequency of error type $i$ is $f_i = n_i / N$, where $n_i$ is the count for error type $i$. The good-case rate is $40{,}003 / 212{,}424 \approx 18.83\%$ and the single-error rate is approximately $65.17\%$; inline formula recognition error, for example, occurs in $10.95\%$ of cases.
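A few lines of Python reproduce the element-level shares and the good-case rate from the published counts:

```python
counts = {"text": 110_040, "table": 57_510, "equation": 44_874}
N = sum(counts.values())                  # 212,424 total instances
good = 20_003 + 10_000 + 10_000          # 40,003 error-free cases

for element, n in counts.items():
    print(f"{element}: {n / N:.1%}")      # 51.8%, 27.1%, 21.1%
print(f"good-case rate: {good / N:.2%}")  # 18.83%
```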
Representative examples include:
- Text repetition: Duplicate sentences in a paragraph (ID 8)
- Missing table column: Omitted `<td>` tags in the HTML table representation (ID 20)
- Partial displayed formula missing: Formula with omitted terms in the recognized LaTeX (ID 26)
5. Application in Model Training and Evaluation
DOCRcase-200K underpins DOCR-Inspector and similar frameworks by providing training and benchmarking data for fine-grained, automated parsing quality assessment. Empirical validation is performed on DOCRcaseBench (882 real-world, human-annotated cases), where models—such as DOCR-Inspector-7B—demonstrate case-F1 scores over 96% for text, 86% for tables, and 85% for equations, with an error-type F1 uplift exceeding 25 percentage points over alternative models (Zhang et al., 11 Dec 2025).
6. Evaluation Protocols and Metrics
Model evaluation against DOCRcase-200K employs the following metrics (a minimal scoring sketch follows the list):
- Case-level detection: Precision, recall, F1
- Error-type detection (multi-label): precision, recall, F1
- Pass@K: whether at least one of K sampled analysis chains is correct
- Distribution alignment: Pearson and Kendall tau correlations against ground-truth automated metrics (edit distance, TEDS, CDM)
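As a concrete reading of the multi-label error-type metrics, the sketch below computes micro-averaged precision, recall, and F1 over predicted versus gold error-type sets; micro-averaging is an assumption here, since the exact aggregation is not restated in this article.

```python
from typing import List, Set

def error_type_prf(gold: List[Set[str]], pred: List[Set[str]]):
    """Micro-averaged multi-label precision/recall/F1 over error-type sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious labels
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"text_repetition"}, {"missing_table_column", "cell_content_error"}]
pred = [{"text_repetition"}, {"missing_table_column"}]
print(error_type_prf(gold, pred))  # (1.0, 0.666..., 0.8)
```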
Refinement loops demonstrate that fine-grained error feedback derived from DOCRcase-200K provides the largest downstream gains in metrics such as edit distance and TEDS, compared to binary or no-guidance baselines.
7. Use Cases and Contextual Impact
DOCRcase-200K addresses the need for large-scale, error-labeled corpora that reflect the nuanced, multi-element parsing challenges encountered in real-world document digitization and analysis. Its detailed error codification enables:
- Fine-grained training of VLM-based document parsers
- Automated assessment and continual refinement of parsing system outputs
- Analysis of error patterns for system diagnosis and targeted improvements
- Benchmarking of parsing models for robustness, granularity, and domain coverage
A plausible implication is that DOCRcase-200K, through its taxonomy, scale, and annotation methodology, establishes a new standard for validation and development in document understanding research, enhancing reproducibility and systematic progress in the field (Zhang et al., 11 Dec 2025).