DOCRcase-200K: Fine-Grained Error Analysis Dataset
- DOCRcase-200K is a comprehensive dataset featuring over 200K document parsing instances annotated for text, table, and equation errors.
- It employs VLM-based extraction and multi-step annotation, providing a detailed error taxonomy with audit trails for robust document analysis.
- The dataset underpins model training and benchmarking by enabling precise error detection and driving measurable improvements in parsing system performance.
DOCRcase-200K is a large-scale, fine-grained error analysis dataset constructed specifically for training and evaluating document parsing systems, particularly those based on vision-language models (VLMs). Originating as the foundation of the DOCR-Inspector evaluation paradigm, it enables exhaustive classification of parsing errors in structured elements such as text, tables, and equations, reflecting the complexity of real-world unstructured document images (including English, Chinese, and mixed-language cases) across a broad spectrum of technical and non-technical domains (Zhang et al., 11 Dec 2025).
1. Dataset Scope, Sources, and Structure
DOCRcase-200K consists of 212,424 parsing instances, each comprising a cropped document image, its ground-truth parse (a text string, table HTML, or LaTeX equation), and a set of labeled error types (empty for error-free cases); a minimal record sketch appears at the end of this subsection. The coverage includes:
- Text elements: 110,040 (51.8%)
- Table elements: 57,510 (27.1%)
- Equation elements: 44,874 (21.1%)
Cropped images total ≈84,000, distributed as 65% English-only and 35% mixed English/Chinese documents. The dataset spans diverse document genres sampled from:
- Full-page corpora: olmOCR-mix-1025 (5,000 pages from various domains), CDLA (Chinese document layouts)
- Augmented element pools: UniMER-1M (10,000 equations), internal repositories (10,000 tables)
- Application domains: scientific papers, technical reports, slide decks, newspapers, forms, webpages, and documents with multi-language captions (Zhang et al., 11 Dec 2025).
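The paper does not prescribe a canonical serialization for these instances; the dataclass below is a minimal sketch of the fields each instance carries, with names such as `image_path` and `error_types` chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DOCRCase:
    """Illustrative schema for one DOCRcase-200K instance; field names are hypothetical."""
    image_path: str                 # cropped element image
    element_type: str               # "text" | "table" | "equation"
    ground_truth: str               # text string, table HTML, or LaTeX equation
    error_types: List[str] = field(default_factory=list)  # empty for a "good case"

    @property
    def is_good_case(self) -> bool:
        return not self.error_types

# Example: a single-error table case
case = DOCRCase(
    image_path="crops/table_00042.png",
    element_type="table",
    ground_truth="<table><tr><td>a</td><td>b</td></tr></table>",
    error_types=["missing_table_column"],
)
assert not case.is_good_case
```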
2. Dataset Composition and Error Distribution
Each entry is annotated with a detailed error typology. Cases are distributed as follows:
- "Good Cases" (no error): 18.83%
- Text: 20,003/110,040
- Table: 10,000/57,510
- Equation: 10,000/44,874
- "Bad Cases" (single error): 65.17%
- "Bad Cases" (2–4 compound, non-interfering errors): 16.00%
The dataset reflects eleven main block types, including text blocks, text with inline formulas, titles, lists, captions, table bodies, and equations. Error-type coverage and frequencies were engineered for balanced representation; for example, "inline formula recognition error" appears in 24,279 cases (10.95%) (Zhang et al., 11 Dec 2025). The table below summarizes representative error types per element.
| Element Type | Count | Representative Error Types |
|---|---|---|
| Text | 110,040 | Text repetition, character error, missed formula |
| Table | 57,510 | Missing column/row, redundancy, cell content error |
| Equation | 44,874 | Structure, syntax, character error |
3. Data Generation, Annotation, and Error Taxonomy
3.1 Extraction and Case Construction
Element crops and ground truth are obtained via document layout detection (MinerU2.0-vlm) and then verified and refined with VLMs (e.g., Qwen2.5-VL-72B-Instruct). Additional equations and tables are incorporated directly from UniMER-1M and high-quality internal datasets.
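The construction flow can be summarized as layout detection followed by VLM verification. The sketch below captures only that control flow; `detect_layout`, `crop`, and `vlm_verify` are hypothetical stand-ins for the MinerU2.0-vlm and Qwen2.5-VL-72B-Instruct stages, not their actual APIs.

```python
def build_cases(page_image, detect_layout, crop, vlm_verify):
    """Hypothetical control flow for element extraction (Section 3.1).

    detect_layout : yields element bounding boxes, types, and raw parses
                    (stand-in for MinerU2.0-vlm)
    vlm_verify    : confirms or refines a ground-truth parse
                    (stand-in for Qwen2.5-VL-72B-Instruct)
    """
    cases = []
    for box, element_type, raw_parse in detect_layout(page_image):
        image_crop = crop(page_image, box)
        ground_truth = vlm_verify(image_crop, raw_parse)  # VLM-refined ground truth
        if ground_truth is not None:                      # drop crops the VLM rejects
            cases.append((image_crop, element_type, ground_truth))
    return cases
```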
3.2 Error Synthesis
Erroneous cases are generated via the following strategies (two rule-based perturbations are sketched after this list):
- Rule-based perturbations (formatting, span, character modifications)
- LLM-guided hallucinations and semantic/structural distortions (using Gemini 2.5 Flash)
- Real-world failure mining (complex failures from production parsers)
- Compound assembly (combining 2–4 non-interfering error types per sample)
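As an illustration of the rule-based perturbation strategy, the sketch below synthesizes two of the listed error types: a text repetition (duplicating a sentence) and a missing table column (dropping a `<td>` cell from each row). The specific rules are assumptions for illustration, not the paper's exact perturbation recipes.

```python
import random
import re

def inject_text_repetition(text: str, rng: random.Random) -> str:
    """Duplicate one sentence to synthesize a 'text repetition' error."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    i = rng.randrange(len(sentences))
    sentences.insert(i, sentences[i])
    return " ".join(sentences)

def drop_table_column(html: str, col: int) -> str:
    """Remove the col-th <td> from every row to synthesize a 'missing table column' error."""
    def drop(row_match):
        cells = re.findall(r"<td>.*?</td>", row_match.group(0))
        if col < len(cells):
            cells.pop(col)
        return "<tr>" + "".join(cells) + "</tr>"
    return re.sub(r"<tr>.*?</tr>", drop, html)

rng = random.Random(0)
print(inject_text_repetition("First point. Second point.", rng))
print(drop_table_column("<tr><td>a</td><td>b</td></tr>", 1))  # -> <tr><td>a</td></tr>
```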
3.3 Annotation
All cases are labeled according to a 28-type fixed error taxonomy, further grouped by element and error granularity. Annotation includes a "Chain-of-Checklist" (CoCL) reasoning trace for each error, auto-generated and verified for logical completeness.
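The paper specifies the CoCL idea rather than a serialization format; the record below is one plausible shape for an annotated case, with keys and step wording chosen purely for illustration.

```python
# Hypothetical shape of one annotated case with a Chain-of-Checklist (CoCL) trace.
# Keys and step wording are illustrative; the paper describes the idea, not this format.
annotation = {
    "case_id": 20,
    "element_type": "table",
    "error_types": ["missing_table_column"],
    "cocl_trace": [
        "Check 1: count <td> cells per row in prediction vs. ground truth",
        "Check 2: prediction has one fewer cell in every row -> column missing",
        "Conclusion: label 'missing_table_column'",
    ],
}
```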
3.4 Error Typology
- Text (17 types): Classification, formatting, span, character, and inline formula errors
- Table (6 types): Integrity, structure, and content errors
- Equation (5 types): Type, integrity, and content errors
Example Error Types (a compact encoding of the taxonomy follows this list)
- Text repetition
- Paragraph format error
- Table cell recognition lost
- Missing table column
- Partial displayed formula missing
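For programmatic multi-label annotation, the taxonomy can be encoded as a mapping from element type to error labels. The sketch below includes only the types named in this article; the label strings are illustrative renderings, and the omitted types are marked rather than invented.

```python
# Partial encoding of the 28-type taxonomy (17 text + 6 table + 5 equation).
# Only types named in this article are listed; comments mark the omitted remainder.
ERROR_TAXONOMY = {
    "text": [
        "text_repetition", "paragraph_format_error",
        "inline_formula_recognition_error",
        # ... 14 more text types (classification, formatting, span, character)
    ],
    "table": [
        "missing_table_column", "table_cell_recognition_lost",
        # ... 4 more table types (integrity, structure, content)
    ],
    "equation": [
        "partial_displayed_formula_missing",
        # ... 4 more equation types (type, integrity, content)
    ],
}
```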
4. Statistical Profiles and Illustrative Examples
For a dataset of total size $N = 212{,}424$, the empirical frequency of error type $i$ is $f_i = n_i / N$, where $n_i$ is the count for error type $i$. The good-case rate is $40{,}003 / 212{,}424 \approx 18.83\%$ and the single-error rate is approximately $65.17\%$; inline formula recognition error, for example, occurs in $10.95\%$ of cases.
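A few lines of Python reproduce the element-level shares and the good-case rate from the published counts:

```python
counts = {"text": 110_040, "table": 57_510, "equation": 44_874}
N = sum(counts.values())                  # 212,424 total instances
good = 20_003 + 10_000 + 10_000          # 40,003 error-free cases

for element, n in counts.items():
    print(f"{element}: {n / N:.1%}")      # 51.8%, 27.1%, 21.1%
print(f"good-case rate: {good / N:.2%}")  # 18.83%
```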
Representative examples include:
- Text repetition: Duplicate sentences in a paragraph (ID 8)
- Missing table column: Omitted `<td>` tags in the HTML table representation (ID 20)
- Partial displayed formula missing: Formula with omitted terms in the recognized LaTeX (ID 26)
5. Application in Model Training and Evaluation
DOCRcase-200K underpins DOCR-Inspector and similar frameworks by providing training and benchmarking data for fine-grained, automated parsing quality assessment. Empirical validation is performed on DOCRcaseBench (882 real-world, human-annotated cases), where models—such as DOCR-Inspector-7B—demonstrate case-F1 scores over 96% for text, 86% for tables, and 85% for equations, with an error-type F1 uplift exceeding 25 percentage points over alternative models (Zhang et al., 11 Dec 2025).
6. Evaluation Protocols and Metrics
Model evaluation against DOCRcase-200K employs the following metrics (a minimal scoring sketch follows the list):
- Case-level detection: Precision, recall, F1
- Error-type detection (multi-label): precision, recall, F1
- Pass@K: whether at least one of K sampled analysis chains is correct
- Distribution alignment: Pearson and Kendall tau correlations against ground-truth automated metrics (edit distance, TEDS, CDM)
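As a concrete reading of the multi-label error-type metrics, the sketch below computes micro-averaged precision, recall, and F1 over predicted versus gold error-type sets; micro-averaging is an assumption here, since the exact aggregation is not restated in this article.

```python
from typing import List, Set

def error_type_prf(gold: List[Set[str]], pred: List[Set[str]]):
    """Micro-averaged multi-label precision/recall/F1 over error-type sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious labels
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"text_repetition"}, {"missing_table_column", "cell_content_error"}]
pred = [{"text_repetition"}, {"missing_table_column"}]
print(error_type_prf(gold, pred))  # (1.0, 0.666..., 0.8)
```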
Refinement loops demonstrate that fine-grained error feedback derived from DOCRcase-200K provides the largest downstream gains in metrics such as edit distance and TEDS, compared to binary or no-guidance baselines.
7. Use Cases and Contextual Impact
DOCRcase-200K addresses the need for large-scale, error-labeled corpora that reflect the nuanced, multi-element parsing challenges encountered in real-world document digitization and analysis. Its detailed error codification enables:
- Fine-grained training of VLM-based document parsers
- Automated assessment and continual refinement of parsing system outputs
- Analysis of error patterns for system diagnosis and targeted improvements
- Benchmarking of parsing models for robustness, granularity, and domain coverage
A plausible implication is that DOCRcase-200K, through its taxonomy, scale, and annotation methodology, establishes a new standard for validation and development in document understanding research, enhancing reproducibility and systematic progress in the field (Zhang et al., 11 Dec 2025).