
DOCR-Inspector: Fine-Grained Parsing Eval

Updated 18 December 2025
  • DOCR-Inspector is a vision-language model-based system that automates the fine-grained evaluation of document parsing outputs by detecting 28 distinct error types across text, tables, and equations.
  • It employs a novel VLM-as-a-Judge architecture with chain-of-checklist reasoning to systematically analyze and categorize parsing errors for actionable diagnostics.
  • Empirical benchmarks demonstrate its superior accuracy and interpretability over traditional aggregate metrics, enhancing document parsing and pipeline refinement.

DOCR-Inspector is a vision-language model (VLM)-based system for the fine-grained and automated evaluation of document parsing outputs, with particular focus on detecting and classifying parsing errors at an element level across text, tables, and mathematical formulas. By formalizing assessment as comprehensive error detection and leveraging a predefined taxonomy of 28 parsing error types, DOCR-Inspector addresses key limitations of prior benchmark-centric evaluation, offering actionable insight for both diagnostic analysis and refinement of document parsing pipelines (Zhang et al., 11 Dec 2025).

1. Problem Formulation and Motivation

The objective of document parsing is to convert unstructured images—such as PDFs, scientific papers, and scanned documents—into reliable semi-structured digital representations. Existing approaches typically rely on aggregate metrics (e.g., edit distance, TEDS, BLEU), which reduce diverse error modes to singular scores and frequently obscure nuanced model weaknesses. Such approaches also suffer from dataset-specific biases, leading to unstable model rankings and poor generalizability to in-the-wild scenarios. In practical information extraction pipelines where ground truth is unavailable, even isolated parsing errors can critically degrade downstream tasks, underscoring the need for detailed and robust evaluation methods.

DOCR-Inspector reframes evaluation as fine-grained error detection. Given an input document image $\mathrm{Doc}$ and a parser $M$, a set of element crops $\mathcal{C} = \{\mathrm{crop}_1, \ldots, \mathrm{crop}_n\}$ is extracted, and the parser produces outputs

$$\mathrm{pred}_{\mathrm{doc}} = M(\mathrm{Doc}, \mathcal{C}) = \{\mathrm{pred}_{\mathrm{crop}_1}, \ldots, \mathrm{pred}_{\mathrm{crop}_n}\}.$$

For each pair $(\mathrm{crop}_i, \mathrm{pred}_{\mathrm{crop}_i})$, the system detects and classifies all parsing errors into the finite taxonomy $\mathcal{E}$ of 28 types.
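
A minimal sketch of this formulation as a programmatic interface is given below; the names ErrorType, CropJudgement, inspect_crop, and inspect_document are illustrative assumptions, not part of the released system.

```python
from dataclasses import dataclass
from enum import Enum

class ErrorType(str, Enum):
    """Illustrative subset of the 28-type taxonomy E; not the full list."""
    SEGMENT_LOST = "segment_lost"
    CHAR_LOSS = "char_loss"
    MISSING_ROW_OR_COLUMN = "missing_row_or_column"
    SYNTAX_ERROR = "syntax_error"

@dataclass
class CropJudgement:
    crop_id: str
    errors: set[ErrorType]  # empty set means the element parse is "Good"

def inspect_crop(crop_id: str, crop_image: bytes, pred_text: str) -> CropJudgement:
    """Hypothetical per-element judge: map one (crop_i, pred_crop_i) pair to a subset of E."""
    raise NotImplementedError  # realized by the VLM judge in the actual system

def inspect_document(crops: dict[str, bytes], preds: dict[str, str]) -> list[CropJudgement]:
    # Evaluate every element crop of one parsed document independently.
    return [inspect_crop(cid, crops[cid], preds[cid]) for cid in crops]
```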

2. Framework and Reasoning Methodology

2.1. VLM-as-a-Judge Paradigm

DOCR-Inspector implements a VLM-as-a-Judge architecture. Its core components are:

  • A visual encoder (e.g., Vision Transformer) to extract image features from each element crop.
  • An LLM conditioned on these features and the corresponding parser output.
  • An alignment module bridging visual and textual modalities, allowing the LLM to directly "read" and reason over image content in comparison to the parsed output.

At inference time, the VLM is prompted to scrutinize the alignment between the element image and the parser output, explicitly identifying and classifying any errors present.
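
The following is a minimal sketch of such a judge call, assuming an OpenAI-compatible VLM endpoint; the prompt wording, function name, and client setup are illustrative rather than the paper's implementation.

```python
import base64
from openai import OpenAI  # any OpenAI-compatible VLM endpoint; illustrative only

JUDGE_INSTRUCTIONS = (
    "You are a document-parsing inspector. Compare the element image with the "
    "parser output below. List every parsing error you find, choosing only from "
    "the provided error-type taxonomy, and justify each finding."
)

def judge_element(client: OpenAI, model: str, crop_png: bytes,
                  pred_text: str, taxonomy: list[str]) -> str:
    """Send one (crop, prediction) pair to a VLM judge and return its raw verdict."""
    image_b64 = base64.b64encode(crop_png).decode()
    prompt = f"{JUDGE_INSTRUCTIONS}\n\nTaxonomy: {', '.join(taxonomy)}\n\nParser output:\n{pred_text}"
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```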

2.2. Chain-of-Checklist Reasoning

A distinctive aspect is the "Chain-of-Checklist" (CoCL) reasoning strategy, which eschews free-form reasoning (e.g., Chain-of-Thought) in favor of structured, checklist-driven analysis. Each element category (text, table, equation) has an associated set of $K$ checklist items $C_1, \ldots, C_K$, each corresponding to a discrete error mode. For text, the checklist includes up to 14 dimensional checks (such as paragraph formatting, list markers, and inline formula presence), with the model outputting structured triplets $(C_k, \mathrm{label}_k, \mathrm{reason}_k)$ indicating a binary outcome and justification per item.

This structuring supports comprehensive coverage of known failure modes and delivers interpretable, traceable error reports.
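
A sketch of how such triplets might be represented and parsed is shown below; the field names and the JSON output format are assumptions for illustration, not the system's exact schema.

```python
import json
from dataclasses import dataclass

@dataclass
class ChecklistFinding:
    item: str      # C_k, e.g. "inline formulas preserved"
    label: bool    # True => check passed, False => error detected
    reason: str    # model's justification for the verdict

def parse_checklist_output(raw: str) -> list[ChecklistFinding]:
    """Parse a judge response assumed to be a JSON list of (item, label, reason) triplets."""
    return [ChecklistFinding(**t) for t in json.loads(raw)]

# Example of the assumed structured output for a text element:
raw = json.dumps([
    {"item": "paragraph formatting", "label": True,
     "reason": "Line breaks match the image."},
    {"item": "inline formula presence", "label": False,
     "reason": "The inline formula in the second sentence is missing."},
])
detected_errors = [f for f in parse_checklist_output(raw) if not f.label]
```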

2.3. Error-Type Taxonomy

The system's taxonomy covers 28 mutually exclusive error types, partitioned as follows:

| Category | Error types (count) | Key examples |
|----------|---------------------|--------------|
| Text | 17 | misrecognized_as_table, list_format_error, segment_lost, char_loss, inline_formula_missed |
| Table | 6 | partial_redundancy, missing_row_or_column, merged_cell_error, cell_content_error |
| Equation | 5 | misrecognized_as_text, syntax_error, structure_error, char_error |

Each type is operationalized via concise criteria, e.g., "partial_redundancy" (spurious rows/columns inserted), "char_loss" (1-5 characters deleted), or "structure_error" (fraction, root, or matrix layout altered).
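
The taxonomy can be operationalized as a simple category-to-types mapping, as in the sketch below, which includes only the example types listed in the table above rather than all 28; the mapping name and helper are hypothetical.

```python
# Category-to-type mapping restricted to the example types named in the table above;
# the full taxonomy contains 28 types in total.
ERROR_TAXONOMY: dict[str, set[str]] = {
    "text": {"misrecognized_as_table", "list_format_error", "segment_lost",
             "char_loss", "inline_formula_missed"},
    "table": {"partial_redundancy", "missing_row_or_column",
              "merged_cell_error", "cell_content_error"},
    "equation": {"misrecognized_as_text", "syntax_error", "structure_error", "char_error"},
}

def validate_labels(category: str, detected: set[str]) -> set[str]:
    """Discard any predicted label that does not belong to the category's portion of E."""
    return detected & ERROR_TAXONOMY[category]
```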

3. Datasets and Training Regime

The principal training dataset, DOCRcase-200K, comprises 212,424 instances across text, tables, and equations, built from diverse sources:

  • 5K pages from olmOCR-mix-1025
  • 5K pages from CDLA
  • 10K isolated equations from UniMER-1M
  • 10K isolated tables from internal repositories

Ground-truth layouts derive from MinerU2.0-vlm, with textual cleanup by Qwen-72B. Synthetic diversity is induced by:

  • Rule-based perturbations (character/word drops, punctuation swaps)
  • LLM-guided perturbations (hallucinations, structural distortions, via Gemini-2.5 prompting)
  • Curated real-world failure cases

Data is balanced across "Good" (18.83%, no errors), single-error "Bad" (65.17%), and multi-error "Bad" (16.00%) examples. Training images are resized to a maximum of $1280 \times 1280$, with tokenized text lengths clipped to [256, 1280]. Synthetic label and reasoning-trace pairs are generated via few-shot prompting with Qwen-72B.
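
The rule-based perturbations can be illustrated with a short sketch; the operations and probabilities below are assumptions rather than the dataset's exact recipe.

```python
import random

def perturb_text(text: str, p_char_drop: float = 0.02, p_word_drop: float = 0.05,
                 p_punct_swap: float = 0.1, seed: int | None = None) -> str:
    """Inject synthetic parsing errors: character drops, word drops, punctuation swaps."""
    rng = random.Random(seed)
    words = []
    for word in text.split():
        if rng.random() < p_word_drop:
            continue  # simulate a lost word or segment
        word = "".join(c for c in word if rng.random() >= p_char_drop)  # simulate char_loss
        if word and word[-1] in ".,;:" and rng.random() < p_punct_swap:
            word = word[:-1] + rng.choice(".,;:")  # simulate a punctuation substitution
        words.append(word)
    return " ".join(words)

clean = "Tables are parsed into HTML; formulas are emitted as LaTeX."
noisy = perturb_text(clean, seed=0)
```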

4. Benchmarking and Empirical Evaluation

Empirical validation utilizes DOCRcaseBench, which consists of 882 real-world, element-level parsing cases, with triplicate expert annotation and exhaustive alignment to ground-truth via OmniDocBench matching. The set balances text (448 cases), table (242), and equation (192) elements and encompasses a spectrum of error types.

4.1. Baseline Comparisons

Competitor models include proprietary systems (GPT-4o, Gemini Flash, Gemini Pro Thinking) and open-source baselines (Qwen2.5-VL-7B, Qwen2.5-VL-72B, Qwen3-VL-235B), tested both with and without reasoning paradigms (CoT, reasoning chains).

4.2. Quantitative Results

DOCR-Inspector-7B achieves state-of-the-art results across both Good-vs-Bad classification (case F1) and fine-grained error-type detection (F1). For example, on the DOCRcaseBench suite:

  • Text: Case F1 96.43% (vs. 88.46% for Gemini 2.5 Pro), Error-type F1 80.21% (vs. 32.90%)
  • Table: Case F1 86.41% (vs. 82.01%), Error-type F1 62.11% (vs. 32.93%)
  • Equation: Case F1 85.42% (vs. 77.19%), Error-type F1 73.81% (vs. 48.58%)

A detailed error breakdown confirms that precision and recall for error-type detection both approach 81% on text elements.
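
These scores follow standard definitions; the sketch below assumes case F1 is the binary F1 over Good-vs-Bad verdicts and error-type F1 is micro-averaged over per-case sets of predicted versus annotated error types, which is one plausible reading of the benchmark protocol rather than a confirmed specification.

```python
def case_f1(pred_bad: list[bool], gold_bad: list[bool]) -> float:
    """Binary F1 for the Good-vs-Bad verdict, treating 'Bad' as the positive class."""
    tp = sum(p and g for p, g in zip(pred_bad, gold_bad))
    fp = sum(p and not g for p, g in zip(pred_bad, gold_bad))
    fn = sum(not p and g for p, g in zip(pred_bad, gold_bad))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def error_type_f1(pred_sets: list[set[str]], gold_sets: list[set[str]]) -> float:
    """Micro-averaged F1 over per-case sets of predicted vs. annotated error types."""
    tp = sum(len(p & g) for p, g in zip(pred_sets, gold_sets))
    fp = sum(len(p - g) for p, g in zip(pred_sets, gold_sets))
    fn = sum(len(g - p) for p, g in zip(pred_sets, gold_sets))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```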

4.3. Qualitative Analyses

There is a strong inverse correlation between the number of error types detected and standard parsing quality metrics (edit distance, TEDS, CDM), substantiating the external validity of DOCR-Inspector's assessments. In downstream refinement tasks, DOCR-Inspector-7B's feedback, when used to guide infilling or correction models, yields substantially greater improvements than unguided or binary feedback approaches.
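
One way to reproduce such an agreement check is to correlate per-element error counts with a reference metric; the sketch below uses Spearman rank correlation from SciPy on made-up numbers and is purely illustrative, not the paper's analysis script.

```python
from scipy.stats import spearmanr

# Detected error counts per element (from the judge) and a reference quality metric.
error_counts = [0, 1, 1, 3, 5, 2, 0, 4]
teds_scores  = [0.98, 0.91, 0.93, 0.74, 0.55, 0.82, 0.99, 0.61]  # higher is better

rho, p_value = spearmanr(error_counts, teds_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")  # strongly negative: more errors, lower TEDS
```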

5. Practical Integration and Impact

DOCR-Inspector is designed to function as both evaluation middleware and an actionable driver for system improvement. Key practical deployments include:

  • Quality gating: Automated segregation of Good vs. Bad element parses, triggering fallback strategies or human intervention for Bad cases.
  • Reward modeling: Utilization of F1/precision/recall outputs as optimization signals for RL-based parser tuning.
  • Error monitoring dashboards: Real-time tracking of error-type frequencies to surface systemic drift or emergent failure modes.

Its comprehensive, structured outputs are readily incorporated into diverse document digitization, information extraction, and scientific publishing pipelines.
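
As an illustration of the quality-gating deployment, the following sketch routes each element based on its detected error set; the severity split and routing labels are assumptions made for this example.

```python
SEVERE = {"segment_lost", "missing_row_or_column", "structure_error"}  # illustrative choice

def route_element(detected_errors: set[str]) -> str:
    """Hypothetical gating policy: accept clean parses, auto-repair minor errors, escalate severe ones."""
    if not detected_errors:
        return "accept"            # "Good" parse, pass downstream unchanged
    if detected_errors & SEVERE:
        return "human_review"      # severe structural loss: trigger fallback or manual correction
    return "auto_correct"          # minor errors: hand to a guided correction model with the error report

assert route_element(set()) == "accept"
assert route_element({"char_loss"}) == "auto_correct"
assert route_element({"segment_lost"}) == "human_review"
```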

6. Limitations and Prospects

DOCR-Inspector is constrained by its dependence on a predefined taxonomy; out-of-taxonomy or novel failure modes may elude detection. The VLM-as-a-Judge architecture may yield false positives, particularly if visual-text alignment is impaired. Prospective directions include:

  • Expansion of the taxonomy to encompass additional categories (e.g., graphics, footnotes, multilingual scripts)
  • Integration of online learning for continuous taxonomy adaptation based on discovered failure modes
  • Development of lightweight, on-device evaluators for resource-constrained deployments

In summary, DOCR-Inspector introduces a granular, interpretable, and automatable paradigm for document parsing evaluation, unifying VLM-based comparison, structured reasoning, and actionable fine-grained error classification, with demonstrated superiority over both commercial and open-source baselines across multiple metrics and benchmarks (Zhang et al., 11 Dec 2025).
