MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios

Published 30 Mar 2026 in cs.CV and cs.AI | (2603.28130v1)

Abstract: We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at https://github.com/Yuliang-Liu/MultimodalOCR.

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper establishes a systematic benchmark using 3,400 images across 17 languages to evaluate document parsing in both digital-born and photographed real-world scenarios.
It employs a robust annotation pipeline combining automated layout detection, OCR/VLM recognition, and strict human verification for high annotation fidelity.
Experimental results highlight significant performance drops in non-Latin scripts and photographed conditions, stressing the need for improved multilingual, context-aware models.

MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios

Motivation and Scope

MDPBench addresses significant gaps in document parsing research by establishing a systematic benchmark for multilingual and real-world photographed document parsing. While legacy benchmarks focus predominantly on well-formatted, digital-born documents from high-resource languages, MDPBench explicitly targets both digital and photographed documents across 17 languages, including low-resource scripts and diverse capture conditions. This benchmark is designed to serve as a rigorous standard for evaluating document parsing systems and general-purpose VLMs, emphasizing robustness, linguistic diversity, and practical deployment scenarios.

Figure 1: Overview of the MDPBench, illustrating the distribution of languages, document types, and conditions.

Dataset Construction and Annotation Pipeline

MDPBench comprises 3,400 document images, curated to maximize diversity in type, layout complexity, visual elements (e.g., formulas, tables, charts), and authenticity of real-world conditions. Electronic documents are sourced from global platforms encompassing academic papers, business reports, handwritten notes, archives, newspapers, textbooks, and comics in 17 languages. Samples undergo multi-stage filtering to eliminate low-quality or trivial documents, and are then converted to photographed data through printing/screens and capture under indoor/outdoor, variable lighting, deformation, and background variation scenarios.

Figure 2: Visualization of digital-born vs. photographed documents, capturing real-world variability.

Annotation follows a three-stage pipeline: initial layout detection via expert models (dots.ocr, PaddleOCR-VL), block-level recognition by multiple state-of-the-art OCR/VLMs, followed by consensus-based selection, comprehensive manual correction for layout, reading order, and element validation, and final strict human verification to ensure high annotation fidelity.

Figure 3: Pipeline of the rigorous annotation process, integrating automated modeling and manual review.

Evaluation Protocols

MDPBench adopts a page-level aggregation evaluation strategy to counter element-type distribution imbalances, which are severe in multilingual and photographed document contexts. This mitigates biases inherent in benchmarks such as OmniDocBench, where overall metrics may be disproportionately influenced by element types (e.g., formulas in English academic papers). For evaluation, Normalized Edit Distance (NED) is used for text and reading order; formulas rely on CDM to address syntactic variations; tables employ TEDS for structural verification. Benchmark splits are enforced for public (2720 samples) and private (680 samples) subsets to prevent targeted training and data leakage.

Experimental Results

A comprehensive evaluation across general VLMs, specialized VLMs, and pipeline systems exposes substantial performance differentials:

Closed-source models significantly outperform open-source alternatives. Gemini-3-Pro achieves 86.4% overall accuracy and sets the SOTA in 14 of 17 languages. The best open-source model, dots.mocr, achieves only 80.5%, revealing a 7.9% gap in photographed scenarios.
All models show pronounced degradation on photographed documents, averaging a 17.8% performance drop. Gemini-3-Pro is relatively robust yet still suffers a 5.3% loss under real-world imaging conditions.
Non-Latin-script languages are notably under-served. Models experience a 14.0% average drop compared to Latin scripts. MinerU2.5 and MonkeyOCR generalize well to Latin languages but drop below 10% on Arabic/Hindi, indicating severe script bias.
Figure 4: Example Gemini-3-Pro parsing results with errors in photographed document scenarios.

Error Analysis and Model Limitations

MDPBench enables fine-grained analysis of multilingual document parsing failure modes:

Language-specific errors: Hindi vowel diacritic neglect, Russian Cyrillic misclassification with Latin, Thai unspaced hallucinations that disrupt lexical segmentation.
Repetitive outputs and language drift: Models hallucinate repeated content or erroneously decode Vietnamese as Chinese, reflecting training data bias and insufficient linguistic modeling.
Reading order errors in right-to-left scripts: Models consistently misorder Arabic two-column documents, failing to handle script directionality.
Figure 5: Typical language-specific parsing errors across Hindi, Russian, and Thai.

Figure 6: Hallucination phenomena—repetitive outputs and language drift in document parsing.

Figure 7: Misreading order in Arabic documents; models overlook right-to-left and column structure.

Pipeline-based tools exhibit error accumulation, inefficiency, and context-induced hallucinations. End-to-end VLMs demonstrate competitive efficiency but are limited by long-context reasoning and SOTA proprietary training paradigms.

Implications and Future Perspectives

MDPBench exposes fundamental weaknesses in current document parsing approaches regarding multilingual robustness, photographed document reliability, and script diversity. The pronounced performance collapse in real-world and non-Latin scenarios indicates that both open-source and proprietary models require enhanced data coverage, linguistic modeling, and architectural adaptation. Practical deployments will demand improvements in visual encoding, context management, reading order prediction, and bias mitigation.

MDPBench will catalyze research in:

Unified multimodal architectures with improved script handling and layout reasoning
Augmentation of training data for under-represented scripts/photographed conditions
Advanced evaluation protocols for out-of-distribution robustness and cross-linguistic generalization
Benchmark-driven progress in deployment-ready document parsing for pre-training corpora, business/institutional applications, and global archival digitization

Conclusion

MDPBench establishes a rigorous foundation for evaluating and advancing multilingual, real-world document parsing. Its comprehensive coverage, high-quality annotation, and strict evaluation illuminate deficiencies in both open-source and proprietary models, especially for photographed and non-Latin-script documents. MDPBench will underpin future work in robust, inclusive document parsing and facilitate the development of VLMs with superior multilingual and practical deployment capabilities (2603.28130).

Markdown Report Issue