
DocPTBench: Parsing & Translation Benchmark

Updated 30 November 2025
  • DocPTBench is a comprehensive benchmark suite designed to assess document parsing and translation under realistic distortions from camera-captured documents.
  • It features three tiers—Original, Photographed, and Unwarped—with high-fidelity, human-verified annotations across diverse domains and multiple language pairs.
  • Empirical evaluations show significant accuracy drops in both parsing and translation when moving from digital-born images to in-the-wild captures, highlighting robustness gaps in current models.

DocPTBench is a comprehensive benchmark suite designed specifically to evaluate end-to-end document parsing and translation for photographed documents, introducing real-world geometric and photometric distortions absent from previous benchmarks. It targets both specialized document parsing pipelines and general-purpose multimodal LLMs (MLLMs) and provides high-fidelity, human-verified annotations across multiple document domains and languages. Empirical results demonstrate substantial degradation in both parsing and translation accuracy when moving from digital-born or scanned images to in-the-wild, camera-captured data, exposing the robustness gap in current models and pipelines (Du et al., 23 Nov 2025).

1. Rationale and Distinctive Challenges

Prevailing document parsing and translation benchmarks—such as OmniDocBench, DITrans, FoxPage, DoTA, and DIT700K—are constructed nearly exclusively from digital-born or high-quality scanned sources. These datasets implicitly assume ideal conditions: rectilinear page geometry, uniform lighting, and minimal noise. However, practical deployment scenarios frequently involve smartphone or camera-captured documents that exhibit:

  • Geometric distortions: perspective skew, page curvature, and foreshortening.
  • Photometric artifacts: uneven illumination, shadows, glare, color casts, motion blur, and defocus.
  • Real-world variability: variations in capture angle, distance, sensor noise, and background clutter.

Such uncontrolled acquisition conditions degrade both low-level text recognition (OCR) and high-level document structure parsing, propagating errors through subsequent translation or information extraction stages. Existing model pipelines—optimized for idealized input—demonstrate significant fragility under these distortions, underscoring the necessity for a dedicated "photographed document" benchmark that isolates and quantifies robustness deficits.

2. Dataset Structure, Domains, and Annotation

DocPTBench is built upon 981 core digital-born documents sourced from OmniDocBench, spanning diverse domains including invoices, forms, academic papers, magazines, and government documents. Each document is represented across three principal tiers:

  • Original: the base digital version (981 images).
  • Photographed: 1,381 images, comprising simulated geometric perturbations of the digital images and 400 physically re-photographed exemplars captured under four distinct lighting and angle regimes using various smartphone and DSLR cameras.
  • Unwarped: 1,381 images wherein each photographed version is post-processed through a commercial unwarping algorithm correcting geometric distortions only, providing an explicit test for the limits of geometry normalization.

The Original tier is bilingual (English and Chinese); translation annotations extend this to eight translation directions (En→Zh/De/Fr/Ru and Zh→En/De/Fr/Ru).

Annotation leverages and extends OmniDocBench protocols:

  • Parsing: text-region bounding boxes, line segmentation, formula markup, table structure (cell/row/column), and explicit reading order, all verified via a two-stage human review process.
  • Translation: source–target Markdown pairs generated by Qwen-Max and then curated by bilingual linguists for semantic, syntactic, and layout fidelity, including random quality spot checks and consistency audits.
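For illustration only, a single parsing annotation record could be represented along the following lines; the field names and structure here are hypothetical, not DocPTBench's actual schema, which ships with the dataset release.

```python
# Schematic of what one parsing annotation record could contain;
# field names and values are illustrative assumptions, not DocPTBench's schema.
example_annotation = {
    "doc_id": "omnidoc_0001",          # shared across Original / Photographed / Unwarped tiers
    "tier": "photographed",
    "regions": [
        {"type": "text",    "bbox": [112, 80, 860, 132],  "lines": ["Section heading ..."]},
        {"type": "formula", "bbox": [120, 400, 700, 460], "latex": r"E = mc^2"},
        {"type": "table",   "bbox": [100, 520, 900, 780],
         "cells": [[{"row": 0, "col": 0, "text": "Model"}]]},
    ],
    "reading_order": [0, 1, 2],        # indices into "regions"
}
```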

3. Task Definitions and Evaluation Protocols

DocPTBench defines two primary benchmarking tracks:

3.1 End-to-End Document Parsing

  • Input: image from any tier (Original / Photographed / Unwarped).
  • Output: serialized markup (Markdown) encoding text blocks, formulas, tables, and their reading order.
  • Metrics (adopted directly from OmniDocBench; a minimal computation sketch follows this list):
    • Levenshtein edit distance $\mathrm{ED}(s, t)$: lower is better.
    • Formula edit distance.
    • Table-Edit-Distance-based Similarity (TEDS), where

      $$\mathrm{TEDS} = 1 - \frac{\mathrm{TED}(T_p, T_{gt})}{\max(|T_p|, |T_{gt}|)}$$

      for predicted and ground-truth table trees $T_p$ and $T_{gt}$; higher values denote closer alignment.
    • Reading-order edit distance.
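As a concrete reading of these metrics, the following minimal Python sketch (not the official OmniDocBench/DocPTBench evaluation code) computes a length-normalized Levenshtein edit distance and applies the TEDS formula given a precomputed tree edit distance; in the real evaluation, the tree edit distance is computed over table trees with an algorithm such as APTED.

```python
# Minimal sketch of the parsing metrics; not the official evaluation code.

def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance ED(s, t)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string; lower is better.
    (Normalization by max length is an assumption for this sketch.)"""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))

def teds(tree_edit_distance: float, pred_size: int, gt_size: int) -> float:
    """TEDS = 1 - TED(T_p, T_gt) / max(|T_p|, |T_gt|); higher is better.
    The tree edit distance itself would come from a tree-edit algorithm
    (e.g., APTED) applied to predicted and ground-truth table trees."""
    return 1.0 - tree_edit_distance / max(pred_size, gt_size)

print(normalized_edit_distance("| a | b |", "| a | c |"))       # small mismatch
print(teds(tree_edit_distance=3, pred_size=20, gt_size=22))     # 1 - 3/22 ≈ 0.864
```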

3.2 Document Translation

  • Eight language pairs: En→Zh, En→De, En→Fr, En→Ru, Zh→En, Zh→De, Zh→Fr, and Zh→Ru.

  • Prompting strategies:

    • Simple: Direct instruction (e.g., "Translate this document image into German").
    • Chain-of-Thought (CoT): Split into "extract source-language Markdown" followed by "translate Markdown."
    • Text-only MT: Baseline using only the extracted Markdown, isolating the translation capability.
  • Metrics:

    • BLEU: n-gram overlap with brevity penalty,

      $$\mathrm{BLEU} = \mathrm{BP}\cdot\exp\!\Bigl(\frac{1}{N}\sum_{n=1}^{N}\log p_n\Bigr)$$

    • chrF: character-level F-score.
    • METEOR: unigram F-score with synonym and stemming matching, plus a fragmentation penalty.
    • STEDS: semantic-structural TEDS computed over Markdown trees, assessing multilingual structure preservation.
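To make the translation track concrete, the sketches below first show illustrative prompt templates for the three prompting setups (paraphrased for this summary, not the benchmark's exact wording), then a minimal, self-contained BLEU implementation matching the formula above; production evaluation would normally rely on established packages such as sacreBLEU.

```python
# Illustrative prompt templates; paraphrased, not DocPTBench's verbatim prompts.
PROMPTS = {
    # Simple: one direct instruction over the document image.
    "simple": "Translate this document image into {target_lang}, outputting the result as Markdown.",
    # Chain-of-Thought: two explicit steps, parse first, then translate.
    "cot_step1": "Extract the full content of this document image as {source_lang} Markdown, "
                 "keeping reading order, formulas, and tables.",
    "cot_step2": "Translate the following {source_lang} Markdown into {target_lang}, "
                 "preserving the Markdown structure:\n\n{markdown}",
    # Text-only MT baseline: same translation instruction, but applied to
    # already-extracted Markdown with no image involved, isolating MT quality.
    "text_only": "Translate the following {source_lang} Markdown into {target_lang}, "
                 "preserving the Markdown structure:\n\n{markdown}",
}
```

A minimal sentence-level BLEU, with clipped n-gram precisions and the brevity penalty BP, might look as follows:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU = BP * exp((1/N) * sum_n log p_n)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0                     # any empty precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 3))  # 1.0
```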

4. Model Baselines and Experimental Procedure

Expert document parsing models include PaddleOCR-VL, MinerU2.5, dots.ocr, MonkeyOCR, Deepseek-OCR, Dolphin, olmOCR & olmOCR2, OCRFlux, SmolDocling, and Nanonets-OCR.

General-purpose MLLMs benchmarked comprise Gemini 2.5-Pro, Qwen-VL-Max, GLM-4.5V, Kimi-VL, Doubao-Seed-1.6-Vision, and open-source variants Qwen3-VL-4B, Qwen2.5-VL-3B, InternVL3-2B/3.5-2B, and Qwen2.5-VL-72B.

For evaluation, document images in each tier are held out by document ID to ensure strict partitioning. Unwarped images undergo only geometric correction, retaining photometric artifacts. No in-domain fine-tuning is permitted; all results are zero/few-shot using publicly available models and inference scripts on NVIDIA A100 GPUs with FP16 arithmetic and batch size 1.
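The protocol can be summarized by the following schematic loop, in which `run_model`, `load_image`, and `load_reference` are hypothetical placeholders rather than DocPTBench's actual interfaces:

```python
# Schematic tier-wise evaluation loop; loaders and the model call are
# hypothetical placeholders, not DocPTBench's actual API.
from statistics import mean

TIERS = ["original", "photographed", "unwarped"]

def evaluate(doc_ids, load_image, load_reference, run_model, metric):
    """Score the same documents (keyed by document ID) in every tier, so that
    any degradation is attributable to the capture condition alone."""
    results = {}
    for tier in TIERS:
        scores = []
        for doc_id in doc_ids:
            image = load_image(tier, doc_id)   # same doc_id appears in each tier
            prediction = run_model(image)      # zero-shot inference, batch size 1
            scores.append(metric(prediction, load_reference(doc_id)))
        results[tier] = mean(scores)
    return results
```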

5. Empirical Findings and Error Analysis

5.1 Parsing and Translation Degradation

  • MLLMs exhibit an average 18% increase in Levenshtein Edit Distance for parsing when shifted from Original to Photographed images.
  • Expert document parsing models suffer a 25% average degradation.
  • End-to-end translation BLEU falls by ~12% under CoT prompting when comparing Photographed to Original images.
  • Chain-of-Thought prompts recover only 3–8 BLEU points over Simple prompting but remain substantially below text-only translation, demonstrating the bottleneck imposed by upstream OCR/parsing noise.

5.2 Impact of Unwarping

  • Geometric correction recovers 70–90% of the performance loss on geometry-sensitive metrics (e.g., table TEDS regains much of its score), but residual errors remain due to photometric artifacts such as blur and illumination variation, suggesting that photometric robustness is a key unsolved problem (a minimal geometry-only correction sketch follows this list).
  • Some MLLMs, when faced with Simple prompts, default to pure OCR output rather than true translation, revealing limitations in instruction following.
  • Translation errors stem from both noisy OCR/parsing input and the intrinsic neural MT quality of the models, with certain models (e.g., Doubao-Seed-1.6-Vision) displaying reasonable parsing but poor text-only translation.
  • Ablation with language pairs indicates that low-resource scripts (e.g., Cyrillic) exhibit even larger performance gaps under modality shift.
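For reference, the sketch below shows what a generic geometry-only correction looks like: a single homography maps detected page corners to a flat rectangle using OpenCV. This is not the commercial unwarping algorithm used to build the Unwarped tier (which must also handle page curvature via dense dewarping), and, as the results above indicate, such correction leaves photometric artifacts untouched.

```python
import cv2
import numpy as np

def unwarp_page(image: np.ndarray, corners: np.ndarray,
                out_w: int = 1240, out_h: int = 1754) -> np.ndarray:
    """Rectify a photographed page given its four corners in the order
    top-left, top-right, bottom-right, bottom-left.
    Only geometry is corrected: blur, glare, shadows, and uneven
    illumination are passed through unchanged."""
    src = corners.astype(np.float32)
    dst = np.array([[0, 0], [out_w - 1, 0],
                    [out_w - 1, out_h - 1], [0, out_h - 1]], dtype=np.float32)
    homography = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, homography, (out_w, out_h))
```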

Key Summary Table:

| Model Class | Parsing Edit Distance Increase | Translation BLEU Decrease |
|---|---|---|
| MLLMs | +18% | -12% |
| Expert parsing models | +25% | – |

Note: Table encapsulates average performance changes Original→Photographed. For specific model-by-model and scenario breakdowns, refer to full data in (Du et al., 23 Nov 2025).

6. Insights, Implications, and Methodological Lessons

  • Geometric distortions are the single largest error driver for most expert parsing systems, whose backbones assume axis-aligned input.
  • Photometric noise remains significant post-unwarping, especially for models lacking explicit robustness to blur, non-uniform illumination, or sensor artifacts.
  • Chain-of-Thought prompts decouple OCR and translation, partially mitigating instruction “collapse,” but token overhead and residual parsing errors still cap end-to-end translation results.
  • This suggests that dedicated model enhancements (e.g., in-the-wild distortion augmentation, robust features in the vision encoder) and pipeline adaptations (e.g., dynamic CoT, hybrid unwarping-plus-LLM approaches) may be required; a minimal augmentation sketch follows this list.
  • The dual bottleneck—compounded OCR errors propagating into MT, and intrinsic MT model limitations—underscores the need for vertically integrated training and evaluation pipelines if robust photographed document translation is to be achieved.
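As one hypothetical instantiation of the in-the-wild distortion augmentation mentioned above, standard torchvision transforms can inject both geometric and photometric perturbations during vision-encoder training; the specific transforms and magnitudes below are illustrative assumptions, not the paper's recipe.

```python
from torchvision import transforms

# Illustrative "in-the-wild" augmentation pipeline for document images
# (applied to PIL images); transform choices and magnitudes are assumptions.
document_augment = transforms.Compose([
    transforms.RandomPerspective(distortion_scale=0.4, p=0.7),   # perspective skew
    transforms.RandomRotation(degrees=5),                        # capture-angle variation
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.2, hue=0.05),            # illumination / color cast
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),    # defocus-like blur
    transforms.ToTensor(),
])
```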

7. Prospects for Extension and Community Utility

Future directions highlighted for DocPTBench include:

  • Expansion to video captures, burst sequences, and handwriting samples for coverage of temporal and script variability.
  • Addition of more languages, especially low-resource and non-Latin scripts.
  • Integration of advanced model-centric augmentation: training vision encoders with explicit geometric and photometric perturbations, and developing adaptive prompting strategies.

DocPTBench provides rigorous, open-access infrastructure for systematic, reproducible comparison of document parsing and translation models, especially under conditions representative of real-world deployment, moving the field beyond controlled, laboratory-only benchmarks.

All dataset resources and evaluation scripts are freely accessible at https://github.com/Topdu/DocPTBench (Du et al., 23 Nov 2025).
