LogicsParsingBench Evaluation Dataset
- LogicsParsingBench is a curated evaluation dataset for end-to-end LVLM document parsing, offering 1,078 diverse page-level PDF samples with challenging real-world layouts.
- The accompanying Logics-Parsing model is trained in two stages, combining supervised fine-tuning with layout-centric reinforcement learning to optimize text, layout, and reading-order prediction.
- The benchmark uses orthogonal metrics such as Normalized Edit Distance, Table TEDS, and Reading Order Edit, enabling granular error analysis across varied document types.
LogicsParsingBench is a curated evaluation dataset introduced to rigorously benchmark large-scale document parsing models, especially those built on LVLM architectures and end-to-end learning paradigms. It comprises 1,078 page-level PDF images sampled for high variability and challenging layout designs across nine major categories and over twenty sub-categories, including scientific articles, technical reports, newspapers and magazines, music scores, formula-dense scientific papers, chemical documents, ancient Chinese texts, and other heterogeneous formats. The benchmark is specifically designed to stress models' capabilities in recognizing text within visually complex, multi-column, or scientific documents rich in tables, formulas, and specialized notation, and it demands robust reading-order and layout analysis beyond mere OCR.
1. Construction and Content of LogicsParsingBench
LogicsParsingBench selects real-world document samples to encompass maximal diversity in spatial layout, content type, and linguistic style. Its nine categories include academic articles, technical reports, popular periodicals, STEM-heavy scientific papers, chemical formula documents, music notation, ancient Chinese texts, and several additional sub-domains known for structural heterogeneity. Each sample is annotated at the page level to facilitate scenario analysis for layout recognition (e.g., multi-column articles, nested tables, dense formula passages, data-rich chemical diagrams, and handwritten Chinese script).
The rationale for this selection is to provide a test corpus wherein parsing models must jointly address variability in content (language, formula, script), spatial arrangement (columns, blocks, tables), and non-standard reading order. Such coverage is essential for advancing document analysis systems capable of production deployment in scientific or enterprise environments.
2. Model Architecture and Training Methodology
Evaluation on LogicsParsingBench centers on the Logics-Parsing model, built atop a strong vision-language backbone (Qwen2.5-VL-7B-Instruct). Training proceeds in two stages:
- In the first stage, supervised fine-tuning (SFT) employs next-token prediction to acquire preliminary spatial grounding and OCR skills over normal text, mathematical formulas, tabular data (e.g., FinTabNet, TNCR, PubTabNet sources), chemical formula images (from ChEBI-20-MM), and manually annotated handwritten Chinese.
- In the second stage, Layout-Centric Reinforcement Learning (LC-RL) introduces a specialized reward-driven training regime. Group Relative Policy Optimization (GRPO) is used to refine the model’s global layout and reading order prediction.
The RL reward function is expressed as

$R = \lambda_{\text{text}} R_{\text{text}} + \lambda_{\text{bbox}} R_{\text{bbox}} + \lambda_{\text{order}} R_{\text{order}}$,

where $R_{\text{text}}$ penalizes the normalized Levenshtein edit distance of the full-page output, $R_{\text{bbox}}$ scores the geometric precision of bounding-box detection, and $R_{\text{order}}$ penalizes paragraph/discourse sequence inversions against the ground-truth reading order. The weights $(\lambda_{\text{text}}, \lambda_{\text{bbox}}, \lambda_{\text{order}})$ are calibrated to emphasize the logical consistency of reading flow and structural coherence alongside textual accuracy.
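Since the exact term definitions and weight values are not reproduced above, the following Python sketch only illustrates the general shape of such a composite, layout-centric reward and the group-relative advantage normalization that GRPO applies to it; the function names, weights, and toy scores are illustrative assumptions rather than the released implementation.

```python
# Illustrative sketch only: how three per-page scores (text accuracy,
# bounding-box precision, reading-order consistency) might be combined into
# a single scalar reward and turned into GRPO-style group-relative advantages.
# Weights and names are assumptions, not the released implementation.
from statistics import mean, pstdev


def composite_reward(text_score: float, bbox_score: float, order_score: float,
                     weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted sum of the three reward terms; each score lies in [0, 1]."""
    w_text, w_bbox, w_order = weights
    return w_text * text_score + w_bbox * bbox_score + w_order * order_score


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO normalizes each rollout's reward against its sampling group:
    A_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: rewards for a group of rollouts sampled from the same page.
group = [composite_reward(0.92, 0.81, 0.95),
         composite_reward(0.88, 0.79, 0.60),
         composite_reward(0.95, 0.85, 0.97)]
advantages = group_relative_advantages(group)
```

In GRPO, these advantages scale the policy-gradient update for each rollout, so pages whose predicted layout and reading order beat the group average are reinforced while those below it are suppressed.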
3. Evaluation Metrics and Protocols
LogicsParsingBench introduces several orthogonal metrics to quantify parsing efficacy:
- Normalized Edit Distance (NED): Computed as the normalized Levenshtein distance between concatenated ground-truth and predicted text at the page level, with headers/footers omitted to avoid over-penalizing segmentation variance.
- Formula Edit and Chemistry Edit: For formula-dense documents and chemical diagrams, parsed LaTeX strings (formulas) and SMILES strings (chemistry) are evaluated for both syntactic and semantic correctness.
- Table TEDS and Table Edit: For tables, the Table TEDS metric assesses structural similarity while edit distance is used for direct text-based comparisons. Custom normalization for LaTeX tabular expressions mitigates bias from formatting discrepancies.
- Reading Order Edit: Quantifies sequence correctness in block/paragraph ordering, especially for non-linear layouts (multi-column, dense figure placements).
This multi-faceted evaluation protocol enables granular analysis, allowing for identification of model weaknesses in spatial understanding, order sensitivity, or semantic preservation.
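To make the page-level metrics concrete, here is a minimal, self-contained sketch of a normalized edit distance and a block-level reading-order edit; it assumes plain strings and block-ID sequences as inputs and does not reproduce the benchmark's official normalization rules (header/footer exclusion, LaTeX table normalization, and so on).

```python
# Minimal sketch of two of the benchmark's metric families, assuming simple
# inputs; the official normalization rules are not reproduced here.


def levenshtein(a, b) -> int:
    """Dynamic-programming edit distance over characters or block IDs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (x != y)))     # substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred_text: str, gt_text: str) -> float:
    """Page-level NED: edit distance divided by the longer sequence length."""
    return levenshtein(pred_text, gt_text) / max(len(pred_text), len(gt_text), 1)


def reading_order_edit(pred_blocks: list[str], gt_blocks: list[str]) -> float:
    """Reading-order edit: the same edit distance, but over the sequence of
    block identifiers in predicted vs. ground-truth reading order."""
    return levenshtein(pred_blocks, gt_blocks) / max(len(pred_blocks), len(gt_blocks), 1)


# Toy usage on a two-column page whose second column was read first.
gt = ["title", "col1-p1", "col1-p2", "col2-p1", "col2-p2"]
pred = ["title", "col2-p1", "col2-p2", "col1-p1", "col1-p2"]
print(normalized_edit_distance("Hello world", "Helo world"))  # ~0.09
print(reading_order_edit(pred, gt))                           # 0.8
```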
4. Model Performance and Comparative Findings
Experiments demonstrate that the Logics-Parsing model achieves state-of-the-art (SOTA) performance across LogicsParsingBench's categories. Notably, in aggregate edit-distance measures on both English and Chinese PDF samples, Logics-Parsing outperforms traditional pipeline systems and baseline LVLMs, with particularly marked improvements on the reading-order and layout-sensitive metrics. The RL-augmented stage is crucial: layout-centric optimization markedly improves paragraph sequencing in multi-column and figure-dense layouts, a failure mode of prior LVLMs that rely on heuristic reading sequences or token-level supervision alone.
In formula and table structure recognition, Logics-Parsing achieves lower error rates than classical OCR-centric systems, owing to the explicit integration of specialized annotated data during SFT. For chemical formula recognition, explicit SMILES string evaluation reflects the model's capacity to generalize across domain-specific syntactic conventions.
5. Specialized Data Types and Annotation
Unlike general document parsing benchmarks, LogicsParsingBench includes challenging annotations for:
- Mathematical formulas: Rendered in LaTeX, with normalization to account for equivalent mathematical expressions.
- Tables: Annotated for both content accuracy and structural fidelity (e.g., cell alignment, multi-row/column merging).
- Chemical diagrams: Parsed to SMILES representation for exact match evaluation against ground truth molecular structures.
- Handwritten Chinese: Annotated at character and block level, presenting substantial challenges in joint OCR and layout parsing due to variable stroke order and placement.
This comprehensive annotation protocol supports detailed error analysis and advances the state of evaluation in scientific and multilingual document parsing contexts.
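As an illustration of the semantic (rather than purely string-level) check that SMILES-based evaluation enables, the sketch below canonicalizes predicted and ground-truth SMILES with RDKit before comparing them; this shows the general idea only and is not the benchmark's official Chemistry Edit scorer.

```python
# Sketch: semantic comparison of predicted vs. ground-truth SMILES by
# canonicalizing both with RDKit before an exact-match check. Illustrative
# only; not the benchmark's released evaluation code.
from rdkit import Chem


def canonical_smiles(smiles: str) -> str | None:
    """Return RDKit's canonical SMILES, or None if the string does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def smiles_match(pred: str, gt: str) -> bool:
    """True when both strings parse and denote the same molecule."""
    cp, cg = canonical_smiles(pred), canonical_smiles(gt)
    return cp is not None and cp == cg


# Toy usage: different surface forms of the same molecule should match.
print(smiles_match("OCC", "CCO"))               # True: ethanol
print(smiles_match("c1ccccc1", "C1=CC=CC=C1"))  # True: benzene, aromatic vs. Kekulé
```

Canonicalization makes the comparison invariant to equivalent surface forms, analogous to the normalization applied to LaTeX formulas and tabular expressions elsewhere in the benchmark.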
6. Direct Impact and Research Directions
LogicsParsingBench establishes a standardized testbed for document parsing, enabling reproducible and comparable evaluation of end-to-end LVLM models across real-world scenarios. Its design highlights that high-quality layout and reading order parsing require reinforcement learning-based global optimization rather than token-level autoregressive supervision alone.
The findings suggest that incorporating multiple reward signals—textual, spatial, and sequential—delivers significant gains in reading order and complex layout fidelity. There remain unsolved challenges in extremely dense scientific layouts and handwritten content, motivating further advances in model architecture, data augmentation for rare layout types, and reward engineering.
The LogicsParsingBench release (see https://github.com/alibaba/Logics-Parsing) provides resources for continued development, with the dataset, codebase, and evaluation protocol publicly documented for extension and benchmarking in future document analysis research.