M$^{6}$Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis (2305.08719v2)

Published 15 May 2023 in cs.CV

Abstract: Document layout analysis is a crucial prerequisite for document understanding, including document retrieval and conversion. Most public datasets currently contain only PDF documents and lack realistic documents. Models trained on these datasets may not generalize well to real-world scenarios. Therefore, this paper introduces a large and diverse document layout analysis dataset called $M^{6}Doc$. The $M^6$ designation represents six properties: (1) Multi-Format (including scanned, photographed, and PDF documents); (2) Multi-Type (such as scientific articles, textbooks, books, test papers, magazines, newspapers, and notes); (3) Multi-Layout (rectangular, Manhattan, non-Manhattan, and multi-column Manhattan); (4) Multi-Language (Chinese and English); (5) Multi-Annotation Category (74 types of annotation labels with 237,116 annotation instances in 9,080 manually annotated pages); and (6) Modern documents. Additionally, we propose a transformer-based document layout analysis method called TransDLANet, which leverages an adaptive element matching mechanism that enables query embedding to better match ground truth to improve recall, and constructs a segmentation branch for more precise document image instance segmentation. We conduct a comprehensive evaluation of $M^{6}Doc$ with various layout analysis methods and demonstrate its effectiveness. TransDLANet achieves state-of-the-art performance on $M^{6}Doc$ with 64.5% mAP. The $M^{6}Doc$ dataset will be available at https://github.com/HCIILAB/M6Doc.

Authors (9)

Hiuyi Cheng (2 papers)
Peirong Zhang (10 papers)
Sihang Wu (2 papers)
Jiaxin Zhang (105 papers)
Qiyuan Zhu (7 papers)
Zecheng Xie (12 papers)
Jing Li (621 papers)
Kai Ding (29 papers)
Lianwen Jin (116 papers)

Summary

An Analysis of the M$^{6}$Doc Dataset and TransDLANet for Document Layout Analysis

The paper "M$^{6}$Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis" contributes both a dataset and a model architecture to document layout analysis (DLA). The dataset targets a gap in existing public datasets, most of which contain only PDF documents and few realistic ones, and the model is a transformer-based method designed to improve recall and segmentation precision.

The M$^{6}$Doc Dataset

The primary contribution of this work is the M$^{6}$Doc dataset. It is designed to be large and diverse, and its name refers to six properties:

Multi-Format: Includes diverse document forms such as scanned, photographed, and PDF files, reflecting the variance found in real-world scenarios.
Multi-Type: Comprises numerous document genres, such as scientific articles, textbooks, newspapers, and notes, allowing for validation across varied content types.
Multi-Layout: Covers rectangular, Manhattan, non-Manhattan, and multi-column Manhattan layouts.
Multi-Language: Covers Chinese and English documents, addressing the domain shift that a change of language can cause.
Multi-Annotation Category: Provides an extensive set of 74 annotation types, supporting fine-grained analysis.
Modern Documents: Focused on contemporary documents to ensure relevance to current standards of document production.

With 237,116 annotation instances across 9,080 manually annotated pages, this dataset lays a foundation for advanced DLA research and application.
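To give a rough sense of annotation density, the totals quoted above can be turned into simple averages. The following is a minimal sketch of that arithmetic only; the variable names are illustrative, and the per-category figure is a plain mean that says nothing about how instances are actually distributed across the 74 categories.

```python
# Totals quoted in the abstract.
num_instances = 237_116
num_pages = 9_080
num_categories = 74

# Simple averages derived from those totals.
instances_per_page = num_instances / num_pages
instances_per_category = num_instances / num_categories  # a mean only; the real distribution may be uneven

print(f"{instances_per_page:.1f} instances per page on average")
print(f"{instances_per_category:.0f} instances per category on average")
```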

TransDLANet Methodology

The paper also proposes TransDLANet, a transformer-based model for document layout analysis. Its main design choices are:

Utilization of a transformer encoder without positional encoding to capture the correlation among various document instances.
An adaptive element matching mechanism designed to enhance recall by aligning query embeddings closely with ground truth.
Incorporation of a dynamic interaction decoder to improve the precision of segmentation through robust fusion of region-of-interest (RoI) and image features.
Implementation of shared multi-layer perceptron branches for efficient multi-task learning.
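One consequence of dropping positional encoding is that self-attention treats its inputs as an unordered set: permuting the input embeddings only permutes the outputs. The sketch below illustrates that general property with a single attention head and identity query/key/value projections. It is not TransDLANet's encoder; the function name, the toy embeddings, and the identity projections are all assumptions made for illustration.

```python
import math

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        # Numerically stable softmax over the scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Attention output: weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs

# Three hypothetical query embeddings (no positional information added).
queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
perm = [2, 0, 1]

out = self_attention(queries)
out_permuted = self_attention([queries[i] for i in perm])

# Row i of the permuted run should match row perm[i] of the original run.
is_equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_permuted[i], out[p])
)
print(is_equivariant)
```

Adding a positional term to each embedding before attention would generally break this check, which is why the absence of positional encoding suits inputs that have no inherent order.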

TransDLANet achieves state-of-the-art results on the M$^{6}$Doc dataset with an mAP of 64.5%, ahead of the other layout analysis methods evaluated in the paper.
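mAP figures of this kind are typically built on intersection-over-union (IoU) between predicted and ground-truth regions. The summary does not state the exact evaluation protocol, so the sketch below only shows the box IoU building block, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the function name and the example boxes are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes yield no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical predicted and ground-truth boxes for one layout element.
pred = (10, 10, 110, 60)
gt = (20, 10, 120, 60)
iou = box_iou(pred, gt)
print(f"IoU = {iou:.3f}")
```

A prediction usually counts as a match only when its IoU with a ground-truth element of the same category exceeds a chosen threshold, and average precision is then computed per category and averaged.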

Implications and Future Directions

The implications of M$^{6}$Doc and TransDLANet are both practical and theoretical. On a practical level, the dataset's diversity positions it as a benchmark for developing models that must generalize across document types and formats. The detailed annotation categories support applications in logical layout analysis, formula recognition, and table analysis.

Theoretically, the work lays groundwork for integrating visual and semantic features in multi-modal DLA tasks. Future work could refine TransDLANet to improve recall and performance on handwritten documents, strengthening its ability to handle diverse real-world scenarios.

In conclusion, M$^{6}$Doc and TransDLANet advance document layout analysis by pairing a comprehensive dataset with a transformer-based method, providing a basis for further work in the domain.
