DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis (2206.01062v1)

Published 2 Jun 2022 in cs.CV and cs.LG

Abstract: Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied on more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.

Authors (5)
  1. Birgit Pfitzmann (2 papers)
  2. Christoph Auer (15 papers)
  3. Michele Dolfi (25 papers)
  4. Ahmed S. Nassar (1 paper)
  5. Peter W. J. Staar (3 papers)
Citations (65)

Summary

  • The paper presents a large human-annotated dataset of 80,863 pages spanning diverse document types beyond scientific articles.
  • It leverages manual annotation in the COCO format across 11 layout classes to capture real-world layout variability.
  • Benchmark tests with models such as Faster R-CNN, Mask R-CNN, and YOLOv5 reveal a gap of roughly 10% relative to human inter-annotator agreement, underscoring opportunities for further research.

Insights into DocLayNet: A Comprehensive Dataset for Document Layout Analysis

The paper "DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis" addresses a critical challenge in the field of document conversion: achieving accurate document layout analysis across diverse and complex layouts. Existing datasets like PubLayNet and DocBank, which primarily utilize data from scientific articles, suffer from limited layout variability. This paper introduces DocLayNet, a publicly available dataset designed to enhance the robustness and accuracy of document layout analysis models by incorporating a wider range of document types with greater complexity.

Dataset Composition and Features

DocLayNet distinguishes itself by offering manually annotated layout data for 80,863 pages drawn from document sources beyond the scientific domain, including technical manuals, financial reports, legal documents, patents, and government tenders. The annotations are provided in COCO format and cover 11 distinct layout classes, including Caption, Footnote, and Section-header. Human annotation ensures that the dataset captures natural layout variability, which is often compromised in automatically generated datasets.
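
The following is a minimal sketch of inspecting COCO-format annotations such as DocLayNet's with pycocotools; the local file path is an assumption and not prescribed by the paper.

```python
# Minimal sketch: browsing DocLayNet's COCO-format annotations with pycocotools.
# The path below is an assumption about where the files were downloaded.
from pycocotools.coco import COCO

coco = COCO("DocLayNet/COCO/val.json")

# The 11 layout classes (Caption, Footnote, Section-header, ...) are stored
# as COCO categories.
for cat in coco.loadCats(coco.getCatIds()):
    print(cat["id"], cat["name"])

# Labelled bounding boxes for one page.
img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
    print(coco.loadCats(ann["category_id"])[0]["name"], (x, y, w, h))
```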

Furthermore, DocLayNet includes double- and triple-annotated subsets of pages to quantify inter-annotator agreement; in the reported experiments, trained models fall roughly 10% short of the agreement scores reached between human annotators.
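
Conceptually, agreement on the double-annotated pages can be scored like a detection benchmark: one annotator's boxes serve as ground truth and the other's as "predictions". The sketch below illustrates this idea with pycocotools; the file names and the constant confidence score are assumptions, not the paper's exact tooling.

```python
# Sketch: inter-annotator agreement expressed as COCO mAP between two
# annotation sets (file names are hypothetical).
import json

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotator_A.json")          # reference annotations
with open("annotator_B.json") as f:
    second = json.load(f)

# COCOeval expects detection-style results, so give every box a dummy score.
results = [
    {"image_id": a["image_id"], "category_id": a["category_id"],
     "bbox": a["bbox"], "score": 1.0}
    for a in second["annotations"]
]
coco_dt = coco_gt.loadRes(results)

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()                              # prints mAP@[0.5:0.95], etc.
```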

Performance Baselines

The paper evaluates several object detection models, including Faster R-CNN, Mask R-CNN, and YOLOv5, to establish baseline performance metrics. These models show a notable gap relative to human inter-annotator agreement, highlighting the complexity and challenge posed by the DocLayNet dataset. YOLOv5 emerged as the strongest baseline in these experiments, outperforming the human agreement scores on certain labels such as Text, Table, and Picture.
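
As an illustration of how such a baseline could be reproduced, the sketch below fine-tunes a COCO-pretrained Mask R-CNN on DocLayNet with Detectron2. The dataset paths, backbone choice, and hyperparameters are assumptions for illustration, not the paper's exact training setup.

```python
# Sketch: fine-tuning a Mask R-CNN baseline on DocLayNet with Detectron2.
# Paths and hyperparameters are illustrative assumptions.
import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the COCO-format splits (paths are assumptions).
register_coco_instances("doclaynet_train", {}, "DocLayNet/COCO/train.json", "DocLayNet/PNG")
register_coco_instances("doclaynet_val", {}, "DocLayNet/COCO/val.json", "DocLayNet/PNG")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("doclaynet_train",)
cfg.DATASETS.TEST = ("doclaynet_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 11   # the 11 DocLayNet layout classes
cfg.SOLVER.IMS_PER_BATCH = 4
cfg.SOLVER.BASE_LR = 2.5e-4
cfg.SOLVER.MAX_ITER = 90000            # illustrative; tune for convergence
cfg.OUTPUT_DIR = "./output_doclaynet"
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```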

Implications and Future Directions

The introduction of DocLayNet has substantial implications for both practical and theoretical advancements in document layout analysis. By providing a diverse dataset with high variability in layout styles, the research community can develop models that are better equipped to handle the wide-ranging document types encountered in real-world applications. This is crucial for industries reliant on accurate digitization and analysis of document content.

Looking ahead, the paper suggests that future work should focus on enhancing data consistency and exploring novel data augmentation strategies to further improve model performance. The expanded use of DocLayNet-trained models may facilitate more resilient layout prediction methodologies that are applicable across various domains and document types.

Overall, the DocLayNet dataset represents a significant contribution to the field, offering a robust foundation for developing more generalizable and accurate document layout analysis techniques. The dataset's availability will undoubtedly catalyze innovation and progress in machine learning approaches to document processing.