
PubLayNet: largest dataset ever for document layout analysis (1908.07836v1)

Published 16 Aug 2019 in cs.CL

Abstract: Recognizing the layout of unstructured digital documents is an important step when parsing the documents into structured machine-readable format for downstream applications. Deep neural networks that are developed for computer vision have been proven to be an effective method to analyze the layout of document images. However, document layout datasets that are currently publicly available are several magnitudes smaller than established computer vision datasets. Models have to be trained by transfer learning from a base model that is pre-trained on a traditional computer vision dataset. In this paper, we develop the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated. The experiments demonstrate that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles. The pre-trained models are also a more effective base model for transfer learning on a different document domain. We release the dataset (https://github.com/ibm-aur-nlp/PubLayNet) to support development and evaluation of more advanced models for document layout analysis.

Authors (3)
  1. Xu Zhong (7 papers)
  2. Jianbin Tang (12 papers)
  3. Antonio Jimeno Yepes (23 papers)
Citations (410)

Summary

Analysis of "PubLayNet: largest dataset ever for document layout analysis"

The paper "PubLayNet: largest dataset ever for document layout analysis" introduces a substantial contribution to document layout analysis by presenting the PubLayNet dataset. This dataset stands as the most extensive resource for document layout analysis, enabling advanced training of deep learning models. The researchers have constructed PubLayNet by leveraging the XML and PDF representations of documents from PubMed Centralâ„¢, resulting in more than 360,000 annotated document images.

Dataset Construction and Annotation

PubLayNet is an automatically generated dataset created by matching the XML representations of over one million articles with the content of their corresponding PDFs. The annotations cover typical layout elements: text, titles, lists, figures, and tables. The automated annotation process reconciles discrepancies between the XML and PDF representations, and the resulting annotations were verified to be approximately 99% accurate.
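
To make the matching idea concrete, the sketch below pairs each XML paragraph with the most textually similar block extracted from the rendered PDF page and reuses that block's bounding box as the layout annotation. This is a minimal illustration of the approach, not the authors' pipeline: the function name, the similarity threshold, and the (text, bbox) block format are assumptions.

```python
# Hypothetical sketch of XML-to-PDF matching: for each XML paragraph, find the
# PDF text block whose content is most similar and reuse its bounding box.
from difflib import SequenceMatcher

def best_matching_block(xml_paragraph, pdf_blocks, threshold=0.9):
    """pdf_blocks: iterable of (text, bbox) tuples extracted from one PDF page.
    Returns the bbox of the block most similar to the XML paragraph text,
    or None if no block clears the similarity threshold."""
    best_bbox, best_score = None, 0.0
    for text, bbox in pdf_blocks:
        score = SequenceMatcher(None, xml_paragraph, text).ratio()
        if score > best_score:
            best_bbox, best_score = bbox, score
    return best_bbox if best_score >= threshold else None

# Example: a "title" annotation would be the bbox matched to the XML <article-title> text.
page_blocks = [("PubLayNet: largest dataset ever ...", (36.0, 52.0, 559.0, 90.0))]
print(best_matching_block("PubLayNet: largest dataset ever ...", page_blocks))
```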

Experimental Results and Model Performance

The authors conducted several experiments with state-of-the-art deep learning models, specifically Faster R-CNN and Mask R-CNN. Both models recognized document layouts with high precision, achieving macro-average MAP (mean average precision) scores exceeding 0.9, with Mask R-CNN performing slightly better than Faster R-CNN. The dataset was split into training, development, and testing subsets, and the models generalized well to unseen document templates.
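
As a rough illustration of this setup, the sketch below builds a Mask R-CNN for PubLayNet's five layout categories (text, title, list, table, figure) plus background using torchvision's detection models. This is a minimal sketch under assumed tooling; the paper's experiments are not tied to this particular library.

```python
# Minimal sketch: adapt torchvision's Mask R-CNN to the 5 PubLayNet layout
# categories (text, title, list, table, figure) plus background.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 6  # 5 PubLayNet categories + background

def build_publaynet_maskrcnn():
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Replace the box-classification head.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
    # Replace the mask head.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, NUM_CLASSES)
    return model

model = build_publaynet_maskrcnn()
# Training would then iterate over PubLayNet pages loaded from its COCO-style
# annotations, e.g.: losses = model(images, targets)
```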

Transfer Learning and Domain Application

The paper explores transfer learning, showing that models pre-trained on PubLayNet adapt effectively to other document domains, including the ICDAR 2013 Table Recognition Competition. Fine-tuning these models achieved performance on par with the state of the art despite using significantly fewer training samples (170 samples). The paper also evaluated domain adaptation to insurance-related documents: fine-tuning models pre-trained on PubLayNet substantially improved performance, outperforming models initialized from general image datasets such as ImageNet and COCO.
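
The sketch below shows what this transfer-learning recipe could look like: load weights trained on PubLayNet, swap the prediction heads for a table-only target domain (background + table), and fine-tune on the small ICDAR training set. The checkpoint filename and hyperparameters are assumptions for illustration, not the authors' released artifacts.

```python
# Hedged transfer-learning sketch: PubLayNet-pretrained Mask R-CNN, re-headed
# for a 2-class table-detection domain, then fine-tuned on a small dataset.
import torch
import torchvision
from torch.optim import SGD
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None, num_classes=6)
model.load_state_dict(torch.load("publaynet_maskrcnn.pth", map_location="cpu"))  # hypothetical checkpoint

# Re-head for the new domain: background + table.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, 2)
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, 2)

optimizer = SGD([p for p in model.parameters() if p.requires_grad],
                lr=1e-3, momentum=0.9, weight_decay=1e-4)
# Fine-tuning then loops over the ~170 ICDAR 2013 training samples for a few epochs.
```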

Implications and Future Work

The release of PubLayNet has broad implications for both the practical and theoretical advancement of document layout analysis. It provides a robust foundation for developing more sophisticated models and for transfer learning to document types beyond scientific literature. Future work may explore logical document structures and relationships, further expanding the capabilities of document understanding systems.

In conclusion, PubLayNet addresses the significant challenge posed by the lack of adequately large datasets in document layout analysis. It stands to facilitate advancements in the training of neural networks, enhancing layout understanding across diverse document types and domains. The dataset is accessible at https://github.com/ibm-aur-nlp/PubLayNet, promising further contributions to the field of document processing and analysis.
