DocBank: A Benchmark Dataset for Document Layout Analysis (2006.01038v3)

Published 1 Jun 2020 in cs.CL

Abstract: Document layout analysis usually relies on computer vision models to understand documents while ignoring textual information that is vital to capture. Meanwhile, high quality labeled datasets with both visual and textual information are still insufficient. In this paper, we present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis. DocBank is constructed using a simple yet effective way with weak supervision from the LaTeX documents available on the arXiv.com. With DocBank, models from different modalities can be compared fairly and multi-modal approaches will be further investigated and boost the performance of document layout analysis. We build several strong baselines and manually split train/dev/test sets for evaluation. Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents. The DocBank dataset is publicly available at https://github.com/doc-analysis/DocBank.

Authors (7)

Minghao Li
Yiheng Xu
Lei Cui
Shaohan Huang
Furu Wei
Zhoujun Li
Ming Zhou

Citations (181)

Summary

The paper introduces DocBank, a benchmark dataset with 500K token-level annotated pages for comprehensive document layout recognition.
The paper employs a weakly supervised method using LaTeX files from arXiv to automatically generate consistent annotations without extensive manual labor.
The paper demonstrates that models integrating visual and textual cues, like LayoutLM, significantly enhance performance in document layout analysis.

DocBank: A Benchmark Dataset for Document Layout Analysis

This paper introduces DocBank, a benchmark dataset for document layout analysis. The dataset comprises 500K document pages, each annotated at the token level, which supports fine-grained recognition of document structure. The authors argue that current layout analysis methods rely heavily on visual information and often neglect the text of the documents themselves. DocBank aims to close this gap by providing a resource in which both visual and textual signals are available to layout analysis models.

The authors build DocBank with a weakly supervised approach that draws on the LaTeX source files available on arXiv.com. They modify the source so that each semantic unit is rendered in a distinct color, recompile the document, and read token labels off the colors in the output. This yields annotations without the labor that manual labeling usually requires. Because the labels come directly from the document source, they are consistent across pages and cover semantic units such as abstracts, authors, equations, and sections, among others.
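Weak supervision from LaTeX source can be pictured as color-based labeling: if each semantic unit is typeset in its own color, every token extracted from the compiled page inherits a label from the color it was rendered in. The following is a minimal sketch of that lookup step only. The `Token` type, the `label_tokens` helper, the specific RGB values, and the `paragraph` fallback are all hypothetical and chosen for illustration; they are not taken from the DocBank tooling.

```python
from typing import Dict, List, NamedTuple, Tuple

RGB = Tuple[int, int, int]

# Hypothetical color assignments; the colors actually used to build
# DocBank may differ.
COLOR_TO_LABEL: Dict[RGB, str] = {
    (255, 0, 0): "abstract",
    (0, 255, 0): "author",
    (0, 0, 255): "equation",
    (255, 255, 0): "section",
}


class Token(NamedTuple):
    """A token as it might come out of a PDF text extractor (assumed shape)."""
    text: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1)
    color: RGB                       # fill color of the rendered glyphs


def label_tokens(tokens: List[Token], default: str = "paragraph") -> List[Tuple[str, str]]:
    """Map each token to a semantic label through its rendering color.

    Tokens whose color is not in the table fall back to `default`.
    """
    return [(t.text, COLOR_TO_LABEL.get(t.color, default)) for t in tokens]


tokens = [
    Token("Abstract", (100, 120, 180, 135), (255, 0, 0)),
    Token("E=mc^2", (200, 400, 260, 415), (0, 0, 255)),
    Token("Hello", (100, 500, 140, 515), (0, 0, 0)),
]
labeled = label_tokens(tokens)
```

In this sketch the labeling cost is a dictionary lookup per token, which is what makes annotation at the scale of hundreds of thousands of pages feasible without human annotators.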

The significance of DocBank is manifold:

Multi-Modal Integration: By providing a dataset that supports both NLP and computer vision approaches, DocBank allows researchers to compare models across different modalities directly. This fosters the development of multi-modal techniques that leverage both text and image data, potentially enhancing the performance of document layout analysis significantly.
Baselines and Evaluation: The authors establish strong baseline models for evaluating DocBank, including BERT, RoBERTa, and LayoutLM. Among these, LayoutLM, which incorporates both text and layout information through 2D position embeddings, outperforms the rest, highlighting the efficacy of integrating spatial information in document analysis.
Practical and Theoretical Implications: With its comprehensive annotations, DocBank holds the promise to guide future research in creating more robust models for document understanding. It can accelerate the development of systems capable of handling diverse document types, aiding applications in automated content extraction, document retrieval, and information management.
Future Prospects: DocBank opens avenues for exploring network architectures that further exploit the interplay between visual and textual data. Because the construction method is extensible, the dataset can in principle be adapted to documents in other languages and domains, broadening the scope of document layout analysis research.
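To make the 2D position embeddings mentioned for the LayoutLM baseline concrete, here is a simplified sketch in which a token's input vector is the sum of its text embedding and embeddings looked up from the four coordinates of its bounding box. It assumes coordinates are scaled to an integer 0 to 1000 grid and uses small randomly initialized tables. The vocabulary, the dimension, and the function names are illustrative assumptions, and a real model would add further terms (such as 1D position embeddings) and learn the tables during training.

```python
import random
from typing import List, Tuple

DIM = 8       # embedding size, kept tiny for illustration
GRID = 1001   # coordinates are integers in 0..1000

BBox = Tuple[int, int, int, int]


def make_table(rows: int, dim: int, seed: int) -> List[List[float]]:
    """Build a randomly initialized lookup table (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


# Hypothetical toy vocabulary.
VOCAB = {"[UNK]": 0, "DocBank": 1, "dataset": 2}
token_table = make_table(len(VOCAB), DIM, seed=0)
x_table = make_table(GRID, DIM, seed=1)  # shared by x0 and x1
y_table = make_table(GRID, DIM, seed=2)  # shared by y0 and y1


def normalize_bbox(bbox: Tuple[float, float, float, float],
                   page_width: float, page_height: float) -> BBox:
    """Scale page coordinates to the 0..1000 integer grid."""
    x0, y0, x1, y1 = bbox
    return (int(1000 * x0 / page_width), int(1000 * y0 / page_height),
            int(1000 * x1 / page_width), int(1000 * y1 / page_height))


def embed(token: str, bbox: BBox) -> List[float]:
    """Sum the text embedding with the four coordinate embeddings."""
    x0, y0, x1, y1 = bbox
    parts = [
        token_table[VOCAB.get(token, 0)],
        x_table[x0], y_table[y0],
        x_table[x1], y_table[y1],
    ]
    return [sum(values) for values in zip(*parts)]


box = normalize_bbox((153, 198, 306, 396), page_width=612, page_height=792)
vector = embed("DocBank", box)
```

The same token placed elsewhere on the page receives a different input vector, which is how spatial layout reaches a model that otherwise sees only text.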

In conclusion, DocBank contributes a scalable, richly annotated dataset to the field of document layout analysis, one well suited to multi-modal research. It is a step toward handling the complexity of document understanding in both academic research and real-world applications.

GitHub

doc-analysis/DocBank: DocBank: A Benchmark Dataset for Document Layout Analysis (https://github.com/doc-analysis/DocBank)