DocILE Benchmark for Document Information Localization and Extraction (2302.05658v2)

Published 11 Feb 2023 in cs.CL, cs.AI, and cs.LG

Abstract: This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer, applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset, baselines and supplementary material are available at https://github.com/rossumai/docile.

Authors (11)
  1. Štěpán Šimsa (5 papers)
  2. Milan Šulc (11 papers)
  3. Michal Uřičář (11 papers)
  4. Yash Patel (41 papers)
  5. Ahmed Hamdi (4 papers)
  6. Matěj Kocián (2 papers)
  7. Matyáš Skalický (3 papers)
  8. Antoine Doucet (18 papers)
  9. Mickaël Coustaty (15 papers)
  10. Dimosthenis Karatzas (80 papers)
  11. Jiří Matas (27 papers)
Citations (24)

Summary

  • The paper introduces the DocILE benchmark to advance key information localization and extraction with a diverse, large-scale dataset.
  • The dataset includes 6.7k annotated, 100k synthetic, and nearly 1M unlabeled documents, enabling robust training for document AI models.
  • Baseline experiments show that multimodal models like LayoutLMv3 outperform text-only approaches in both KILE and LIR tasks.

DocILE Benchmark for Document Information Localization and Extraction

The paper introduces the DocILE benchmark, a comprehensive dataset aimed at advancing research in Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) from business documents. The dataset is notable for its scale and composition: 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1 million unlabeled documents, supporting supervised training, pre-training on synthetic data, and unsupervised pre-training, respectively.
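As a quick orientation, the sketch below iterates over documents and annotations with the `docile` Python package from the linked repository. The `Dataset` class and the `annotation.fields` / `annotation.li_fields` accessors follow the repository's README, but exact names, arguments, and paths should be verified against the current code; the data path is a placeholder.

```python
# Minimal sketch (assumed API, per the README of
# https://github.com/rossumai/docile): iterate DocILE documents
# and their KILE/LIR annotations. "data/docile" is a placeholder
# path to a downloaded copy of the dataset.
from docile.dataset import Dataset

dataset = Dataset("train", "data/docile")  # splits: train / val / test

for document in dataset:
    kile_fields = document.annotation.fields    # KILE: field class + location + text
    li_fields = document.annotation.li_fields   # LIR: fields assigned to line items
    for page in range(document.page_count):
        image = document.page_image(page)        # rendered page, usable by vision models
    print(document.docid, len(kile_fields), len(li_fields))
```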

Key Features of the DocILE Dataset

  1. Annotation Diversity: The benchmark offers annotations across 55 classes, which provides much greater granularity compared to other datasets in the field. The diversity in annotation classes is essential for detailed information extraction and alignment with real-world business document variations.
  2. Focus on Line Item Recognition: Unlike previous datasets, DocILE explicitly targets LIR, a highly practical task where each extracted field must be correctly associated with the table row (line item) it belongs to; a minimal data-structure sketch follows this list. This aligns with applications such as automated invoice processing and order verification.
  3. Variety of Document Layouts: The benchmark includes documents originating from various layouts. The test set features both zero-shot and few-shot instances, ensuring that models are evaluated on their ability to generalize to unseen or sparsely seen layouts.
  4. Synthetic and Unlabeled Data: DocILE includes synthetic documents generated to mimic real-world variability while maintaining controlled annotations. This allows for robust training and evaluation scenarios. The inclusion of unlabeled documents enables unsupervised pre-training approaches to further enhance model performance.
  5. Baseline Models: To aid researchers, the paper provides baseline performances using models such as RoBERTa, LayoutLMv3, and DETR-based Table Transformer. These models serve as a benchmark for future enhancements and approaches in document information extraction.
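To make the LIR task concrete, here is a small, hypothetical sketch of what extracted fields and their grouping into line items might look like. The field names, the normalized-coordinate convention, and the `line_item_id` attribute are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical data structures for LIR outputs: each field carries its
# class, location, text, and (for table fields) the line item it belongs to.
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Field:
    fieldtype: str                           # one of the 55 classes (name illustrative)
    bbox: tuple[float, float, float, float]  # (left, top, right, bottom), page-normalized
    text: str
    line_item_id: Optional[int] = None       # None for fields outside any line item

def group_line_items(fields: list[Field]) -> dict[int, list[Field]]:
    """Group extracted fields by the table row (line item) they belong to."""
    items = defaultdict(list)
    for field in fields:
        if field.line_item_id is not None:
            items[field.line_item_id].append(field)
    return dict(items)

# Example: two fields from the first line item of an invoice.
predictions = [
    Field("line_item_description", (0.10, 0.42, 0.55, 0.45), "Widget A", line_item_id=0),
    Field("line_item_amount", (0.80, 0.42, 0.95, 0.45), "12.00", line_item_id=0),
]
print(group_line_items(predictions))  # {0: [description field, amount field]}
```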

Experimental Findings

The baseline evaluations yielded several insights:

  • The RoBERTa model achieved competitive results on both the KILE and LIR tasks, showing that text-only approaches remain viable even for documents whose spatial layout carries useful information.
  • LayoutLMv3, which merges image and text representations, showed superior performance, highlighting the benefits of integrating multimodal inputs.
  • Pre-training on the synthetic subset improved performance across different models, showcasing the utility of synthetic data in enhancing model learning before applying them to real-world data.
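For readers who want a concrete starting point with the multimodal baseline, below is a minimal, hypothetical sketch of setting up LayoutLMv3 for KILE-style token classification with Hugging Face Transformers. The BIO-style label scheme, the dummy inputs, and the 0-1000 bounding-box scale are assumptions for illustration; the actual DocILE baselines use their own preprocessing and label mapping (see the linked repository).

```python
# Minimal sketch: LayoutLMv3 token classification for KILE-style extraction.
# Label scheme and inputs are illustrative, not the paper's exact setup.
from PIL import Image
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

NUM_LABELS = 2 * 55 + 1  # assumed BIO tagging over the 55 field classes + "outside"

processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False  # words/boxes come from the dataset's OCR
)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=NUM_LABELS
)

# Dummy page: one image plus OCR words, boxes on the 0-1000 scale, and labels.
image = Image.new("RGB", (1000, 1000), "white")
words = ["Invoice", "No.", "12345"]
boxes = [[80, 40, 180, 60], [190, 40, 230, 60], [240, 40, 320, 60]]
labels = [0, 0, 1]  # hypothetical class ids

encoding = processor(image, words, boxes=boxes, word_labels=labels, return_tensors="pt")
outputs = model(**encoding)  # loss is computed because labels were provided
outputs.loss.backward()      # a real training loop would add an optimizer step
print(outputs.logits.shape)  # (batch, sequence_length, NUM_LABELS)
```

In practice, one would wrap this in a standard fine-tuning loop over the annotated training split, optionally after pre-training on the synthetic subset as the paper's baselines do.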

Implications and Future Directions

The introduction of the DocILE benchmark has substantial implications for both academic research and industrial applications. By offering a detailed, large-scale dataset, the benchmark sets a new standard for evaluating and developing document understanding models. Future research can explore more sophisticated multimodal learning strategies, zero-shot generalization to unseen layouts, and greater robustness to document layout variability.

In a broader context, the benchmark supports the shift toward automated systems capable of handling the intricacy and diversity of business document formats. That capability can improve the efficiency and accuracy of document-centric business processes and reduce their cost.

Overall, DocILE provides a fertile testing ground for advancing both theoretical and practical aspects of document AI. As methodologies evolve, the benchmark will serve as a valuable resource in measuring progress within the field and unlocking new capabilities in intelligent document processing technologies.
