RealKIE: Five Novel Datasets for Enterprise Key Information Extraction (2403.20101v1)
Abstract: We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets cover a diverse range of documents: SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction tasks like investment analysis and legal data processing. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data and OCR outputs are available to download at https://indicodatasolutions.github.io/RealKIE/. Code to reproduce the baselines will be available shortly.
Authors: Benjamin Townsend, Madison May, Christopher Wells