BuDDIE: A Business Document Dataset for Multi-task Information Extraction (2404.04003v1)
Abstract: The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific tasks of VRDU, such as document classification (DC), key entity extraction (KEE), entity linking, and visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse annotations, so each supports only one or two closely related tasks (e.g., entity extraction and entity linking). Unfortunately, focusing on a single document type or task is not representative of how documents often need to be processed in the wild, where variety in style and requirements is expected. In this paper, we introduce BuDDIE (Business Document Dataset for Information Extraction), the first multi-task dataset of 1,665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in style and layout across states and types (e.g., forms, certificates, reports). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and LLM approaches to VRDU.
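The abstract describes three annotation layers per document: a document-class label (DC), key-entity spans (KEE), and question-answer pairs (VQA). As a rough illustration of how such multi-task annotations might be organized, below is a minimal, hypothetical sketch in Python; the field names (`doc_class`, `entities`, `qa_pairs`, `bbox`, etc.) are illustrative assumptions and do not reflect BuDDIE's actual release format.

```python
# Hypothetical sketch of a multi-task annotation record for a
# BuDDIE-style dataset. Field names are illustrative assumptions,
# not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EntityAnnotation:
    """A key entity span for KEE, with its label and page-level bounding box."""
    label: str                       # e.g., "company_name"
    text: str                        # surface string of the entity
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page coordinates


@dataclass
class QAAnnotation:
    """A question grounded in the document, for the VQA task."""
    question: str
    answer: str
    answer_entity_ids: List[int] = field(default_factory=list)  # links to entities, if any


@dataclass
class DocumentAnnotation:
    """One document with dense annotations for all three tasks (DC, KEE, VQA)."""
    doc_id: str
    doc_class: str                   # DC label, e.g., "certificate"
    entities: List[EntityAnnotation] = field(default_factory=list)
    qa_pairs: List[QAAnnotation] = field(default_factory=list)


# Toy example record (contents invented for illustration only):
example = DocumentAnnotation(
    doc_id="DE-000123",
    doc_class="certificate",
    entities=[EntityAnnotation("state", "Delaware", (102, 88, 180, 104))],
    qa_pairs=[QAAnnotation("In which state was the entity registered?", "Delaware", [0])],
)
```

Keyed this way, a single record could drive classification, extraction, and question-answering baselines of the kind the abstract mentions.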
Authors: Ran Zmigrod, Dongsheng Wang, Mathieu Sibue, Yulong Pei, Petr Babkin, Ivan Brugere, Xiaomo Liu, Nacho Navarro, Antony Papadimitriou, William Watson, Zhiqiang Ma, Armineh Nourbakhsh, Sameena Shah