
BuDDIE: A Business Document Dataset for Multi-task Information Extraction (2404.04003v1)

Published 5 Apr 2024 in cs.CL

Abstract: The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific tasks of VRDU, such as document classification (DC), key entity extraction (KEE), entity linking, and visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse annotations, so that they support only one or two co-related tasks (e.g., entity extraction and entity linking). Unfortunately, focusing only on a single type of document or task is not representative of how documents often need to be processed in the wild, where variety in style and requirements is expected. In this paper, we introduce BuDDIE (Business Document Dataset for Information Extraction), the first multi-task dataset of 1,665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in their style and layout across states and types (e.g., forms, certificates, reports, etc.). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and LLM approaches to VRDU.
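To make the multi-task setup concrete, a single document in a dataset like this carries labels for all three tasks at once. The sketch below shows a hypothetical annotation record combining DC, KEE, and VQA labels; the schema, field names, and values are illustrative only and are not the actual BuDDIE format:

```python
# Hypothetical multi-task annotation record for one business document,
# combining document classification (DC), key entity extraction (KEE),
# and visual question answering (VQA) labels. Illustrative schema only.
record = {
    "doc_id": "example_0001",           # hypothetical identifier
    "dc": {"doc_type": "certificate"},  # document classification label
    "kee": [                            # key entities found in the text
        {"label": "company_name", "text": "Acme Corp."},
        {"label": "state", "text": "Delaware"},
    ],
    "vqa": [                            # question-answer pairs
        {"question": "Which state issued this document?",
         "answer": "Delaware"},
    ],
}

# All three task annotations refer to the same underlying document,
# which is what makes one dataset usable for multi-task training.
assert {"dc", "kee", "vqa"} <= set(record)
print(len(record["kee"]))  # number of annotated key entities
```

Because the annotations are dense rather than sparse, a single record like this can feed DC, KEE, and VQA baselines without re-annotating the document per task.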

Authors (13)
  1. Ran Zmigrod (17 papers)
  2. Dongsheng Wang (47 papers)
  3. Mathieu Sibue (5 papers)
  4. Yulong Pei (31 papers)
  5. Petr Babkin (6 papers)
  6. Ivan Brugere (21 papers)
  7. Xiaomo Liu (17 papers)
  8. Nacho Navarro (3 papers)
  9. Antony Papadimitriou (3 papers)
  10. William Watson (10 papers)
  11. Zhiqiang Ma (19 papers)
  12. Armineh Nourbakhsh (18 papers)
  13. Sameena Shah (33 papers)
Citations (2)