LMDX: Language Model-based Document Information Extraction and Localization (2309.10952v2)

Published 19 Sep 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Large language models (LLMs) have revolutionized NLP, improving the state of the art and exhibiting emergent capabilities across various tasks. However, they have not yet been successfully applied to extracting information from visually rich documents, a task at the core of many document processing workflows that involves extracting key entities from semi-structured documents. The main obstacles to adopting LLMs for this task are the absence of layout encoding within LLMs, which is critical for high-quality extraction, and the lack of a grounding mechanism to localize the predicted entities within the document. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology for reframing the document information extraction task for an LLM. LMDX enables extraction of singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. Finally, we apply LMDX to the PaLM 2-S and Gemini Pro LLMs and evaluate it on the VRDU and CORD benchmarks, setting a new state of the art and showing how LMDX enables the creation of high-quality, data-efficient parsers.
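The two mechanisms the abstract names, layout encoding and grounding, can be made concrete with a short sketch. The Python below illustrates one plausible reading of the scheme: each OCR segment is rendered as its text followed by quantized 2D coordinates so the LLM sees layout, and a predicted entity is accepted only if the coordinates it echoes point back to a real segment containing the predicted value. The `text cx|cy` segment format, the schema rendering, and all function names (`quantize`, `build_prompt`, `ground`) are illustrative assumptions, not the paper's exact specification.

```python
# Minimal sketch of an LMDX-style pipeline, under the assumptions stated
# above: layout is encoded as quantized coordinates appended to each OCR
# segment, and grounding means verifying a prediction against those segments.

import json

def quantize(value, page_size, buckets=100):
    """Map an absolute coordinate onto a small integer grid so that
    encoding layout costs only a few tokens per segment."""
    return min(buckets - 1, int(value / page_size * buckets))

def build_prompt(ocr_lines, page_width, page_height, schema):
    """Render each OCR line as `text cx|cy`, where (cx, cy) is the
    quantized center of its bounding box, then append the target schema."""
    segments = []
    for line in ocr_lines:
        cx = quantize((line["x0"] + line["x1"]) / 2, page_width)
        cy = quantize((line["y0"] + line["y1"]) / 2, page_height)
        segments.append(f'{line["text"]} {cx}|{cy}')
    return (
        "=== Document ===\n" + "\n".join(segments)
        + "\n=== Schema ===\n" + json.dumps(schema)
        + "\n=== Extraction ===\n"
    )

def ground(prediction, ocr_lines, page_width, page_height):
    """Grounding check: keep an entity only if its reported coordinates
    match a real segment whose text contains the predicted value."""
    grounded = {}
    for entity, value in prediction.items():
        text, _, coords = value.rpartition(" ")
        cx, cy = (int(c) for c in coords.split("|"))
        for line in ocr_lines:
            lx = quantize((line["x0"] + line["x1"]) / 2, page_width)
            ly = quantize((line["y0"] + line["y1"]) / 2, page_height)
            if (lx, ly) == (cx, cy) and text in line["text"]:
                grounded[entity] = {"value": text, "segment": line}
                break
    return grounded

# Example: one OCR line from a receipt, a single-entity schema, and a
# hypothetical model response echoing the segment's coordinate tag.
ocr = [{"text": "Total: $12.50", "x0": 40, "x1": 260, "y0": 700, "y1": 720}]
print(build_prompt(ocr, page_width=612, page_height=792, schema={"total": ""}))
print(ground({"total": "$12.50 24|89"}, ocr, 612, 792))
```

Running the example prints a prompt whose single segment is tagged `24|89` and a grounded result mapping `total` back to that segment; a prediction whose coordinates match no real segment is dropped, which is what makes the localization verifiable rather than hallucinated.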

Authors (12)
  1. Vincent Perot (14 papers)
  2. Kai Kang (25 papers)
  3. Florian Luisier (6 papers)
  4. Guolong Su (12 papers)
  5. Xiaoyu Sun (34 papers)
  6. Ramya Sree Boppana (2 papers)
  7. Zilong Wang (99 papers)
  8. Jiaqi Mu (7 papers)
  9. Hao Zhang (947 papers)
  10. Nan Hua (14 papers)
  11. Zifeng Wang (78 papers)
  12. Chen-Yu Lee (48 papers)
Citations (20)