DocLLM: A layout-aware generative language model for multimodal document understanding (2401.00908v1)

Published 31 Dec 2023 in cs.CL

Abstract: Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional LLMs for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focusing exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers into a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset, covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.

Introduction

Documents like forms and contracts represent a substantial part of the data within enterprises. These documents are not just about the text they contain but also about how that text is organized spatially. Understanding such documents computationally is challenging due to their complex layouts and varied formats. While significant progress has been made in Document AI (DocAI) on tasks like extraction and classification, accurately processing the documents encountered in real-world applications remains difficult. Conventional language models such as GPT or BERT handle text alone and often fall short on multimodal data, such as documents that integrate text with visual layout cues. Multimodal LLMs that factor in both visual and textual information typically require complex image encoders, which can be resource-intensive.

The DocLLM Framework

The paper presents DocLLM, a lightweight extension of standard LLMs for multimodal document understanding. The model integrates spatial layout via bounding box data, capturing the cross-modal alignment between text and layout without the need for image encoders. By expanding the traditional transformer attention mechanism to account for spatial information, DocLLM can disentangle textual semantics from spatial structure, attending to each modality selectively when needed.
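To make the decomposition concrete, below is a minimal single-head PyTorch sketch of disentangled spatial attention in this style. This is an illustrative reading rather than the paper's code: the module and argument names (`DisentangledSpatialAttention`, `lam_ts`, `lam_st`, `lam_ss`) are ours, the lambda weights are treated as fixed hyperparameters, and values are drawn from the text stream only, with bounding-box embeddings shaping the attention weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledSpatialAttention(nn.Module):
    """Minimal sketch of DocLLM-style disentangled attention (single head).

    The attention score between tokens i and j is a weighted sum of four
    interaction terms between text embeddings (t) and spatial/bounding-box
    embeddings (s). The lambda weights are hyperparameters.
    """
    def __init__(self, d_model: int, lam_ts=1.0, lam_st=1.0, lam_ss=1.0):
        super().__init__()
        # Separate projections for the text and spatial streams.
        self.q_t = nn.Linear(d_model, d_model)
        self.k_t = nn.Linear(d_model, d_model)
        self.v_t = nn.Linear(d_model, d_model)
        self.q_s = nn.Linear(d_model, d_model)
        self.k_s = nn.Linear(d_model, d_model)
        self.lam_ts, self.lam_st, self.lam_ss = lam_ts, lam_st, lam_ss
        self.scale = d_model ** -0.5

    def forward(self, text_emb, box_emb, causal_mask):
        Qt, Kt, Vt = self.q_t(text_emb), self.k_t(text_emb), self.v_t(text_emb)
        Qs, Ks = self.q_s(box_emb), self.k_s(box_emb)
        # Four disentangled score matrices: text-text, text-spatial,
        # spatial-text, spatial-spatial.
        scores = (Qt @ Kt.transpose(-2, -1)
                  + self.lam_ts * Qt @ Ks.transpose(-2, -1)
                  + self.lam_st * Qs @ Kt.transpose(-2, -1)
                  + self.lam_ss * Qs @ Ks.transpose(-2, -1)) * self.scale
        scores = scores.masked_fill(causal_mask == 0, float("-inf"))
        # Values come from the text stream only; boxes only shape the weights.
        return F.softmax(scores, dim=-1) @ Vt
```

Because the spatial stream contributes only additive score terms, any individual cross-modal interaction can be switched off by zeroing its lambda, which makes ablations over the individual terms straightforward.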

The pre-training objective is also adapted to the particular traits of visual documents. Instead of conventional token-by-token prediction alone, DocLLM learns to infill blocks of text conditioned on both preceding and succeeding context, which is closer to the way humans process information in documents with irregular layouts.
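As a rough illustration of this objective, the sketch below rearranges a document's text blocks so that masked blocks are generated last, conditioned on both the preceding and succeeding visible blocks, in the spirit of GLM- and fill-in-the-middle-style infilling. The sentinel tokens (`<mask_i>`, `<fill_i>`, `<eos>`) and the helper name are illustrative placeholders, not DocLLM's actual vocabulary:

```python
# Hypothetical illustration of constructing a block-infilling training
# sequence; the special-token scheme is assumed, not taken from the paper.
def build_infilling_sequence(blocks, masked_ids):
    """Rearrange text blocks so masked ones are predicted at the end,
    conditioned on both preceding and succeeding visible context."""
    context, targets = [], []
    for i, block in enumerate(blocks):
        if i in masked_ids:
            context.append(f"<mask_{i}>")          # sentinel in the context
            targets.append(f"<fill_{i}> {block}")  # block to reconstruct
        else:
            context.append(block)
    # The model sees all visible blocks (past and future context), then
    # autoregressively generates the masked blocks.
    return " ".join(context) + " " + " ".join(targets) + " <eos>"

example = build_infilling_sequence(
    ["Invoice No: 1234", "Date: 01/02/2024", "Total: $560.00"], masked_ids={1}
)
# -> "Invoice No: 1234 <mask_1> Total: $560.00 <fill_1> Date: 01/02/2024 <eos>"
```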

Additionally, DocLLM undergoes instruction-tuning on an extensive dataset covering multiple document intelligence tasks, enhancing its performance by utilizing layout hints such as separators and captions within the instructions. The experimental results reveal that DocLLM outperforms SotA models on a range of document analysis tasks, displaying noteworthy generalization capabilities.
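For intuition, a hypothetical instruction-tuning example for a document question-answering task might be serialized as below. The template, field labels, and helper name are illustrative assumptions rather than the paper's exact format; the key point is that each OCR token carries a bounding box that feeds the spatial stream while the text feeds the token stream:

```python
# Hypothetical prompt template for instruction-tuning on a document task.
# Field names and separators are illustrative, not the paper's exact format.
def format_vqa_example(ocr_tokens, question):
    """Serialize OCR text (each token paired with its bounding box) plus an
    instruction; boxes feed the spatial stream, text feeds the token stream."""
    text = " ".join(tok for tok, _box in ocr_tokens)
    boxes = [box for _tok, box in ocr_tokens]  # one (x0, y0, x1, y1) per token
    prompt = f"Document: {text}\nQuestion: {question}\nAnswer:"
    return prompt, boxes

prompt, boxes = format_vqa_example(
    [("Total:", (320, 610, 370, 625)), ("$560.00", (380, 610, 450, 625))],
    "What is the total amount?",
)
```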

Experiments and Results

The model is rigorously evaluated from two perspectives: performance on the same types of documents under different data splits, and generalization to entirely unseen datasets. In both settings, DocLLM shows robust capabilities, surpassing comparable models on numerous tasks.

Comparisons against other leading models, including Llama 2 and GPT-4, indicate that DocLLM's architecture is beneficial across document types and tasks. Notably, its adeptness at handling documents that mix textual and spatial information, thanks to the disentangled spatial attention mechanism, positions it as a favorable choice for enterprises dealing with diverse document types.

Ablation Studies

Delving deeper into the workings of DocLLM, ablation studies highlight the significance of the model's components. They examine the added value of spatial attention, confirming that incorporating layout through disentangled attention improves the model's understanding, and compare the block-infilling objective against conventional causal language modeling, demonstrating the effectiveness of DocLLM's pre-training approach. The predictive power of the causal-decoder configuration further validates the chosen architecture for downstream tasks.

Conclusion and Future Work

DocLLM showcases an evolved approach to document processing, blending complex spatial layouts with the language understanding LLMs are known for. Its design circumvents the heavyweight vision encoders found in many multimodal LLMs while providing comparable, if not superior, performance. The paper highlights the model's potential applications and suggests future enhancements, such as incorporating vision more seamlessly into the DocLLM framework, expanding its reach within the document intelligence sphere.

Authors (9)
  1. Dongsheng Wang
  2. Natraj Raman
  3. Mathieu Sibue
  4. Zhiqiang Ma
  5. Petr Babkin
  6. Simerjot Kaur
  7. Yulong Pei
  8. Armineh Nourbakhsh
  9. Xiaomo Liu