
DocGraphLM: Documental Graph Language Model for Information Extraction (2401.02823v1)

Published 5 Jan 2024 in cs.CL and cs.IR

Abstract: Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two tropes of architectures have emerged -- transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained LLMs with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both directions and distances between nodes using a convergent joint loss function that prioritizes neighborhood restoration and downweighs distant node detection. Our experiments on three SotA datasets show consistent improvement on IE and QA tasks with the adoption of graph features. Moreover, we report that adopting the graph features accelerates convergence in the learning process during training, despite being solely constructed through link prediction.


Summary

  • The paper introduces DocGraphLM, a framework that integrates GNNs with LLMs to predict document element relationships using a novel link prediction strategy.
  • The methodology converts text segments into graph nodes and employs a multi-task approach to assess relationships, distances, and directions.
  • The evaluation demonstrates significant performance gains and faster convergence on IE and QA tasks across datasets like FUNSD, CORD, and DocVQA.

Introduction to DocGraphLM

The field of information extraction from Visually Rich Documents (VrDs), such as PDFs, scans, and images of business forms, is progressing rapidly, providing important tools for digitization and information retrieval systems. Traditional approaches have largely relied on pre-trained transformer-based language models on the one hand and Graph Neural Networks (GNNs) on the other. However, these models often struggle with spatially complex documents, where crucial pieces of information can be widely separated on the page.

To address this, the authors propose DocGraphLM, a framework that combines the semantic capabilities of pre-trained language models with the structural strengths of graph representations. By joining the two model families, the approach is better equipped to handle the layout complexities of VrDs.

Advancements in Model Architecture

DocGraphLM integrates a GNN into the prevalent pre-trained language-model framework. This integration enables a novel link prediction strategy that predicts the relationships between document elements, both their directions and their distances. The model is designed to restore local neighborhood structure effectively, prioritizing close connections and down-weighting distant node interactions through a logarithmic transformation of distances.
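To make the distance down-weighting concrete, here is a minimal sketch of how regression and classification targets for a node pair could be derived. The eight-way direction binning and the log1p scaling are illustrative assumptions, not values taken from the paper.

```python
import math
import torch

NUM_DIRECTIONS = 8  # assumed: one direction class per 45-degree sector

def link_targets(src_xy: torch.Tensor, dst_xy: torch.Tensor):
    """src_xy, dst_xy: (N, 2) node-center coordinates in pixels."""
    delta = dst_xy - src_xy
    dist = torch.linalg.norm(delta, dim=-1)
    # Logarithmic transform: compresses large distances so the training
    # signal emphasizes restoring close neighborhoods over distant pairs.
    log_dist = torch.log1p(dist)
    # Direction: quantize the polar angle into NUM_DIRECTIONS sectors.
    angle = torch.atan2(delta[..., 1], delta[..., 0])  # in (-pi, pi]
    sector = ((angle + math.pi) / (2 * math.pi) * NUM_DIRECTIONS).long()
    direction = sector.clamp(max=NUM_DIRECTIONS - 1)
    return log_dist, direction
```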

Experiments on several datasets show that incorporating graph features not only improves performance on a range of Information Extraction (IE) and Question Answering (QA) tasks but also yields noticeably faster convergence during training.

Methodology and Link Prediction

To construct the document graph, text segments produced by Optical Character Recognition (OCR) tools become graph nodes, and the relationships between segments become graph edges. In place of k-nearest-neighbor or β-skeleton approaches, the authors use a novel D-LoS heuristic, which lets the model link distant yet relevant document elements. A simplified version of this construction is sketched below.
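The following is a hedged sketch of direction-aware edge construction over OCR segments. The sector-based nearest-neighbor rule is an illustrative stand-in for the paper's D-LoS heuristic, whose exact definition is not reproduced here, and the Segment class is a hypothetical container.

```python
import math
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    x: float  # center x of the OCR bounding box
    y: float  # center y of the OCR bounding box

def build_edges(segments: list[Segment], num_sectors: int = 8):
    """Connect each node to its nearest neighbor in each angular sector.

    Unlike plain kNN, this can link a node to a distant element when that
    element is the closest one in an otherwise empty direction.
    """
    edges = []
    for i, s in enumerate(segments):
        nearest = {}  # sector index -> (distance, neighbor index)
        for j, t in enumerate(segments):
            if i == j:
                continue
            dx, dy = t.x - s.x, t.y - s.y
            dist = math.hypot(dx, dy)
            sector = int((math.atan2(dy, dx) + math.pi)
                         / (2 * math.pi) * num_sectors) % num_sectors
            if sector not in nearest or dist < nearest[sector][0]:
                nearest[sector] = (dist, j)
        edges.extend((i, j) for _, j in nearest.values())
    return edges
```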

The paper details node and edge representations carefully, showing that both text semantics and node geometry are crucial to understanding the role of each text block within a document. Node relationships are predicted through a multi-task learning approach that jointly estimates distance and direction, with separate heads for regression and classification optimized through a joint loss function.
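A minimal PyTorch sketch of such a two-headed setup follows. The concatenated pair representation, hidden size, direction-class count, and loss weight alpha are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinkPredictionHeads(nn.Module):
    """Two heads over a concatenated node-pair representation."""

    def __init__(self, hidden: int = 768, num_directions: int = 8):
        super().__init__()
        self.distance_head = nn.Linear(2 * hidden, 1)                # regression
        self.direction_head = nn.Linear(2 * hidden, num_directions)  # classification

    def forward(self, h_src: torch.Tensor, h_dst: torch.Tensor):
        pair = torch.cat([h_src, h_dst], dim=-1)
        return self.distance_head(pair).squeeze(-1), self.direction_head(pair)

def joint_loss(pred_dist, pred_dir, tgt_log_dist, tgt_dir, alpha: float = 1.0):
    # Regressing the log-transformed distance already down-weights distant
    # pairs relative to near neighbors; alpha balances the two tasks.
    reg = F.mse_loss(pred_dist, tgt_log_dist)
    cls = F.cross_entropy(pred_dir, tgt_dir)
    return reg + alpha * cls
```

In practice the node embeddings h_src and h_dst would come from the joint language-model and GNN encoder; here they are abstracted to generic tensors.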

Evaluation and Findings

DocGraphLM is evaluated on three prominent datasets: FUNSD for form understanding, CORD for receipt understanding, and DocVQA for visual question answering over document images. The results demonstrate consistent improvements over established baselines such as LayoutLM.

Across these datasets, DocGraphLM registers marked gains in task performance, with the added graph features contributing significantly to the improvement. Reported p-values below 0.05 indicate that the gains from graph-based learning are statistically significant. The faster convergence observed also suggests that graph features provide a more targeted training signal, improving the model's overall efficiency.

In conclusion, DocGraphLM represents a significant step forward in VrD understanding. Looking ahead, the authors plan to experiment with further pre-training strategies and linkage representations, potentially unlocking new gains in document understanding.