DocGraphLM: Documental Graph Language Model for Information Extraction (2401.02823v1)
Abstract: Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two classes of architectures have emerged -- transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained LLMs with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both directions and distances between nodes using a convergent joint loss function that prioritizes neighborhood restoration and down-weights distant node detection. Our experiments on three state-of-the-art (SotA) datasets show consistent improvement on IE and QA tasks with the adoption of graph features. Moreover, we report that adopting the graph features accelerates convergence during training, even though they are constructed solely through link prediction.
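The abstract describes a joint link-prediction objective that predicts both the direction and the distance between document nodes, while weighting nearby pairs more heavily than distant ones. The sketch below is a minimal illustration of that idea under stated assumptions, not the paper's implementation: the class name `LinkPredictionHead`, the choice of 8 spatial directions, the log-scaled distance target, the cross-entropy/smooth-L1 split, and the 1/(1 + log d) down-weighting scheme are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class LinkPredictionHead(nn.Module):
    """Hypothetical joint link-prediction head in the spirit of DocGraphLM:
    for a pair of node embeddings it predicts (a) the edge direction
    (assumed here to be one of 8 spatial directions) and (b) a log-scaled
    distance, and combines both losses while down-weighting distant pairs."""

    def __init__(self, hidden_size: int, num_directions: int = 8):
        super().__init__()
        self.direction_clf = nn.Linear(2 * hidden_size, num_directions)
        self.distance_reg = nn.Linear(2 * hidden_size, 1)
        self.ce = nn.CrossEntropyLoss(reduction="none")
        self.smooth_l1 = nn.SmoothL1Loss(reduction="none")

    def forward(self, src, dst, direction_labels, distances):
        # src, dst: (batch, hidden) embeddings of the two nodes in a candidate pair
        pair = torch.cat([src, dst], dim=-1)
        dir_logits = self.direction_clf(pair)
        dist_pred = self.distance_reg(pair).squeeze(-1)

        # Log-scale the target distance so nearby neighbors dominate the signal.
        log_dist = torch.log1p(distances)

        dir_loss = self.ce(dir_logits, direction_labels)
        dist_loss = self.smooth_l1(dist_pred, log_dist)

        # Down-weight distant pairs: close neighbors get weight near 1,
        # far-away pairs contribute progressively less (assumed scheme).
        weights = 1.0 / (1.0 + log_dist)
        return (weights * (dir_loss + dist_loss)).mean()


if __name__ == "__main__":
    # Toy usage: random tensors stand in for encoder outputs, only to show shapes.
    head = LinkPredictionHead(hidden_size=768)
    src, dst = torch.randn(4, 768), torch.randn(4, 768)
    direction_labels = torch.randint(0, 8, (4,))      # one of 8 assumed directions
    distances = torch.rand(4) * 500.0                 # pixel distances between nodes
    print(head(src, dst, direction_labels, distances))
```

In the paper's setting the node embeddings would come from the joint text-layout encoder; here they are random tensors purely to make the sketch self-contained.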