FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction (2305.02549v2)
Abstract: The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning for form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on the FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size.
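The two core ideas in the abstract can be illustrated concretely. Below is a minimal NumPy sketch, assuming an NT-Xent-style contrastive objective over node embeddings from two corrupted views of the document graph; the function names, the pooling region helper, and the loss variant are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def union_box(box_a, box_b):
    """Axis-aligned union of two token bounding boxes (x0, y0, x1, y1).
    Illustrates the edge-level region from which targeted image features
    could be pooled for a pair of tokens joined by a graph edge."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def ntxent_loss(z1, z2, tau=0.5):
    """NT-Xent contrastive loss over two graph views.
    z1, z2: (N, d) node embeddings from two augmented views of the same
    graph; row i of z1 and row i of z2 form the positive pair."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # L2-normalize
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)        # (2N, d) joint batch
    sim = z @ z.T / tau                         # temperature-scaled cosine sims
    np.fill_diagonal(sim, -np.inf)              # exclude self-similarity
    # Each row's positive is the same node in the other view.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())
```

As a sanity check, two identical views yield a much lower loss than two unrelated embedding matrices, since the positive pairs then dominate the softmax.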
- Form2Seq: A framework for higher-order form structure extraction. In EMNLP.
- ETC: Encoding long and structured inputs in transformers. In EMNLP.
- DocFormer: End-to-end transformer for document understanding. In ICCV.
- UniLMv2: Pseudo-masked language models for unified language model pre-training. In ICML.
- A simple framework for contrastive learning of visual representations. In ICML.
- Rule-based information extraction is dead! Long live rule-based information extraction systems! In EMNLP.
- Self-supervised representation learning on document images. In International Workshop on Document Analysis Systems, pages 103–117. Springer.
- Timo I Denk and Christian Reisswig. 2019. BERTgrid: Contextualized embedding for 2d document representation and understanding. arXiv preprint arXiv:1909.04948.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- LAMBERT: Layout-aware (language) modeling for information extraction. arXiv preprint arXiv:2002.08087.
- Unified pretraining framework for document understanding. arXiv preprint arXiv:2204.10939.
- XYLayoutLM: Towards layout-aware multimodal networks for visually-rich document understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4583–4592.
- Recursive xy cut using bounding boxes of connected components. In ICDAR.
- Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning. PMLR.
- Mask R-CNN. In ICCV.
- Deep residual learning for image recognition. In CVPR.
- LayoutLMv3: Pre-training for Document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia.
- ICDAR2019 competition on scanned receipt OCR and information extraction. In ICDAR.
- Spatial dependency parsing for semi-structured document information extraction. In ACL-IJCNLP (Findings).
- FUNSD: A dataset for form understanding in noisy scanned documents. In ICDAR-OST.
- Chargrid: Towards understanding 2d documents. In EMNLP.
- OCR-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer.
- A fast and efficient method for extracting text paragraphs and graphics from unconstrained documents. In ICPR.
- FormNet: Structural encoding beyond sequential modeling in form document information extraction. In ACL.
- ROPE: Reading order equivariant positional encoding for graph-based document information extraction. In ACL-IJCNLP.
- Building a test collection for complex document information processing. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval.
- StructuralLM: Structural pre-training for form understanding. In ACL.
- DiT: Self-supervised pre-training for document image transformer. arXiv preprint arXiv:2203.02378.
- SelfDoc: Self-supervised document representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5652–5660.
- Graph matching networks for learning the similarity of graph structured objects. In International conference on machine learning, pages 3835–3845. PMLR.
- StrucTexT: Structured text understanding with multi-modal transformers. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1912–1920.
- Feature pyramid networks for object detection. In CVPR.
- ViBERTgrid: A jointly trained multi-modal 2d document representation for key information extraction from documents. In International Conference on Document Analysis and Recognition, pages 548–563. Springer.
- Representation learning for information extraction from form-like documents. In ACL.
- Artificial neural networks for document analysis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Lawrence O'Gorman. 1993. The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- CloudScan: A configuration-free invoice analysis system using recurrent neural networks. In ICDAR.
- CORD: A consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019.
- Going full-tilt boogie on document understanding with text-image-layout transformer. In ICDAR.
- Towards a multi-modal, multi-task learning based pre-training framework for document representation learning. arXiv preprint arXiv:2009.14457.
- Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Conference on Computational Natural Language Learning (CoNLL).
- Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems.
- ImageNet large scale visual recognition challenge. IJCV.
- A fast algorithm for bottom-up document layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems.
- Wilson L Taylor. 1953. “cloze procedure”: A new tool for measuring readability. Journalism quarterly.
- Deep graph infomax. ICLR.
- Jesse Vig. 2019. A multiscale visualization of attention in the transformer model. In ACL: System Demonstrations.
- LiLT: A simple yet effective language-independent layout transformer for structured document understanding. arXiv preprint arXiv:2202.13669.
- ERNIE-mmLayout: Multi-grained multimodal transformer for document understanding. In Proceedings of the 30th ACM International Conference on Multimedia.
- QueryForm: A simple zero-shot form entity query framework. arXiv preprint arXiv:2211.07730.
- Robust layout-aware IE for visually rich documents with pre-trained language models. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2367–2376.
- Representing long-range context for graph neural networks with global attention. Advances in Neural Information Processing Systems, 34:13266–13279.
- Unsupervised feature learning via non-parametric instance discrimination. In CVPR.
- LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In ACL-IJCNLP.
- LayoutLM: Pre-training of text and layout for document image understanding. In KDD.
- Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems.
- CUTIE: Learning to understand documents with convolutional universal text information extractor. In ICDAR.
- PubLayNet: Largest dataset ever for document layout analysis. In ICDAR.
- An empirical study of graph contrastive learning. arXiv preprint arXiv:2109.01116.
- Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131.