FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction (2305.02549v2)

Published 4 May 2023 in cs.CL, cs.CV, and cs.LG

Abstract: The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size.
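The abstract describes two mechanisms: pooling image features inside the union bounding box of two tokens joined by a graph edge, and a contrastive objective that maximizes agreement between representations of the same item across views. The sketch below illustrates both ideas in minimal, self-contained Python; all function names, shapes, and the toy 1-D InfoNCE similarity are illustrative assumptions, not the paper's actual implementation.

```python
import math

def union_box(box_a, box_b):
    """Union bounding box of two (x0, y0, x1, y1) token boxes.

    This is the region joining an edge's token pair, from which
    edge-level image features would be pooled.
    """
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def pool_region(feature_map, box):
    """Mean-pool a 2D feature map (list of rows of scalars) over an
    integer box given as (x0, y0, x1, y1), exclusive of x1/y1."""
    x0, y0, x1, y1 = box
    vals = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)

def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE-style agreement loss over paired scalar embeddings
    (toy 1-D case; real models would use cosine similarity on vectors).

    Matching indices across the two views are positives; all other
    pairs are negatives. Lower loss means better agreement."""
    def sim(u, v):  # negative squared distance as a toy similarity
        return -(u - v) ** 2 / temperature
    loss = 0.0
    n = len(view_a)
    for i in range(n):
        logits = [sim(view_a[i], view_b[j]) for j in range(n)]
        log_den = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_den)  # cross-entropy toward positive j == i
    return loss / n
```

In this sketch, perfectly aligned views yield a near-zero loss while mismatched views are penalized, which is the "maximize agreement" behavior the abstract refers to; the union-box pooling stands in for the paper's targeted extraction of visual cues along graph edges.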

