
Transformers and Language Models in Form Understanding: A Comprehensive Review of Scanned Document Analysis (2403.04080v1)

Published 6 Mar 2024 in cs.CL and cs.CV

Abstract: This paper presents a comprehensive survey of research on form understanding in the context of scanned documents. We delve into recent advancements and breakthroughs in the field, highlighting the significance of LLMs and transformers in solving this challenging task. Our research methodology involves an in-depth analysis of popular document and form understanding trends over the last decade, enabling us to offer valuable insights into the evolution of this domain. Focusing on cutting-edge models, we showcase how transformers have propelled the field forward, revolutionizing form understanding techniques. Our exploration includes an extensive examination of state-of-the-art LLMs designed to effectively handle the complexities of noisy scanned documents. Furthermore, we present an overview of the latest and most relevant datasets, which serve as essential benchmarks for evaluating the performance of the selected models. By comparing and contrasting the capabilities of these models, we aim to provide researchers and practitioners with useful guidance in choosing the most suitable solutions for their specific form understanding tasks.
