
Visually Guided Generative Text-Layout Pre-training for Document Intelligence (2403.16516v2)

Published 25 Mar 2024 in cs.CL and cs.CV

Abstract: Prior studies show that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to perceive and reason over both document texts and layouts (e.g., the locations of texts and table cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of Transformers in processing long documents, we introduce a straightforward yet effective multi-segment generative pre-training scheme, enabling ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize the texts of document images. Moreover, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.
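
To make the abstract's "interleaved text and layout sequence" concrete, here is a minimal illustrative sketch (not the authors' code) of how a document's words and their bounding boxes could be serialized into one generation target, with each word followed by discretized location tokens. The function names, the `<loc_…>` token format, and the bin count of 1000 are assumptions for illustration only; the paper's actual tokenization may differ.

```python
def quantize_bbox(bbox, width, height, bins=1000):
    """Map a pixel-space box (x0, y0, x1, y1) to discrete layout bins."""
    x0, y0, x1, y1 = bbox
    return (
        int(x0 / width * (bins - 1)),
        int(y0 / height * (bins - 1)),
        int(x1 / width * (bins - 1)),
        int(y1 / height * (bins - 1)),
    )

def interleave(words, bboxes, width, height):
    """Build a target sequence [word, <loc_x0>, <loc_y0>, <loc_x1>, <loc_y1>, ...]."""
    seq = []
    for word, bbox in zip(words, bboxes):
        seq.append(word)
        seq.extend(f"<loc_{v}>" for v in quantize_bbox(bbox, width, height))
    return seq

# Example: two words detected on a 1000x1000-pixel page.
words = ["Invoice", "Total"]
bboxes = [(100, 50, 300, 90), (100, 800, 220, 840)]
print(interleave(words, bboxes, 1000, 1000))
```

A sequence of this shape lets a single autoregressive decoder learn text recognition and localization jointly, which is what allows a model like ViTLP to act as a native OCR system while remaining a generative language model.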

Authors (7)
  1. Zhiming Mao
  2. Haoli Bai
  3. Lu Hou
  4. Jiansheng Wei
  5. Xin Jiang
  6. Qun Liu
  7. Kam-Fai Wong