LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding (2404.05225v1)

Published 8 Apr 2024 in cs.CV and cs.CL

Abstract: Recently, leveraging LLMs or multimodal LLMs (MLLMs) for document understanding has proven highly promising. However, previous works that employ LLMs/MLLMs for document understanding have not fully explored or utilized document layout information, which is vital for precise document understanding. In this paper, we propose LayoutLLM, an LLM/MLLM-based method for document understanding. The core of LayoutLLM is a layout instruction tuning strategy, specially designed to enhance the comprehension and utilization of document layouts. The proposed layout instruction tuning strategy consists of two components: Layout-aware Pre-training and Layout-aware Supervised Fine-tuning. To capture the characteristics of document layout in Layout-aware Pre-training, three groups of pre-training tasks are introduced, corresponding to document-level, region-level, and segment-level information. Furthermore, a novel module called layout chain-of-thought (LayoutCoT) is devised to enable LayoutLLM to focus on regions relevant to the question and generate accurate answers. LayoutCoT effectively boosts document understanding performance, and it brings a degree of interpretability that can facilitate manual inspection and correction. Experiments on standard benchmarks show that LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding. The training data of LayoutLLM is publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/LayoutLLM
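
The abstract does not specify how layout information is serialized for the LLM or how LayoutCoT structures its reasoning, so the sketch below is purely illustrative. It assumes OCR segments carry normalized bounding boxes and shows one plausible way to interleave text with coordinates in a prompt, followed by a two-step "locate the region, then answer" instruction in the spirit of LayoutCoT. The segment format, coordinate convention, and prompt wording are assumptions, not the authors' implementation.

```python
# Minimal illustrative sketch of layout-aware prompting in the spirit of
# LayoutLLM's layout instruction tuning. The serialization format, the
# 0-1000 coordinate normalization, and the LayoutCoT-style instruction
# below are ASSUMPTIONS for illustration, not the paper's actual method.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed 0-1000 scale

def serialize_layout(segments: List[Segment]) -> str:
    """Interleave each text segment with its bounding box so the model
    can condition on spatial position as well as content."""
    return "\n".join(
        f"<{s.box[0]},{s.box[1]},{s.box[2]},{s.box[3]}> {s.text}"
        for s in segments
    )

def build_layoutcot_prompt(segments: List[Segment], question: str) -> str:
    """Assemble a prompt that asks the model to first locate the
    question-relevant region, then answer from it -- mirroring the
    region-focusing idea behind LayoutCoT."""
    doc = serialize_layout(segments)
    return (
        "Document (each line: <x0,y0,x1,y1> text):\n"
        f"{doc}\n\n"
        f"Question: {question}\n"
        "Step 1: Identify the region relevant to the question.\n"
        "Step 2: Answer using only that region."
    )

# Usage with a toy invoice
segments = [
    Segment("Invoice No: 4821", (120, 40, 430, 70)),
    Segment("Total Due: $1,250.00", (120, 900, 480, 930)),
]
print(build_layoutcot_prompt(segments, "What is the total due?"))
```

An intermediate "which region?" step like this is also what gives the approach its interpretability: a reviewer can check the selected region before trusting the final answer.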

Authors (6)
  1. Chuwei Luo (8 papers)
  2. Yufan Shen (5 papers)
  3. Zhaoqing Zhu (10 papers)
  4. Qi Zheng (62 papers)
  5. Zhi Yu (33 papers)
  6. Cong Yao (70 papers)
Citations (20)