LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding (2403.14252v1)
Abstract: This paper proposes LayoutLLM, a more flexible document analysis method for understanding document images. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have gained significant attention due to their practical importance. Existing methods enhance document comprehension through pre-training that jointly models images, text, and layout structure. However, they require fine-tuning for each task and dataset, and the resulting models are expensive to train and operate. To overcome this limitation, we propose LayoutLLM, which integrates such document understanding models with large language models (LLMs). By combining the strengths of existing document image understanding research with the superior language understanding capabilities of LLMs, the proposed model, fine-tuned on multimodal instruction datasets, handles document image understanding tasks within a single model. Our experiments demonstrate improvements over the baseline model on various document analysis tasks.
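As a rough illustration of the architecture described above (a document understanding encoder coupled to an LLM and trained with multimodal instruction data), the following minimal PyTorch sketch shows one way such a coupling could be wired. It is an assumption-laden sketch, not the authors' implementation: the class `LayoutLLMSketch`, the projection layer `proj`, and the stand-in encoder/decoder modules are all hypothetical names introduced here for illustration.

```python
# Conceptual sketch (not the paper's implementation): project features from a
# pre-trained document-understanding encoder into an LLM's embedding space and
# prepend them to the embedded instruction, so a single model answers document
# analysis instructions.
import torch
import torch.nn as nn


class LayoutLLMSketch(nn.Module):
    def __init__(self, doc_encoder: nn.Module, llm: nn.Module,
                 doc_dim: int, llm_dim: int):
        super().__init__()
        self.doc_encoder = doc_encoder            # stand-in for a LayoutLM-style document encoder
        self.proj = nn.Linear(doc_dim, llm_dim)   # maps document features to the LLM embedding size
        self.llm = llm                            # stand-in for a decoder-only language model

    def forward(self, doc_inputs: torch.Tensor,
                instruction_embeds: torch.Tensor) -> torch.Tensor:
        # Encode the document (image/text/layout features) into token-like vectors.
        doc_feats = self.doc_encoder(doc_inputs)        # (B, N_doc, doc_dim)
        doc_tokens = self.proj(doc_feats)               # (B, N_doc, llm_dim)
        # Condition the LLM on document tokens followed by the instruction tokens.
        inputs_embeds = torch.cat([doc_tokens, instruction_embeds], dim=1)
        return self.llm(inputs_embeds)


if __name__ == "__main__":
    doc_dim, llm_dim = 768, 4096
    dummy_encoder = nn.Linear(128, doc_dim)     # placeholder for a pre-trained document encoder
    dummy_llm = nn.Linear(llm_dim, llm_dim)     # placeholder for an LLM backbone
    model = LayoutLLMSketch(dummy_encoder, dummy_llm, doc_dim, llm_dim)
    doc_inputs = torch.randn(2, 196, 128)       # fake document patch/token features
    instruction = torch.randn(2, 32, llm_dim)   # fake embedded instruction tokens
    print(model(doc_inputs, instruction).shape) # (2, 228, 4096)
```

In practice the document encoder would be a pre-trained layout-aware model and the LLM a pre-trained decoder, with instruction tuning updating only the connector and (optionally) the LLM; those training details are outside this sketch.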