Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs (2311.13194v2)
Abstract: In the field of document understanding, significant advances have been made by fine-tuning Multimodal Large Language Models (MLLMs) on instruction-following data. Nevertheless, the potential of text-grounding in text-rich scenarios remains underexplored. In this paper, we present a text-grounding document understanding model, termed TGDoc, which addresses this deficiency by equipping MLLMs with the ability to discern the spatial position of text within images. Empirical evidence suggests that text-grounding improves the model's interpretation of textual content, thereby elevating its proficiency in comprehending text-rich images. Specifically, we compile a dataset containing 99K PowerPoint presentations sourced from the internet. We formulate instruction-tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and the LLM. Moreover, we curate a collection of text-rich images and prompt the text-only GPT-4 to generate 12K high-quality conversations that reference text locations in text-rich scenarios. By integrating text location data into the instructions, TGDoc learns to discern text locations during visual question answering. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating its effectiveness.
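To make the data-construction idea concrete, below is a minimal sketch of how text-grounding instruction-tuning samples for the three tasks named in the abstract (detection, recognition, spotting) might be assembled from OCR-style annotations. The coordinate serialization, field names, and the `build_samples` helper are illustrative assumptions, not the authors' released format; the paper only states that text locations are integrated into the instructions.

```python
# A minimal sketch (not the TGDoc release) of grounded instruction-tuning samples.
# Assumptions: boxes are (x1, y1, x2, y2) normalized to [0, 1] and serialized as
# "<box>...</box>" strings; the exact format used by the paper may differ.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextRegion:
    text: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized


def format_box(box: Tuple[float, float, float, float]) -> str:
    """Serialize a normalized box as a token-friendly string."""
    return "<box>" + ",".join(f"{v:.3f}" for v in box) + "</box>"


def build_samples(regions: List[TextRegion]) -> List[dict]:
    """Create detection, recognition, and spotting samples from text regions."""
    samples = []
    # Text detection: ask only for locations.
    samples.append({
        "instruction": "Detect all text regions in the image and give their bounding boxes.",
        "response": " ".join(format_box(r.box) for r in regions),
    })
    # Text recognition: ask for the text at a given location.
    for r in regions:
        samples.append({
            "instruction": f"What is written inside {format_box(r.box)}?",
            "response": r.text,
        })
    # Text spotting: ask for each string together with its location.
    samples.append({
        "instruction": "Read all text in the image and give each string with its bounding box.",
        "response": " ".join(f"{r.text} {format_box(r.box)}" for r in regions),
    })
    return samples


if __name__ == "__main__":
    regions = [
        TextRegion("Quarterly Report", (0.12, 0.05, 0.88, 0.12)),
        TextRegion("Revenue up 14%", (0.15, 0.40, 0.60, 0.47)),
    ]
    for s in build_samples(regions):
        print(s)
```

Pairing such grounded samples with ordinary visual question answering data is one plausible way the alignment between the visual encoder and the LLM described in the abstract could be encouraged to attend to spatial text cues.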