TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding (2404.09797v1)
Abstract: The advent of Large Multimodal Models (LMMs) has sparked a surge in research aimed at harnessing their remarkable reasoning abilities. However, for understanding text-rich images, challenges persist in fully leveraging the potential of LMMs, and existing methods struggle with effectively processing high-resolution images. In this work, we propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding. TextCoT utilizes the captioning ability of LMMs to grasp the global context of the image and the grounding capability to examine local textual regions. This allows for the extraction of both global and local visual information, facilitating more accurate question-answering. Technically, TextCoT consists of three stages: image overview, coarse localization, and fine-grained observation. The image overview stage provides a comprehensive understanding of the global scene information, and the coarse localization stage approximates the image area containing the answer based on the question asked. Then, integrating the obtained global image descriptions, the final stage further examines specific regions to provide accurate answers. Our method is free of extra training, offering immediate plug-and-play functionality. Extensive experiments are conducted on a series of text-rich image question-answering benchmark datasets based on several advanced LMMs, and the results demonstrate the effectiveness and strong generalization ability of our method. Code is available at https://github.com/bzluan/TextCoT.
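To make the three-stage pipeline concrete, below is a minimal sketch of how the overview, coarse localization, and fine-grained observation stages could be chained around an off-the-shelf LMM. The callable LMM interface, the prompt wording, and the `parse_bbox` helper are assumptions made for illustration only; the authors' actual prompts and implementation are in the linked repository.

```python
# Hedged sketch of a TextCoT-style three-stage prompting loop.
# Assumption: the LMM is exposed as a callable taking (image, prompt) -> text.
import re
from typing import Callable

from PIL import Image

LMM = Callable[[Image.Image, str], str]


def parse_bbox(text: str, width: int, height: int) -> tuple[int, int, int, int]:
    """Extract the first 'x1, y1, x2, y2' box from the model reply (hypothetical format)."""
    nums = [int(n) for n in re.findall(r"\d+", text)[:4]]
    if len(nums) < 4:
        return (0, 0, width, height)  # fall back to the full image
    x1, y1, x2, y2 = nums
    return (max(0, x1), max(0, y1), min(width, x2), min(height, y2))


def textcot_answer(lmm: LMM, image: Image.Image, question: str) -> str:
    # Stage 1: image overview -- obtain a global caption of the scene.
    caption = lmm(image, "Describe this image in detail.")

    # Stage 2: coarse localization -- ask which region likely contains the answer.
    loc_reply = lmm(
        image,
        f"Question: {question}\n"
        "Return the bounding box (x1, y1, x2, y2) of the image region "
        "that contains the answer.",
    )
    box = parse_bbox(loc_reply, *image.size)

    # Stage 3: fine-grained observation -- zoom into the localized region and
    # answer using the global caption as extra context.
    zoomed = image.crop(box)
    return lmm(
        zoomed,
        f"Image description: {caption}\nQuestion: {question}\nAnswer concisely.",
    )
```

Any multimodal model exposed through such a callable could, in principle, be plugged in without additional training; the staging is realized purely through prompting and cropping, which matches the plug-and-play claim in the abstract.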
Authors: Bozhi Luan, Hao Feng, Hong Chen, Yonghui Wang, Wengang Zhou, Houqiang Li