Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation (2401.10005v2)
Abstract: The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also possess explicit reasoning capabilities. This paper presents a novel approach to developing a VLM able to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask questions to acquire the necessary knowledge, thereby enhancing the robustness and explainability of the reasoning process. To this end, we developed a novel dataset generated by an LLM, designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones such as caption generation to specialized VQA tasks that require expert knowledge. Furthermore, using the dataset we created, we fine-tuned an existing VLM. This training enabled the model to generate questions and perform iterative reasoning during inference. The results demonstrate a step toward a more robust, accurate, and interpretable VLM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.
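The following is a minimal sketch, not the authors' implementation, of the kind of iterative "reason, ask, answer" inference loop the abstract describes. The `vlm.generate` interface, the `answer_question` oracle, and the "Question:" / "Final answer:" step prefixes are assumptions made for illustration only.

```python
# Hypothetical sketch of chain-of-thought inference in which the VLM may ask
# questions to acquire missing knowledge before committing to an answer.

def iterative_reasoning(vlm, image, instruction, answer_question, max_steps=5):
    """Run step-by-step inference, letting the model ask questions when it
    lacks the knowledge needed to answer the instruction about the image."""
    context = instruction
    step = ""
    for _ in range(max_steps):
        # Generate one reasoning step conditioned on the image and the
        # accumulated context (hypothetical generate() signature).
        step = vlm.generate(image=image, prompt=context)
        context += "\n" + step
        if step.startswith("Question:"):
            # The model asked for missing information; obtain an answer from
            # an external source (human, knowledge base, or another model)
            # and append it so the next step can use it.
            reply = answer_question(step.removeprefix("Question:").strip())
            context += "\nAnswer: " + reply
        elif step.startswith("Final answer:"):
            return step.removeprefix("Final answer:").strip()
    return step  # fall back to the last generated step
```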
Authors: Kohei Uehara, Nabarun Goswami, Hanqin Wang, Toshiaki Baba, Kohtaro Tanaka, Tomohiro Hashimoto, Kai Wang, Rei Ito, Takagi Naoya, Ryo Umagami, Yingyi Wen, Tanachai Anakewat, Tatsuya Harada