Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation (2401.10005v2)

Published 18 Jan 2024 in cs.CV and cs.CL

Abstract: The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to developing a VLM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. To this end, we developed a novel dataset generated by a large language model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge. Furthermore, using the dataset we created, we fine-tuned an existing VLM. This training enabled the model to generate questions and perform iterative reasoning during inference. The results demonstrate a stride toward a more robust, accurate, and interpretable VLM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.
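The abstract describes an inference procedure in which the fine-tuned VLM interleaves chain-of-thought steps with clarification questions whose answers are fed back into the reasoning context. The sketch below illustrates that loop in Python; the control flow follows the abstract's description, but every name (`generate_step`, `answer_question`, `ASK_PREFIX`, `FINAL_PREFIX`) is a hypothetical placeholder, not the authors' actual implementation or API.

```python
# Minimal sketch (assumed structure) of the "reason, then ask when knowledge is
# missing" inference loop described in the abstract. All identifiers here are
# hypothetical placeholders standing in for the paper's fine-tuned VLM and its
# external knowledge source.

ASK_PREFIX = "Question:"   # model signals it needs additional knowledge
FINAL_PREFIX = "Answer:"   # model signals its final answer
MAX_STEPS = 5              # cap on reasoning/question rounds

def generate_step(image, context):
    """Stub for one decoding step of the fine-tuned VLM over the image and the
    accumulated reasoning context. A real implementation would call the model."""
    return "Answer: a person riding a bicycle"

def answer_question(question):
    """Stub for the knowledge source (e.g. a human or a retrieval module) that
    answers the model's clarification question."""
    return "unknown"

def reason_with_questions(image, instruction):
    context = [instruction]
    for _ in range(MAX_STEPS):
        step = generate_step(image, "\n".join(context))
        context.append(step)
        if step.startswith(FINAL_PREFIX):
            # Model is confident enough to commit to an answer.
            return step[len(FINAL_PREFIX):].strip(), context
        if step.startswith(ASK_PREFIX):
            # Model asked for missing knowledge; inject the reply and continue.
            reply = answer_question(step[len(ASK_PREFIX):].strip())
            context.append(f"Reply: {reply}")
    return None, context  # no final answer within MAX_STEPS rounds

if __name__ == "__main__":
    answer, trace = reason_with_questions(
        image=None, instruction="What is happening in the image?"
    )
    print(answer)
    print("\n".join(trace))
```

The key design point the abstract emphasizes is that question generation is part of inference rather than a separate module: the same model decides, step by step, whether to continue reasoning, ask for information, or answer.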

Authors (13)
  1. Kohei Uehara (9 papers)
  2. Nabarun Goswami (8 papers)
  3. Hanqin Wang (2 papers)
  4. Toshiaki Baba (1 paper)
  5. Kohtaro Tanaka (1 paper)
  6. Tomohiro Hashimoto (2 papers)
  7. Kai Wang (624 papers)
  8. Rei Ito (2 papers)
  9. Takagi Naoya (1 paper)
  10. Ryo Umagami (2 papers)
  11. Yingyi Wen (6 papers)
  12. Tanachai Anakewat (1 paper)
  13. Tatsuya Harada (142 papers)
Citations (2)