
Compositional Chain-of-Thought Prompting for Large Multimodal Models (2311.17076v3)

Published 27 Nov 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: The combination of strong visual backbones and LLM reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs)--a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, scene graph data requires scene graph annotations, which are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs. Code: https://github.com/chancharikmitra/CCoT

Compositional Chain-of-Thought Prompting for Large Multimodal Models

The paper "Compositional Chain-of-Thought Prompting for Large Multimodal Models" presents a novel approach to enhance the compositional reasoning capabilities of Large Multimodal Models (LMMs) without the necessity for model fine-tuning or costly annotated scene graph data. This approach addresses a significant challenge prevalent in LMMs: their current limitation in effectively capturing the compositionality of visual scenes, especially in terms of object attributes and inter-object relationships.

Overview of the Methodology

The proposed method, termed Compositional Chain-of-Thought (CCoT), employs a zero-shot chain-of-thought prompting strategy. The method consists of two stages (a minimal prompting sketch follows the list):

  1. Scene Graph Generation: The LMM is first prompted to generate a scene graph (SG) representation of the image, extracting compositional knowledge from the visual input. Scene graphs are structured descriptions that encode objects together with their attributes and relationships, aligning more closely with the intrinsic complexity of visual data than plain image captions. Notably, this SG generation requires no ground-truth annotations; the LMM's own capabilities are leveraged to produce the graph on the fly.
  2. Response Generation: The generated SG is then inserted as an intermediate reasoning step in a second prompt, giving the LMM a structured basis for its response to the multimodal task at hand. This step builds on zero-shot chain-of-thought (CoT) reasoning, supplying the LMM with a rich contextual foundation without any further training. Because the SG encodes detailed image context, it sharpens the LMM's ability to comprehend and respond to complex visual scenes and tasks.
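
As a concrete illustration of the two stages, the sketch below wires the prompting pipeline around a generic LMM inference call. The `query_lmm` callable and the exact prompt wording are placeholders paraphrased from the method description, not taken from the paper's released code:

```python
# Minimal sketch of the two-stage CCoT prompting pipeline.
# `query_lmm(image, prompt)` is a placeholder for any LMM inference call
# (e.g. a LLaVA-1.5 or GPT-4V wrapper); it is not part of the paper's code.

SG_PROMPT = (
    "For the provided image and its associated question, generate a scene graph "
    "in JSON format that includes: (1) objects relevant to answering the question, "
    "(2) their attributes, and (3) the relationships between them."
)

ANSWER_PREFIX = "Use the image and the scene graph as context and answer the following question:"


def ccot_answer(query_lmm, image, question: str) -> str:
    """Answer a visual question with Compositional Chain-of-Thought prompting."""
    # Stage 1: scene graph generation -- no ground-truth SG annotations needed.
    scene_graph = query_lmm(image, f"{SG_PROMPT}\nQuestion: {question}")

    # Stage 2: response generation conditioned on the generated scene graph.
    final_prompt = f"Scene Graph:\n{scene_graph}\n\n{ANSWER_PREFIX}\n{question}"
    return query_lmm(image, final_prompt)
```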

Results and Implications

Quantitative evaluation shows that CCoT significantly boosts LMM performance across several benchmark datasets, including Winoground, WHOOPS!, SEEDBench, MMBench, and LLaVA-Bench. Improvements are evident both in general multimodal reasoning and in tasks demanding high compositionality. Notably, this gain is achieved without catastrophic forgetting, a common problem when models undergo fine-tuning.

Contributions and Theoretical Implications

  • Enhanced Multimodal Understanding: CCoT systematically leverages intermediate SGs to provide richer contextual representations, improving the LMM's compositional and multimodal reasoning. This contrasts with traditional end-to-end models, which can overlook the finer semantic relationships and compositional nuances present in visual scenes.
  • Broad Applicability: Unlike methods that require fine-tuning on domain-specific datasets, the zero-shot nature of CCoT allows it to be applied across different LMM architectures, showcasing its flexibility and computational efficiency.
  • Structured Representation Advantage: The paper argues convincingly for scene graphs over traditional image captions. The hierarchical, structured nature of SGs captures more of the syntactic and semantic complexity of real-world scenes, aiding tasks that involve intricate visual relationships (see the illustrative comparison below).
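
To make the contrast concrete, the snippet below compares a flat caption with a scene-graph encoding of the same hypothetical scene. The JSON-style schema mirrors the objects/attributes/relationships structure described above and is illustrative only, not taken verbatim from the paper:

```python
# Hypothetical example: a caption vs. a scene graph for the same image.
caption = "A woman in a red jacket is walking a small dog on a leash."

scene_graph = {
    "objects": [
        {"id": "woman", "attributes": ["wearing red jacket"]},
        {"id": "dog", "attributes": ["small"]},
        {"id": "leash", "attributes": []},
    ],
    "relationships": [
        {"subject": "woman", "predicate": "walking", "object": "dog"},
        {"subject": "dog", "predicate": "attached to", "object": "leash"},
        {"subject": "woman", "predicate": "holding", "object": "leash"},
    ],
}
# The structured form makes attribute bindings ("red" -> jacket, "small" -> dog)
# and inter-object relations explicit, which is exactly what compositional
# benchmarks probe and what flat captions tend to blur.
```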

Future Directions

The promising results from Compositional Chain-of-Thought prompting open several avenues for future work. One direction is to expand the SG generation prompts to cover additional dimensions of visual context, such as temporal changes or more abstract conceptual mappings, further enriching the information available to LMMs. Evaluating the scalability of CCoT on larger and more diverse datasets could also offer deeper insight into the robustness and adaptability of the method in varied settings.

Overall, the CCoT methodology makes a significant contribution to visual reasoning and multimodal machine learning. It offers an innovative, low-cost, and effective strategy for improving the compositional understanding of LMMs without the high computational and data annotation costs traditionally associated with such improvements.

Authors (4)
  1. Chancharik Mitra
  2. Brandon Huang
  3. Trevor Darrell
  4. Roei Herzig