CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs (2401.02582v1)
Abstract: In the pursuit of AGI, a critical capability for large models is interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason about and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach for multi-input multimodal models. This method asks LMMs to first compare the similarities and differences among the multiple input images, and then uses those identified similarities and differences to guide the models in answering detailed questions about the multi-image inputs. Our experimental results demonstrate CoCoT's effectiveness in enhancing the multi-image comprehension capabilities of large multimodal models.
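At the prompt level, the idea described above amounts to wrapping the user's question in an instruction that first elicits a comparison of the input images and then conditions the answer on that comparison. The sketch below illustrates this, assuming a generic multi-image LMM call; the function names (`build_cocot_prompt`, `call_lmm`), the exact instruction wording, and the single-pass formulation are illustrative assumptions rather than the authors' released implementation.

```python
from typing import Callable, List

# Hypothetical stand-in for any multi-image LMM API (e.g., GPT-4V or Gemini):
# it takes a list of image paths/URLs and a text prompt and returns a text answer.
LMMCaller = Callable[[List[str], str], str]


def build_cocot_prompt(question: str) -> str:
    """Wrap a question in a Contrastive Chain-of-Thought style instruction.

    The wording is an illustrative guess at the idea in the abstract: ask the
    model to compare the images first, then answer based on that comparison.
    """
    return (
        "You are given multiple images. First, carefully compare them and "
        "list their similarities and differences. Then, based on those "
        "similarities and differences, answer the following question:\n"
        f"{question}"
    )


def cocot_answer(call_lmm: LMMCaller, images: List[str], question: str) -> str:
    """Query an LMM with a CoCoT-style prompt over several input images."""
    return call_lmm(images, build_cocot_prompt(question))


# Example usage with a dummy model call (replace with a real LMM client).
if __name__ == "__main__":
    dummy_lmm: LMMCaller = lambda imgs, prompt: f"[answer based on {len(imgs)} images]"
    print(cocot_answer(dummy_lmm, ["img_a.png", "img_b.png"],
                       "Which image shows the older building?"))
```

Whether the comparison and the final answer are produced in one pass (as above) or in two separate model calls is a design choice not specified in the abstract.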
Authors: Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan Yao, Mingkai Chen, Jiebo Luo