Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
The paper "Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization" introduces a novel framework, UV-CoT, for enhancing the interpretability and problem-solving capabilities of multimodal LLMs (MLLMs) through unsupervised learning methods. Unlike traditional models that heavily rely on textual chain-of-thought (CoT), UV-CoT leverages visual cues, addressing the underexplored area of visual CoT.
Overview
The core innovation of this work is its approach to incorporating image-level reasoning into multimodal models without extensive human-annotated bounding-box data. The authors propose UV-CoT, which optimizes the reasoning process via preference comparisons between model-generated bounding boxes, eliminating the dependency on labeled data.
Methodology
The methodology is structured around two main components: automatic preference data generation and preference optimization. First, UV-CoT employs an unsupervised pipeline for preference data generation. Given an image, the target MLLM generates multiple seed bounding boxes via a template prompt; an external evaluator MLLM then assigns preference scores to the responses grounded in each box. This automatic pipeline sidesteps the extensive labeled datasets that traditional methods require.
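The sketch below illustrates this pipeline under stated assumptions: `generate_regions`, `answer_with_region`, and `score` are hypothetical method names standing in for the target and evaluator MLLMs' interfaces, and the adjacent-rank pairing is one plausible way to form preference pairs, not necessarily the paper's exact scheme.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    image: object      # input image
    question: str
    preferred: str     # response grounded in the higher-scored region
    dispreferred: str  # response grounded in the lower-scored region
    score_gap: float   # evaluator score difference, consumed later by sDPO

def build_preference_data(target_model, evaluator_model, image, question,
                          num_seeds: int = 4) -> list[PreferencePair]:
    """Generate seed bounding boxes, score the resulting responses with an
    external evaluator MLLM, and pair them into preference data."""
    # 1. The target MLLM proposes candidate regions via a template prompt.
    boxes = target_model.generate_regions(image, question, n=num_seeds)

    # 2. Each region conditions a full response; the evaluator scores it.
    scored = []
    for box in boxes:
        response = target_model.answer_with_region(image, question, box)
        scored.append((evaluator_model.score(image, question, response),
                       response))

    # 3. Rank by score and pair adjacent responses as preferred/dispreferred.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [PreferencePair(image, question, y_w, y_l, s_w - s_l)
            for (s_w, y_w), (s_l, y_l) in zip(scored, scored[1:])]
```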
Second, UV-CoT introduces Score-DPO (sDPO), a refined variant of Direct Preference Optimization (DPO). sDPO optimizes the target MLLM with the preference scores themselves, so the model learns not only the ranking of responses but also the intensity of each preference. This finer-grained signal sharpens the model's focus on key visual regions and improves its reasoning ability.
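A minimal PyTorch sketch of such a score-aware DPO loss follows. The standard DPO objective is kept, and the evaluator score gap is folded in as a margin offset; this offset form is our assumption about how the scores enter the loss, not necessarily the paper's exact formulation, and `beta`/`gamma` are illustrative hyperparameters.

```python
import torch.nn.functional as F

def sdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, score_gap,
              beta=0.1, gamma=0.05):
    """Score-aware DPO loss (sketch).

    logp_w, logp_l         : summed log-probs of the preferred / dispreferred
                             responses under the target model ([batch] tensors).
    ref_logp_w, ref_logp_l : the same quantities under a frozen reference model.
    score_gap              : evaluator preference-score difference per pair.
    """
    # Implicit DPO reward margin between preferred and dispreferred responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Offsetting by the score gap forces strongly preferred pairs to be
    # separated more widely, encoding preference intensity, not just ranking.
    # (This offset is our assumed mechanism, not the paper's verbatim loss.)
    return -F.logsigmoid(margin - gamma * score_gap).mean()
```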
Experimental Results
Experiments across six datasets demonstrate the efficacy of UV-CoT, with performance surpassing state-of-the-art methods. In zero-shot testing on four additional datasets, UV-CoT also generalizes well, supporting its robustness on unseen data. Notably, it achieves these results without any labeled data, a significant stride in data efficiency.
Implications and Future Directions
The practical implications of this research are substantial. UV-CoT reduces the dependency on costly human-annotated datasets, paving the way for scalable and economical model training in applications such as object detection and visual question answering. The paper suggests that future work could focus on refining bounding-box accuracy and exploring adaptive learning mechanisms to further improve performance.
Theoretically, this approach could stimulate new unsupervised learning paradigms in AI research, potentially transferring the CoT reasoning concept to other modalities and domains.
Conclusion
UV-CoT represents a significant advance in integrating visual reasoning into multimodal models through unsupervised learning. By addressing key limitations of supervised techniques, the framework shows substantial headroom for improving visual comprehension, especially in challenging scenarios where traditional methods fall short. The results presented in this paper point to promising directions for further exploration of AI-driven reasoning mechanisms.