Good Questions Help Zero-Shot Image Reasoning (2312.01598v2)
Abstract: Aligning recent LLMs with computer vision models yields large vision-language models (LVLMs), which have paved the way for zero-shot image reasoning tasks. However, LVLMs are usually trained only on short, high-level captions that refer to sparse focus regions in images. Such "tunnel vision" prevents LVLMs from exploring other relevant contexts in complex scenes. To address this challenge, we introduce Question-Driven Visual Exploration (QVix), a novel prompting strategy that enhances the exploratory capabilities of LVLMs in zero-shot reasoning tasks. QVix leverages LLMs' strong language priors to generate input-exploratory questions with more details than the original query, guiding LVLMs to explore visual content more comprehensively and uncover subtle or peripheral details. QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment. Our evaluations on challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods, highlighting its effectiveness in bridging the gap between complex visual data and LVLMs' exploratory abilities.
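The two-stage prompting idea described in the abstract can be sketched in code. The sketch below is illustrative only and assumes user-supplied `llm_generate` (text-only LLM) and `lvlm_answer` (image + text LVLM) callables; the prompt wording and the `num_questions` parameter are assumptions, not the authors' implementation.

```python
from typing import Callable, List


def qvix_style_answer(
    image,                                       # image in whatever form your LVLM accepts
    query: str,                                  # the original task query, e.g. a VQA question
    llm_generate: Callable[[str], str],          # hypothetical text-only LLM call
    lvlm_answer: Callable[[object, str], str],   # hypothetical LVLM call (image + prompt)
    num_questions: int = 5,
) -> str:
    """Two-stage prompting in the spirit of QVix: an LLM first drafts
    input-exploratory questions, which are then prepended to the LVLM prompt
    so the model inspects more of the scene before answering."""
    # Stage 1: ask the text-only LLM for exploratory questions about visual
    # details that could be relevant to the query (prompt wording is illustrative).
    question_prompt = (
        f'Given the question: "{query}"\n'
        f"List {num_questions} short questions about visual details "
        "(objects, attributes, relations, background) that would help answer it."
    )
    exploratory_questions: List[str] = [
        line.strip("-• ").strip()
        for line in llm_generate(question_prompt).splitlines()
        if line.strip()
    ]

    # Stage 2: give the LVLM the image together with the exploratory questions
    # and the original query, so its answer draws on a wider exploration of the scene.
    lvlm_prompt = (
        "Consider the following questions while examining the image:\n"
        + "\n".join(f"- {q}" for q in exploratory_questions[:num_questions])
        + f"\n\nNow answer: {query}"
    )
    return lvlm_answer(image, lvlm_prompt)
```

In practice, `llm_generate` could wrap any instruction-tuned LLM and `lvlm_answer` any open LVLM (e.g., a BLIP-2- or LLaVA-style model); the key design choice is that question generation needs no image access, so the LVLM is only queried once with the enriched prompt.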
- VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
- See, think, confirm: Interactive prompting between vision and language models for knowledge-based visual reasoning. arXiv preprint arXiv:2301.05226, 2023.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
- Bridging machine learning and logical reasoning by abductive learning. Advances in Neural Information Processing Systems, 32, 2019.
- Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance. CoRR, abs/2305.17306, 2023.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023a.
- Fine-tuning multimodal LLMs to follow zero-shot demonstrative instructions, 2023b.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023a.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b.
- Do emergent abilities exist in quantized large language models: An empirical study. arXiv preprint arXiv:2307.08072, 2023c.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
- Fine-grained visual classification of aircraft. Technical report, 2013.
- Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
- OpenAI. GPT-4 technical report, 2023.
- Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012.
- Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649, 2015.
- Improving language understanding by generative pre-training. 2018.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Caltech-UCSD Birds-200-2011 (CUB-200-2011). Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
- Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Emergent abilities of large language models. Transactions on Machine Learning Research, 2022, 2022a.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b.
- Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
- Visual entailment task for visually-grounded language learning. arXiv preprint arXiv:1811.10582, 2018.
- Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019.
- A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 1–10, 2022.
- LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
- The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421, 2023.
- IdealGPT: Iteratively decomposing vision and language reasoning via large language models. arXiv preprint arXiv:2305.14985, 2023.
- Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
- DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In NeurIPS 2023, 2023.
- Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net, 2023.
- ChatGPT asks, BLIP-2 answers: Automatic questioning towards enriched visual descriptions. arXiv preprint arXiv:2303.06594, 2023a.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023b.
- Solving math word problems concerning systems of equations with GPT-3. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 15972–15979, 2023.
- Kaiwen Yang
- Tao Shen
- Xinmei Tian
- Xiubo Geng
- Chongyang Tao
- Dacheng Tao
- Tianyi Zhou