The paper "Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models" (Zhan et al., 27 May 2025 ) proposes a unified visual reasoning mechanism for Large Multimodal Models (LMMs) to address their limitations in compositional visual reasoning. Existing LMMs often rely on shortcut learning, directly mapping questions to answers, which hinders their ability to perform complex reasoning tasks and can lead to hallucinations. While methods like Chain-of-Thought (CoT) and toolkit-based approaches attempt to improve reasoning, they often require multiple inference steps or external tools, leading to inefficiencies and reduced generality.
The proposed unified mechanism introduces a human-like "understand-think-answer" process that operates in a single forward pass.
- Understand: The model first analyzes the question and image to determine the necessary information for answering the question and plans how to acquire it using intrinsic capabilities (like visual grounding, object detection, text recognition, captioning, etc.). It generates structured instructions or cues to guide the gathering of relevant visual information. This is a flexible process tailored to the specific question, rather than relying on fixed reasoning paths or external tool calls. If relevant information is not found, the model indicates this, avoiding misleading outputs.
- Think: Based on the visual cues gathered during the understanding phase, the model engages in self-prompted contextual thinking. Leveraging the reasoning capabilities of the underlying LLM, it integrates the gathered visual information with the original question to reason toward the answer.
- Answer: Finally, the model generates the ultimate answer based on the understanding and thinking processes. This entire sequence, from understanding to thinking to answering, is performed in a single autoregressive generation pass.
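To make the single-pass format concrete, here is a minimal sketch of how an understand-think-answer training target could be serialized as one autoregressive token sequence. The `<understand>`/`<think>`/`<answer>` tags and the cue format are illustrative assumptions, not the paper's actual delimiters.

```python
# Minimal sketch of a single-pass "understand-think-answer" target sequence.
# The tags and the cue format below are assumptions made for illustration;
# the paper's exact serialization may differ.

def build_target(understanding_cues: list[str], thinking: str, answer: str) -> str:
    """Assemble one autoregressive training target: the model first emits its
    understanding cues, then its reasoning, then the final answer, in order."""
    cue_block = "\n".join(f"- {cue}" for cue in understanding_cues)
    return (
        "<understand>\n" + cue_block + "\n</understand>\n"
        "<think>\n" + thinking + "\n</think>\n"
        "<answer>" + answer + "</answer>"
    )


if __name__ == "__main__":
    target = build_target(
        understanding_cues=[
            "locate: the red mug -> box [112, 85, 190, 160]",  # grounding cue (hypothetical coordinates)
            "locate: the spoon inside the mug -> not found",    # explicit miss avoids a misleading answer
        ],
        thinking="The mug is present but no spoon is detected inside it, "
                 "so the question's premise does not hold.",
        answer="No",
    )
    print(target)
```

Because all three phases live in one generated sequence, the answer tokens are conditioned on the model's own understanding and thinking tokens without any extra inference round.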
To train LMMs to follow this mechanism, the authors curated a dataset of 334K visual instruction samples. They developed a semi-automatic expert-supervised data engine for this purpose. The process involves:
- Progressive Annotation with AI Expert: Using a state-of-the-art LMM (Qwen2-VL-72B) as an AI expert to analyze questions, plan information acquisition steps, and annotate tasks it excels at (e.g., global captions, text recognition). Instructions guide the AI to identify key objects/entities and the necessary intrinsic capabilities.
- Curation with Human Expert: Human experts complete tasks the AI struggles with, particularly multi-object visual grounding or partial text recognition. They also review the AI-generated annotations for quality, ensuring the understanding process is logical and annotations are accurate, and filtering out overly simplistic samples.
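A rough sketch of this two-role annotation loop is given below. `query_ai_expert` and `human_review` are placeholder hooks standing in for the AI expert (Qwen2-VL-72B) and the human curation step; the returned fields are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of the semi-automatic, expert-supervised data engine: the AI expert
# drafts the understanding plan and the annotations it handles well, then a human
# expert completes the hard tasks, verifies quality, and filters trivial samples.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    image_path: str
    question: str
    ai_annotation: Optional[dict] = None
    accepted: bool = False


def query_ai_expert(sample: Sample) -> dict:
    """Stand-in for prompting the AI expert to analyze the question, plan which
    intrinsic capabilities are needed, and annotate the tasks it excels at
    (e.g., global captions, text recognition)."""
    return {
        "plan": ["identify key objects", "caption the scene", "read visible text"],
        "caption": "<AI-generated caption>",
        "groundings": [],  # multi-object grounding is deferred to the human expert
    }


def run_engine(samples: list[Sample],
               human_review: Callable[[Sample], bool]) -> list[Sample]:
    """AI annotation first, then human curation; only accepted samples are kept."""
    kept = []
    for s in samples:
        s.ai_annotation = query_ai_expert(s)
        s.accepted = human_review(s)  # completes hard tasks, checks logic, drops trivial cases
        if s.accepted:
            kept.append(s)
    return kept
```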
The curated dataset covers both general scenes and text-rich scenes and incorporates multi-task instruction-following data from various public datasets, including VQA datasets (GQA, VAW, VizWiz, ChartQA, DUE_Benchmark, TextVQA), instruction data (LLaVA, ALLaVA, LVIS-Instruct4V), and caption data (ShareGPT-4V).
Based on this mechanism and data, the authors developed Griffon-R, an LMM built upon the Griffon v2 architecture (Zhan et al., 14 Mar 2024). Griffon-R utilizes a single-branch, high-resolution structure with a visual encoder (CLIP-ViT-L/14-336), a vision-language connector, and an LLM (Gemma 9B). It is trained in a multi-stage process:
- Stage 1: Pretraining the vision-language connector with visual captioning data (ShareGPT-4V).
- Stage 2: Pretraining the whole model on a diverse set of perception tasks including Referring Expression Comprehension/Generation, Visual Grounding, and Object Detection, as well as general language and instruction following data.
- Stage 3: Fine-tuning the whole model using the curated visual reasoning data combined with other general VQA and instruction data. Standard cross-entropy loss is used.
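The staged schedule can be summarized as a simple configuration, as sketched below. The module names (visual_encoder, connector, llm) mirror the architecture described above; hyperparameters are intentionally omitted because the summary does not report them, and the helper function is a generic freezing utility rather than the authors' code.

```python
# Sketch of the three training stages described above; data lists paraphrase the text.
from typing import Iterable

STAGES = [
    {
        "name": "stage1_connector_pretrain",
        "trainable": ["connector"],                           # vision-language connector only
        "data": ["ShareGPT-4V captions"],
    },
    {
        "name": "stage2_perception_pretrain",
        "trainable": ["visual_encoder", "connector", "llm"],  # whole model
        "data": ["REC/REG", "visual grounding", "object detection",
                 "general language and instruction following"],
    },
    {
        "name": "stage3_reasoning_finetune",
        "trainable": ["visual_encoder", "connector", "llm"],  # whole model
        "data": ["curated understand-think-answer data (334K)",
                 "general VQA", "instruction data"],
        "loss": "token-level cross-entropy over the full output sequence",
    },
]


def set_trainable(named_parameters: Iterable, trainable_modules: list[str]) -> None:
    """Freeze everything, then unfreeze parameters whose names start with a module
    listed for the current stage (PyTorch-style (name, param) pairs assumed)."""
    for name, param in named_parameters:
        param.requires_grad = any(name.startswith(m) for m in trainable_modules)
```

In a PyTorch setup, one would call `set_trainable(model.named_parameters(), stage["trainable"])` before starting each stage.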
Experiments were conducted on various visual reasoning benchmarks (CLEVR, VSR, GQA, V-Star, TallyQA) and general multimodal benchmarks (MMBench, ScienceQA, TextVQA, SEED, LLaVA Bench, POPE).
Results demonstrate that Griffon-R achieves state-of-the-art or competitive performance across these benchmarks, particularly excelling on compositional visual reasoning tasks such as VSR, CLEVR, V-Star, and TallyQA. It also performs strongly on general multimodal tasks and text-rich VQA, surpassing many advanced LMMs as well as methods built specifically around CoT or toolkits. Ablation studies confirm the effectiveness of both the proposed mechanism and the curated data: the quality of the understanding stage, measured by referring expression comprehension (REC) performance, is high; the mechanism achieves better accuracy and significantly faster inference than a toolkit-based approach on V-Star; and training with annotations curated to follow the mechanism proves crucial for robust visual reasoning.
The authors note limitations, including potentially longer response times in complex scenarios with many associated objects, since the generated output sequence grows, and the inheritance of limitations from the AI expert model used in data curation, such as occasional inaccuracies (mitigated by human verification). They also highlight data usage restrictions tied to the source models and datasets.
In conclusion, the paper presents a novel "understand-think-answer" mechanism and a corresponding high-quality dataset for training LMMs in compositional visual reasoning. The resulting model, Griffon-R, demonstrates improved reasoning capabilities and overall multimodal performance in an efficient, end-to-end manner without external tool reliance.