GRIT: Teaching MLLMs to Think with Images (2505.15879v1)

Published 21 May 2025 in cs.CV, cs.AI, and cs.CL

Abstract: Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thoughts prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.

Authors (9)
  1. Yue Fan (46 papers)
  2. Xuehai He (26 papers)
  3. Diji Yang (10 papers)
  4. Kaizhi Zheng (11 papers)
  5. Ching-Chen Kuo (5 papers)
  6. Yuting Zheng (4 papers)
  7. Sravana Jyothi Narayanaraju (1 paper)
  8. Xinze Guan (6 papers)
  9. Xin Eric Wang (74 papers)

Summary

Overview of the GRIT Method for Enhanced Multimodal LLM Reasoning

The paper "GRIT: Teaching MLLMs to Think with Images" presents an approach to advancing reasoning in Multimodal LLMs (MLLMs), with particular emphasis on their ability to reason over visual inputs. The method, Grounded Reasoning with Images and Texts (GRIT), integrates visual information directly into the reasoning process of MLLMs, substantially enhancing the models' interpretative and analytical capabilities on vision-language tasks.

Context and Motivation

Existing visual reasoning models rely primarily on natural language to articulate their reasoning and often fall short of incorporating explicit visual information into these reasoning chains. Because the generated reasoning content is not explicitly tied to image elements, the resulting outputs are insufficiently grounded.

To bridge this gap, GRIT introduces a novel framework where reasoning chains are formulated by interweaving natural language and explicit visual references, such as bounding box coordinates. These coordinates serve as direct pointers to relevant areas within an image, allowing the model to consult visual evidence explicitly during the reasoning phase.
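
The paper's exact serialization of these visual references is not reproduced here, but the idea can be illustrated with a minimal sketch in which boxes appear inline as [x1, y1, x2, y2] pixel coordinates; the inline format and the extract_boxes helper below are illustrative assumptions, not the paper's implementation:

```python
import re

# Hypothetical inline format: a box is written into the reasoning text as
# [x1, y1, x2, y2] pixel coordinates, e.g. "the sign [34, 120, 98, 180] reads..."
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")

def extract_boxes(reasoning: str) -> list[tuple[int, int, int, int]]:
    """Collect every bounding box mentioned in a grounded reasoning chain."""
    return [tuple(int(v) for v in m.groups())
            for m in BOX_PATTERN.finditer(reasoning)]

chain = "The player near the goal [412, 88, 501, 260] is unmarked, so the answer is yes."
print(extract_boxes(chain))  # [(412, 88, 501, 260)]
```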

Methodological Contributions

The core contribution of this work is a reinforcement learning technique called GRPO-GR, built upon the Group Relative Policy Optimization (GRPO) algorithm. The method emphasizes the following key aspects:

  1. Grounded Reasoning Paradigm: GRIT enables MLLMs to produce reasoning chains that include explicit mentions of visual regions (via bounding box coordinates). This grounded reasoning paradigm does not necessitate additional pixel inputs after the initial image processing, allowing the model to leverage its understanding of the input image autonomously.
  2. Data Efficiency: GRIT is strikingly data-efficient, requiring as few as roughly 20 image-question-answer triplets drawn from existing datasets. This demonstrates the method's robustness in producing coherent visual reasoning outputs without dependence on extensively annotated data.
  3. Reinforcement Learning with GRPO-GR: The GRPO-GR algorithm employs rewards focused on final-answer accuracy and the syntactic correctness of the grounded-reasoning format, encouraging models to integrate grounding and logical reasoning without requiring reasoning-chain or bounding-box annotations. A minimal sketch of such a reward appears after this list.
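
The sketch below assumes the model wraps its grounded reasoning in <think>...</think> tags followed by a final answer; the tag names, the box pattern, the exact-match answer check, and the equal weighting are all assumptions for illustration, not the paper's implementation (the paper's answer reward may well use softer matching than strict string equality):

```python
import re

# Assumed output shape: "<think> ... [x1, y1, x2, y2] ... </think> answer"
BOX = re.compile(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]")
SHAPE = re.compile(r"<think>(.*?)</think>\s*(\S.*)", re.DOTALL)

def format_reward(output: str) -> float:
    """Score the structure of a grounded-reasoning output: a reasoning
    block containing at least one bounding box, then a final answer.
    Only structure is checked, so no ground-truth boxes are needed."""
    m = SHAPE.search(output)
    if m is None:
        return 0.0
    reasoning, _answer = m.groups()
    return 1.0 if BOX.search(reasoning) else 0.5

def answer_reward(output: str, gold: str) -> float:
    """Crude exact-match accuracy on the text after the reasoning block."""
    answer = output.split("</think>")[-1].strip().lower()
    return 1.0 if answer == gold.strip().lower() else 0.0

def grpo_gr_style_reward(output: str, gold: str) -> float:
    # Equal weighting of format and accuracy is an assumption.
    return format_reward(output) + answer_reward(output, gold)
```

Because both terms can be computed from the raw model output and the gold answer alone, no reasoning-chain or bounding-box annotations are required, which is the property the paper highlights.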

Empirical Evaluation

The evaluation of GRIT involves training state-of-the-art models such as Qwen 2.5-VL and InternVL 3 with minimal training data. The models are tested on tasks requiring both visual question answering and grounding, and they demonstrate strong performance in producing grounded reasoning chains; the group-relative update underlying this training is sketched below.
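
GRPO-GR inherits GRPO's group-relative update: for each image-question pair the policy samples a group of candidate outputs, scores each with the reward, and standardizes rewards within the group to obtain advantages, avoiding a learned value model. A minimal sketch of that normalization (group size and any clipping details are omitted):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize each sampled response's reward
    against its own group, so no learned value model is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0] * len(rewards)  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]

# e.g. four sampled reasoning chains for one image-question pair
print(group_relative_advantages([2.0, 1.5, 0.5, 2.0]))
```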

  1. Integration of Reasoning and Grounding: The GRIT-trained models exhibit high coherence between textual reasoning components and visual references, consequently increasing the interpretability of the models' outputs in vision-language tasks.
  2. Improvement Over Traditional Methods: Compared to baselines that rely on traditional training methods, GRIT-trained models show superior performance in generating accurate and visually coherent reasoning, indicating a successful synthesis of the models' grounding and reasoning faculties.
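
Grounding quality in evaluations like these is commonly scored by intersection-over-union (IoU) between predicted and reference boxes; the helper below is a standard IoU computation, not a reproduction of the paper's specific metrics:

```python
def iou(a: tuple[float, float, float, float],
        b: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143
```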

Future Directions and Implications

The GRIT method’s implications for the future of AI are substantial. By allowing AI models to think and reason with images more effectively, GRIT paves the way for developing more nuanced AI systems capable of advanced interpretive tasks across various domains. Some potential future directions might include:

  • Expanding Data Diversity: While GRIT showcases efficiency with minimal data, increasing the diversity of training datasets could broaden the generalizability and robustness of the models further.
  • Scaling Up Models: Applying GRIT to even larger-scale models could further enhance its reasoning capabilities and facilitate its use in more complex and resource-intensive tasks.

Overall, the GRIT method underscores a pivotal advancement in the field of multimodal reasoning, setting a new standard for how visual data is integrated into machine reasoning processes. By doing so, it enhances both the practical applicability and theoretical understanding of MLLMs in handling multimodal tasks.