GThinker: Enhancing Multimodal Reasoning with Cue-Guided Rethinking
Introduction
The paper "GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking" addresses a critical gap in current Multimodal LLMs (MLLMs) concerning their proficiency in vision-centric multimodal reasoning tasks. Conventional MLLMs primarily rely on slow-thinking strategies, which, despite their effectiveness in domains like mathematics and science, often overlook the nuanced integration of visual cues necessary for solving complex visual reasoning tasks. To circumvent this limitation, the authors propose GThinker, an innovative model that introduces a cue-rethinking pattern, thereby enhancing the model's multimodal reasoning capabilities across a variety of scenarios.
Core Methodology
GThinker distinguishes itself through its Cue-Rethinking Pattern, a flexible long-chain reasoning framework grounded in the interpretation of visual cues. The pattern lets the model reason in free form rather than being constrained to rigid, structured formats, iteratively reassessing and reintegrating visual cues to resolve inconsistencies (a hypothetical sketch of such a trace follows below). The methodology rests on a two-stage training pipeline: Pattern-Guided Cold Start and Incentive Reinforcement Learning.
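To make the pattern concrete, the sketch below shows one way a cue-rethinking trace could be represented and serialized. The dataclass fields, the tag names (`<think>`, `<cue>`, `<rethink>`, `<answer>`), and the `render` helper are illustrative assumptions for exposition, not the paper's actual output format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical illustration of a cue-rethinking trace. Field names and tags
# are assumptions; the paper's exact trace format is not reproduced here.

@dataclass
class VisualCue:
    region: str          # e.g., "traffic sign in the upper-left corner"
    interpretation: str  # the model's reading of that region

@dataclass
class CueRethinkingTrace:
    free_form_reasoning: str                       # unconstrained long-chain reasoning
    cues: List[VisualCue] = field(default_factory=list)
    rethink_steps: List[str] = field(default_factory=list)  # revised readings of earlier cues
    answer: str = ""

def render(trace: CueRethinkingTrace) -> str:
    """Serialize the trace into a tagged string, mimicking how such a
    reasoning pattern might appear in model output (tags are illustrative)."""
    cue_block = "\n".join(
        f"<cue region='{c.region}'>{c.interpretation}</cue>" for c in trace.cues
    )
    rethink_block = "\n".join(f"<rethink>{s}</rethink>" for s in trace.rethink_steps)
    return (
        f"<think>\n{trace.free_form_reasoning}\n{cue_block}\n{rethink_block}\n</think>\n"
        f"<answer>{trace.answer}</answer>"
    )
```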
- Pattern-Guided Cold Start: This phase leverages a dataset of 7,358 high-quality annotated reasoning paths curated through a multimodal iterative annotation process. The data pipeline drafts reasoning paths with leading MLLMs and iteratively refines them to ensure high-quality annotations, encouraging the model to develop the flexible reasoning strategies needed across varied tasks and scenarios (see the annotation-loop sketch after this list).
- Incentive Reinforcement Learning: The subsequent phase applies the Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) algorithm. This reinforcement learning stage incentivizes the model to explore diverse reasoning paths, extending its reasoning ability beyond math and science to general multimodal scenarios (see the objective sketch after this list).
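The cold-start stage above mentions an iterative, MLLM-driven annotation process but not its exact procedure. The skeleton below is a hedged illustration of what a generate-critique-refine loop of that kind could look like; the `generate` and `critique` callables, the round budget, and the rejection behavior are all assumptions rather than details from the paper.

```python
from typing import Callable, Optional

# Hypothetical sketch of an iterative annotation loop for cold-start data.
# The generator and critic callables stand in for calls to strong MLLMs plus
# rule-based checks; the loop only illustrates the "iteratively refined" idea.

def annotate_example(
    question: str,
    answer: str,
    generate: Callable[[str, str, Optional[str]], str],  # drafts a reasoning path, optionally given feedback
    critique: Callable[[str, str, str], Optional[str]],  # returns feedback, or None if the path is acceptable
    max_rounds: int = 3,
) -> Optional[str]:
    """Draft a cue-grounded reasoning path and refine it until a critic
    accepts it or the round budget is exhausted; reject it otherwise."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        path = generate(question, answer, feedback)
        feedback = critique(question, answer, path)
        if feedback is None:   # critic found no issues
            return path
    return None                # discard paths that never pass review
```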
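For the reinforcement learning stage, the sketch below illustrates the two DAPO ingredients named above: an asymmetric ("decoupled") clipping range in the policy-gradient loss, and dynamic sampling that drops rollout groups whose rewards are all identical and therefore carry zero advantage. Tensor shapes, default epsilon values, and function names are assumptions for exposition, not details taken from the GThinker paper.

```python
import torch

def dapo_policy_loss(logp_new, logp_old, advantages, mask,
                     eps_low=0.2, eps_high=0.28):
    """Token-level clipped policy-gradient loss with asymmetric clip bounds.

    logp_new, logp_old: (batch, seq_len) per-token log-probs under the current
        and sampling policies; advantages: (batch, seq_len) group-normalized
        advantages; mask: (batch, seq_len) float tensor, 1.0 for response tokens.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.min(ratio * advantages, clipped * advantages)
    return -(per_token * mask).sum() / mask.sum().clamp(min=1.0)

def keep_group(rewards):
    """Dynamic sampling: keep a prompt's rollout group only if its rewards
    are not all identical (otherwise every advantage would be zero)."""
    return bool(rewards.max() > rewards.min())
```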
Key Results and Performance
GThinker delivers strong results across benchmarks, most notably on the challenging M³CoT benchmark, where it surpasses existing advanced models with 81.5% overall accuracy. The results also show that GThinker generalizes across mathematical and commonsense reasoning tasks, outpacing several contemporaries, including non-thinking large-scale models such as GPT-4o, on reasoning-centric benchmarks.
Additionally, the model's design improves its handling of tasks that hinge on visual content by leveraging the cue-rethinking mechanism. This is particularly evident in tasks requiring intricate visual reasoning, where GThinker integrates visual cues into the reasoning process more faithfully than comparable models.
Implications and Future Directions
The development of GThinker marks a crucial step toward more robust general reasoning capabilities in MLLMs, opening up pathways for more sophisticated multimodal reasoning applications. The research implies potential advancements in domains requiring deep comprehension of visual elements, from augmented reality to advanced human-computer interactions.
Despite its promising results, the paper acknowledges the limited availability of complex, publicly accessible datasets for multimodal reasoning. This constraint points to an area for future work: creating and expanding such datasets to further generalize and refine GThinker's broad applicability.
In summary, GThinker represents a significant stride toward aligning multimodal LLMs with intricate, real-world visual comprehension tasks, suggesting a promising trajectory for future research and applications in AI-driven multimodal reasoning.