V-Thinker: Vision-Centric Interactive Reasoning
- V-Thinker is a vision-centric paradigm where multimodal models actively interact with images using code-driven modifications and iterative reasoning.
- It integrates perceptual grounding, interactive visual reasoning within a sandboxed environment, and a data evolution flywheel for scalable dataset synthesis.
- The system employs reinforcement learning techniques, like Group Relative Policy Optimization, to optimize both reasoning accuracy and effective visual tool usage.
V-Thinker refers to a family of research directions, methodologies, and systems aimed at enabling large multimodal models (LMMs), especially vision-LLMs (VLMs), to perform vision-centric reasoning through interactive thinking with images. This paradigm advances the field from traditional image-assisted chain-of-thought reasoning—where models process images passively—to tightly coupled visual-linguistic reasoning where the model actively edits, manipulates, and reasons over visual states using code-driven tools, under the guidance of complex data curricula and interactive reinforcement learning. Key to V-Thinker are end-to-end architectures, scalable dataset synthesis, tool-driven visual reasoning, and benchmarks that emphasize not only answer correctness but the faithfulness and transparency of the multimodal reasoning process (Qiao et al., 6 Nov 2025).
1. Vision-Centric Interactive Reasoning: Definition and Motivation
V-Thinker formalizes a vision-centric interactive reasoning paradigm wherein LMMs are endowed with the capacity to directly interact with, edit, and annotate images as an integral part of their multi-step reasoning process. Unlike previous approaches—where visual inputs serve as static context and language chains dominate inference—V-Thinker systems alternate between logical thought and executable visual actions at each reasoning step. This design aims to:
- Overcome hallucination and linguistic shortcutting by forcing direct, stepwise grounding to the visual content.
- Enable explicit integration of human-like problem-solving tactics, namely persistent, iterative modification, visual exploration, and code-driven feedback.
- Bridge the gap between stepwise language CoT and human reasoning-in-visual-space, especially for domains like geometry, data visualization, and spatial math.
The shift to image-interactive thinking is motivated by the limitations encountered in chain-of-thought-only systems: disconnect from the actual image, insufficient fine-grained grounding, and evaluation frameworks that fail to penalize reasoning hallucinations or improper visual manipulations.
2. System Architecture and Components
V-Thinker systems are characterized by an explicit division between perceptual alignment, interactive visual reasoning, and tool-oriented code execution:
- Perceptual grounding stage: The model undergoes supervised pretraining to ground linguistic references in point-level and region-level visual elements. Datasets such as V-Perception-40K sample element types (points, lines, angles, circles), their relations, complexity (element count), and knowledge concepts to anchor visual reasoning.
- Interactive reasoning stage: The system operates in a sandboxed code-execution environment, typically leveraging a Python backend. At each reasoning step $t$, the model emits:
  - A linguistic rationale $r_t$
  - A code snippet $c_t$ (e.g., draw_line(...), label_point(...)), which is executed in the environment to yield the next visual state $v_{t+1}$.
Formally, the reasoning trajectory is $\tau = (v_0, r_1, c_1, v_1, \ldots, r_T, c_T, v_T)$: a mixture of natural language reasoning and visual edits that externalizes the agent's thought process in both modalities (see the loop sketch after this list).
- Data Evolution Flywheel: An automated mechanism that synthesizes and calibrates large-scale interactive reasoning datasets. This component enables exponential growth of knowledge domains and toolsets by employing co-evolution between knowledge concepts and tool combinations, rule-based repair, and human/LLM-in-the-loop validation for correctness and diversity.
- Visual Progressive Training Curriculum: Curriculum learning is structured in two stages: supervised fine-tuning on perception and code-driven reasoning, followed by reinforcement learning (RL) to optimize the full vision-centric reasoning pipeline.
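As a concrete illustration of the interactive reasoning stage, the following minimal sketch alternates rationale generation, sandboxed code execution, and visual-state updates. The interfaces (`model.generate_step`, `sandbox.execute`) and tool names are hypothetical placeholders under the stated assumptions, not the released V-Thinker API.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Alternating record of visual states, rationales, and code edits."""
    states: list = field(default_factory=list)      # v_0, v_1, ..., v_T
    rationales: list = field(default_factory=list)  # r_1, ..., r_T
    code_edits: list = field(default_factory=list)  # c_1, ..., c_T

def interactive_reasoning(model, sandbox, image, question, max_steps=8):
    """Run one vision-centric reasoning episode: at each step the model
    emits a rationale r_t and a code snippet c_t; the sandbox executes
    c_t on the current visual state v_t to produce v_{t+1}."""
    traj = Trajectory(states=[image])
    v_t = image
    for t in range(max_steps):
        # Hypothetical interface: the model conditions on the question,
        # the current visual state, and the partial trajectory.
        rationale, code, done = model.generate_step(question, v_t, traj)
        traj.rationales.append(rationale)
        traj.code_edits.append(code)
        if done:  # the model signals it has reached a final answer
            break
        # Execute the edit (e.g., draw_line(...), label_point(...)) in a
        # sandboxed Python environment to obtain the next visual state.
        v_t = sandbox.execute(code, state=v_t)
        traj.states.append(v_t)
    return traj
```

Keeping states, rationales, and code edits in a single trajectory record mirrors how V-Thinker externalizes its reasoning in both modalities.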
3. Reinforcement Learning Framework
V-Thinker employs end-to-end RL, specifically Group Relative Policy Optimization (GRPO), to optimize the multimodal reasoning agent for both reasoning accuracy and code-driven tool usage. The RL objective incorporates auxiliary losses and reward components:
- Perception SFT Loss: $\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{perc}}}\sum_{t}\log \pi_\theta(y_t \mid y_{<t}, x)$
- Code-driven RL Loss: the GRPO objective $\mathcal{L}_{\mathrm{RL}}(\theta) = -\,\mathbb{E}\big[\tfrac{1}{G}\sum_{i=1}^{G}\min\big(\rho_i \hat{A}_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\big)\big]$, with $\rho_i = \pi_\theta(o_i \mid q)/\pi_{\theta_{\mathrm{old}}}(o_i \mid q)$ the importance sampling ratio and $\hat{A}_i = \big(r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})\big)/\mathrm{std}(\{r_j\}_{j=1}^{G})$ the group-normalized advantage.
- Reward Function: $R = R_{\mathrm{acc}} + \alpha R_{\mathrm{format}} + \beta R_{\mathrm{tool}}$, where $R_{\mathrm{acc}}$ is answer accuracy, $R_{\mathrm{format}}$ is code formatting correctness, $R_{\mathrm{tool}}$ is tool usage, and $\alpha$, $\beta$ are empirically set weights.
The reward only credits tool usage if the answer is correct, ensuring tool calls are both necessary and effective.
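To make the reward shaping and GRPO advantage concrete, the sketch below implements the gated reward described above and normalizes rewards within a sampled group. The coefficient values and function names are illustrative assumptions; only the structure (tool credit gated on answer correctness, group-relative normalization) follows the text.

```python
import statistics

def reward(answer_correct: bool, code_well_formed: bool, used_tools: bool,
           alpha: float = 0.5, beta: float = 0.5) -> float:
    """R = R_acc + alpha * R_format + beta * R_tool, with the tool term
    credited only when the final answer is correct (assumed weights)."""
    r_acc = 1.0 if answer_correct else 0.0
    r_format = 1.0 if code_well_formed else 0.0
    r_tool = 1.0 if (used_tools and answer_correct) else 0.0  # gated credit
    return r_acc + alpha * r_format + beta * r_tool

def group_normalized_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each rollout's reward by the
    mean and standard deviation of its sampled group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Example: a group of 4 sampled rollouts for one prompt.
group = [reward(True, True, True), reward(True, False, True),
         reward(False, True, True), reward(False, False, False)]
print(group_normalized_advantages(group))
```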
4. Data Evolution Flywheel: Scalable Interactive Reasoning Data Generation
Scaling vision-centric reasoning requires a large, diverse, and evolvable dataset. The Data Evolution Flywheel automates:
- Knowledge co-evolution: At each iteration, new combinations of knowledge points and tool capabilities are generated via feed-forward expansion functions, preventing redundancy and growing the task space hierarchically.
- Quality calibration: Checker modules validate each synthetic data point for correct answer, valid image rendering, and internal consistency of reasoning states. Erroneous samples are repaired or reconstructed in an iterative loop.
- Difficulty stratification: Both parallel and sequential task expansions allow for increasingly complex, multi-step reasoning episodes, validated until convergence.
After multiple iterations, the knowledge system and tool set notably expand (e.g., 50-fold after five cycles), supporting robust generalization and transfer across domains.
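A minimal sketch of one flywheel iteration, assuming hypothetical `synthesize_task`, `check`, and `repair` components: knowledge concepts are paired with tool combinations, candidate tasks are synthesized, and only samples passing quality calibration (answer, rendering, consistency) are kept, with a bounded repair loop for failures.

```python
import itertools

def flywheel_iteration(knowledge_points, tools, synthesize_task, check, repair,
                       max_repair_rounds=2):
    """One co-evolution cycle: pair knowledge concepts with tool
    combinations, synthesize candidate tasks, and keep only samples
    that pass quality calibration."""
    accepted = []
    # Parallel expansion: every (concept, tool-pair) combination spawns a task.
    for concept, tool_combo in itertools.product(knowledge_points,
                                                 itertools.combinations(tools, 2)):
        sample = synthesize_task(concept, tool_combo)
        for _ in range(max_repair_rounds + 1):
            ok, issues = check(sample)   # answer / rendering / consistency checks
            if ok:
                accepted.append(sample)
                break
            sample = repair(sample, issues)  # rule-based or LLM-in-the-loop repair
    return accepted
```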
5. VTBench and Empirical Benchmarks
VTBench is a comprehensive, expert-curated benchmark targeting three essential axes of vision-centric interactive reasoning:
- Perception: Localizing and identifying visual elements (e.g., point coordinates).
- Instruction-Guided Interaction: Following directions to manipulate image components via code-driven tools (e.g., draw, label, highlight).
- Interactive Reasoning: Addressing complex, multi-stage tasks, such as adding geometrical constructions or sequential annotations to support inference.
All tasks are annotated with interaction graphs and perceptual coordinates, verified by a panel of experts. Model performance on VTBench is evaluated by both execution-based (code correctness, rendered output match) and reasoning-based (LLM-graded answer correctness) metrics.
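A hedged sketch of the two metric families: an execution-based check that runs the predicted code and compares its rendered output to a reference, and a reasoning-based check that delegates answer grading to an LLM judge. The `render`, `images_match`, and `llm_grade` helpers are assumed interfaces, not VTBench's actual harness.

```python
def evaluate_sample(pred_code: str, pred_answer: str, ref_image, ref_answer,
                    render, images_match, llm_grade) -> dict:
    """Score one VTBench-style item on execution- and reasoning-based axes."""
    # Execution-based: does the predicted code run, and does its rendered
    # output match the reference rendering?
    try:
        rendered = render(pred_code)
        exec_ok = True
        render_match = images_match(rendered, ref_image)
    except Exception:
        exec_ok, render_match = False, False
    # Reasoning-based: an LLM judge grades the final answer against the reference.
    answer_ok = llm_grade(pred_answer, ref_answer)
    return {"code_executes": exec_ok,
            "render_matches": render_match,
            "answer_correct": answer_ok}
```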
Empirical results demonstrate that V-Thinker surpasses both open- and closed-source strong baselines (e.g., GPT-4o, InternVL3, Qwen2.5-VL-7B) across all interactive reasoning categories, with especially large gains (+25.8% over Qwen2.5-VL-7B for instruction-guided tasks). Ablation studies confirm the necessity of both perception alignment and RL; removal of either markedly degrades performance.
6. Comparative Analysis and Theoretical Advances
V-Thinker distinguishes itself from previous approaches in several key aspects:
- Prior work either treated images as passive context with limited or pre-scripted tool actions, or coupled tool usage tightly to specific tasks, severely curtailing generalization.
- Systems relying on image-to-code conversions (e.g., DeepSketcher) introduce error and are brittle to input variations.
- Tool use in prior models is often static or only loosely constrained, lacking reward shaping that enforces both correctness and utility.
V-Thinker's code-driven manipulation at every reasoning step, supported by scalable data curation and curriculum alignment, yields trajectories where visual feedback is both transparent and causally coupled to the reasoning chain—enabling interpretability and effective exploration in RL.
The algorithmic cycle is:
- At each step $t$, the agent emits $(r_t, c_t)$, the environment executes $c_t$ to yield $v_{t+1}$, and the new triplet $(r_t, c_t, v_{t+1})$ is appended to the trajectory. At the end of the episode, rewards are assigned based on the match between the rendered trajectory and the ground truth, answer correctness, tool usage, and code quality (a sketch follows below).
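The terminal reward assignment can be sketched as below, combining the four signals named above with illustrative weights; the trajectory fields and helper functions (`images_match`, `code_lint`) are assumptions rather than the paper's exact formulation.

```python
def episode_reward(traj, ground_truth, images_match, code_lint,
                   w_traj=0.25, w_tool=0.25, w_code=0.25):
    """Assign the end-of-episode reward from: (i) agreement of the rendered
    trajectory with ground-truth states, (ii) answer correctness, (iii) tool
    usage (credited only with a correct answer), and (iv) code quality."""
    answer_ok = traj.final_answer == ground_truth.answer
    # (i) fraction of rendered states matching the reference states
    pairs = list(zip(traj.states[1:], ground_truth.states[1:]))
    traj_match = sum(images_match(a, b) for a, b in pairs) / max(len(pairs), 1)
    # (iii) tool usage only credited when the answer is correct
    tool_ok = bool(traj.code_edits) and answer_ok
    # (iv) all emitted code snippets pass a formatting/lint check
    code_ok = all(code_lint(c) for c in traj.code_edits)
    return ((1.0 if answer_ok else 0.0)
            + w_traj * traj_match
            + w_tool * float(tool_ok)
            + w_code * float(code_ok))
```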
7. Impact, Limitations, and Future Directions
V-Thinker marks a milestone in formalizing and systematizing vision-centric interactive reasoning:
- It advances the field beyond image-assisted CoT to a regime of tightly coupled logical-visual reasoning, externalizing the agent's internal state in both code and image space.
- The Data Evolution Flywheel and the training curriculum enable scaling of both data and capabilities without tight coupling to any specific tool suite or task format.
- The VTBench benchmark creates a new standard for evaluating visual reasoning systems on fidelity, transparency, and interactive grounding.
Anticipated extensions include broader unification of visual and linguistic tool use, further development of structured feedback for RL, and direct application to domains requiring transparent reasoning and verifiable visual edits (e.g., STEM education, medical diagnostics).
Remaining limitations involve further scaling to real-world, noisy image distributions, integrating more open-ended tool sets, and bridging to interactive video and 3D reasoning.
Summary Table: V-Thinker System Components
| Component | Description | Role |
|---|---|---|
| Data Evolution Flywheel | Automated dataset synthesis/calibration | Diversity, scalability, difficulty |
| Perception Alignment | Fine-tuning on element/coordinate recognition | Local visual grounding |
| Interactive Reasoning Alignment | Curriculum SFT and RL with tool use | Code-driven visual interaction |
| End-to-End RL (GRPO) | Group-based PPO with code/tool reward | Reasoning and tool policy learning |
| VTBench | Expert-verified interactive task benchmark | Evaluation and comparison |
V-Thinker and related research provide a unified framework for interactive, vision-centric multimodal reasoning, combining scalable dataset generation, curriculum learning, code-driven image manipulation, and rigorous benchmarking. By enabling LMMs to think by interacting with images, V-Thinker sets the foundation for advances in interpretable, generalizable, and human-like AI reasoning systems (Qiao et al., 6 Nov 2025).