
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning

Published 29 Dec 2025 in cs.AI | (2512.23412v1)

Abstract: Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems requiring tool invocation. Tool-integrated reasoning (TIR) agents capable of autonomous reasoning and tool invocation are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent integrating interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows manipulation of images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.

Summary

  • The paper introduces an RL-trained agentic framework that interleaves internal reasoning with external tool invocation for dynamic multimodal decision-making.
  • It employs a modified GRPO algorithm with step-wise normalization to balance performance across varied tool-call complexities.
  • Results demonstrate that even smaller models can achieve SOTA performance on cross-modal QA tasks through precise tool integration and local retrieval mechanisms.

MindWatcher: Agentic Multimodal Tool-Integrated Reasoning via Reinforcement Learning

Motivation and Introduction

MindWatcher addresses persistent limitations in existing LLM-centric Tool-Integrated Reasoning (TIR) agents, particularly their inability to autonomously invoke and effectively coordinate multiple external tools for real-world problem solving. Most prior TIR systems are constrained to text-based retrieval, lack agentic multimodal capabilities, and suffer from poor adaptability in open-domain, multi-step, cross-modal environments. MindWatcher integrates interleaved thinking and multimodal chain-of-thought (CoT) reasoning in an agentic framework, designed to flexibly alternate between internal reasoning and tool invocation, and directly manipulate visual inputs during inference.

The system abandons standard supervised fine-tuning (SFT) for agent training and instead leverages continuous RL in real and simulated environments. This approach enables MindWatcher to achieve robust autonomous planning, execution, and tool-use behaviors, surpassing parametric knowledge bottlenecks.

Architecture and Working Paradigm

The reasoning process in MindWatcher is formalized as a Markov decision process (MDP), with agent actions consisting of unified thought and tool_call segments. The model serializes these segments within an autoregressive loop (<think>…</think>, <tool_call>…</tool_call>). The action space combines both cognitive reasoning and physical tool executions, enabling complex multimodal CoT trajectories and precise visual operations during inference (Figure 1).

Figure 1: Paradigm: MindWatcher alternates between reasoning and tool invocation within multimodal CoT, guided by RL, leveraging a local high-quality retrieval corpus.

The agent processes each input by iteratively planning, triggering relevant tool calls, and updating its internal state with the resultant observations. This interleaving supports highly granular perception, such as region-level visual cropping, targeted multimodal retrieval, and adaptive environment interaction.
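A minimal sketch of this interleaved loop is shown below: the model emits a reasoning segment, the runtime checks for a serialized tool call, executes it, and appends the observation before the next thinking step. The model API, tool-call serialization, and registry here are illustrative assumptions, not MindWatcher's actual interface.

```python
# Hedged sketch of an interleaved think / tool-call rollout loop.
# `model` and `tools` are hypothetical objects standing in for the real agent runtime.
import re

def rollout(model, tools, question, max_steps=8):
    """Alternate between model reasoning and tool execution until a final answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        completion = model.generate(messages)           # hypothetical model call
        messages.append({"role": "assistant", "content": completion})

        # Detect a serialized tool call emitted inside the reasoning trace.
        match = re.search(r"<tool_call>(.*?)</tool_call>", completion, re.S)
        if match is None:
            return completion                           # no tool call: treat as final answer

        # Assumed serialization "tool_name: arguments", e.g. "crop: box=[10,20,80,90]".
        tool_name, arg = match.group(1).split(":", 1)
        observation = tools[tool_name.strip()](arg.strip())

        # Feed the observation back so the next thinking step can use it.
        messages.append({"role": "tool", "content": str(observation)})
    return messages[-1]["content"]
```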

RL Training: Algorithms and Reward Design

MindWatcher is trained exclusively via RL, employing a modified Group Relative Policy Optimization (GRPO) algorithm. Standard GRPO is extended with step-wise normalization to provide balanced optimization across both short and long episodes in the reasoning trajectory. Two normalization mechanisms are utilized (a code sketch follows the list):

  • Action-Step Normalization: Each trajectory, irrespective of its length or tool-call complexity, receives equal weight.
  • Token-Length Normalization: Loss is averaged per action segment, preventing dominance by lengthy tool-call episodes.
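The sketch below shows one way these two normalizations could enter a GRPO-style policy loss; the tensor layout, variable names, and weighting are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of step-wise normalization in a GRPO-style loss.
import torch

def normalized_policy_loss(logprobs_per_step, advantages):
    """
    logprobs_per_step: list (one entry per trajectory) of lists of 1-D tensors,
                       each tensor holding the token log-probs of one action step.
    advantages:        list of per-trajectory group-relative advantages (floats).
    """
    per_trajectory_losses = []
    for steps, adv in zip(logprobs_per_step, advantages):
        # Token-length normalization: average token log-probs within each action
        # segment so lengthy tool-call episodes do not dominate the gradient.
        step_terms = [segment_logprobs.mean() for segment_logprobs in steps]
        # Action-step normalization: average over action steps so every trajectory
        # contributes equally, regardless of how many tool calls it contains.
        traj_loss = -adv * torch.stack(step_terms).mean()
        per_trajectory_losses.append(traj_loss)
    return torch.stack(per_trajectory_losses).mean()
```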

A hybrid reward is used, explicitly penalizing format violations and hallucinated tool calls, and rewarding outcome accuracy as judged by model-based assessment. The reward signal combines outcome correctness, schema adherence, and tool-call reality, thus shaping agentic precision in both syntax and factuality.
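A minimal sketch of such a hybrid reward is given below; the specific weights, helper predicates, and judge call are illustrative assumptions rather than the paper's reward specification.

```python
# Hedged sketch of a hybrid reward combining format, tool-call validity, and outcome terms.
def hybrid_reward(trajectory, answer, reference, judge):
    reward = 0.0
    # Format: penalize malformed think / tool_call segments (hypothetical predicate).
    if not trajectory.schema_is_valid():
        reward -= 0.5
    # Tool-call reality: penalize calls to tools that do not exist or were never executed.
    if trajectory.has_hallucinated_tool_calls():
        reward -= 0.5
    # Outcome: model-based judgment of answer correctness against the reference.
    if judge.is_correct(answer, reference):
        reward += 1.0
    return reward
```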

Multimodal Toolbox and Local Retrieval Infrastructure

MindWatcher’s toolset spans five tool families (a dispatch sketch follows the list):

  • Region cropping/zooming (with image grounding)
  • Object grounding and visual search (via a local corpus)
  • External text retrieval (web search)
  • Webpage content extraction (structured semantic scraping)
  • Local code execution (Python sandbox)
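The dispatch sketch below maps these five tool families onto placeholder backends; every name and signature is an assumption for illustration, not MindWatcher's actual tool API.

```python
# Hedged sketch of routing parsed tool calls to backends; `backends` is a dict of
# hypothetical objects (image handle, local corpus, search engine, scraper, sandbox).
def dispatch(tool_name, args, backends):
    """Route a parsed tool call to the matching backend and return its observation."""
    if tool_name == "crop":                 # region cropping / zooming with grounding
        return backends["image"].crop(args["box"])
    if tool_name == "visual_search":        # object grounding against the local corpus
        return backends["corpus"].nearest(args["region_embedding"], k=5)
    if tool_name == "web_search":           # external text retrieval
        return backends["search"].query(args["query"])
    if tool_name == "read_page":            # structured webpage extraction
        return backends["scraper"].extract(args["url"])
    if tool_name == "run_python":           # sandboxed local code execution
        return backends["sandbox"].execute(args["code"])
    raise ValueError(f"unknown tool: {tool_name}")
```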

To mitigate latency and reliability limitations of external APIs, MindWatcher incorporates a locally curated multimodal retrieval library with >300k images and 50k entities across eight major categories. Domain expert curation ensures >99% precision for downstream object and knowledge recognition.
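For intuition, a local retrieval index of this kind can be sketched as a brute-force cosine search over precomputed image embeddings; the embedding model and entity metadata layout below are assumptions, and a production index would use an approximate-nearest-neighbor backend.

```python
# Hedged sketch of a local image-retrieval index over precomputed embeddings.
import numpy as np

class LocalImageIndex:
    def __init__(self, embeddings, entities):
        # embeddings: (N, d) float array of image embeddings (N on the order of 300k)
        # entities:   list of N metadata dicts (category, entity name, ...)
        self.embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.entities = entities

    def nearest(self, query_embedding, k=5):
        # Cosine similarity against the normalized corpus, return top-k entities.
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = self.embeddings @ q
        top = np.argsort(-scores)[:k]
        return [(self.entities[i], float(scores[i])) for i in top]
```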

Data Pipeline and Benchmark Construction

Training data includes both online and offline sample generation, featuring:

  • Automated multimodal QA construction pipelines (image-text pair generation, difficulty stratification via tool-invocation complexity metrics)
  • Human-in-the-loop verification for uniqueness and temporal stability
  • Domain-specific ingestion (sports news), optimized for objective, cross-modal fact extraction

The MindWatcher-Evaluate Bench (MWE-Bench) is designed to robustly assess agentic tool-use capabilities across six categories, ensuring zero data leakage from the training corpus (Figure 2).

Figure 2: MindWatcher performance on MWE-Bench, demonstrating SOTA accuracy across multiple categories.

Experimental Findings and Analysis

Empirical evaluation reveals:

  • MindWatcher-32B attains overall SOTA on MWE-Bench (75.35), significantly outperforming closed-source commercial agents such as Gemini 2.5 Flash and GPT-5 mini, especially in vehicle, animal, plant, and person domains.
  • Small-scale distilled models (2B/3B/4B) exhibit competitive, sometimes superior, performance due to agentic tool-use, challenging the prevailing notion that only large-parameter models are effective for TIR given proper agent training.
  • Benchmark results on open-source multimodal and pure-text QA tasks (MMSearch, SimpleVQA, WebWalkerQA) further corroborate MindWatcher’s generality (Figure 3).

Figure 3: MindWatcher vs GPT-5 mini: Analysis of tool-invocation behaviors and performance decay in long-horizon reasoning.

Tool capacity is found to be a critical determinant: the choice of external search engine induces significant performance variance, often outweighing differences due to model scale or algorithmic improvements. This underscores the deeply coupled relationship between agentic capacity and tool infrastructure.

Genetic Inheritance in Agentic RL

Detailed behavioral analysis reveals that while MindWatcher’s RL enhances decision triggers and tool invocation proficiency, foundational cognitive constraints of the underlying LLM remain. There is a "Genetic Inheritance" effect: the decay slope in task accuracy with increasing reasoning steps mirrors that of the base model, illustrating that RL policy optimization is bounded by the intrinsic capabilities of the foundation model (Figure 4).

Figure 4: Tool-use behavior comparison: MindWatcher-2B vs Qwen3-VL 2B Thinking demonstrates inherited accuracy trends despite agentic training.

Figure 5: Step-wise synchronous sampling infrastructure for agentic RL enabling efficient parallel batch inference and asynchronous tool invocation.

Implications and Future Directions

MindWatcher empirically demonstrates that agentic RL with curated multimodal tool platforms enables smaller models to mitigate parametric knowledge gaps and match or exceed state-of-the-art performance on demanding cross-modal QA and reasoning tasks. However, benchmark validity is increasingly coupled with the world-knowledge distribution and tool ecosystem, complicating fine-grained assessment of intrinsic agentic reasoning capabilities.

The persistence of inherited performance limits across RL and SFT regimes highlights the need for future research into foundation model architecture, memory augmentation, and more robust RL methods to transcend current cognitive ceilings. Continuous evolution of local, high-precision retrieval infrastructures and further integration of agentic multimodal primitives will be pivotal for scaling practical TIR agents.

Conclusion

MindWatcher introduces an RL-trained agentic framework for multimodal tool-integrated reasoning, demonstrating SOTA performance through dynamic planning and interleaved tool-use paradigms. Analysis uncovers both practical advances and inherent limitations associated with foundation model constraints and environmental coupling, informing subsequent research in autonomous agentic intelligence and multimodal decision making (2512.23412).
