- The paper introduces an RL-trained agentic framework that interleaves internal reasoning with external tool invocation for dynamic multimodal decision-making.
- It employs a modified GRPO algorithm with step-wise normalization to balance performance across varied tool-call complexities.
- Results demonstrate that even smaller models can achieve SOTA performance on cross-modal QA tasks through precise tool integration and local retrieval mechanisms.
Motivation and Introduction
MindWatcher addresses persistent limitations in existing LLM-centric Tool-Integrated Reasoning (TIR) agents, particularly their inability to autonomously invoke and effectively coordinate multiple external tools for real-world problem solving. Most prior TIR systems are constrained to text-based retrieval, lack agentic multimodal capabilities, and suffer from poor adaptability in open-domain, multi-step, cross-modal environments. MindWatcher integrates interleaved thinking and multimodal chain-of-thought (CoT) reasoning in an agentic framework, designed to flexibly alternate between internal reasoning and tool invocation, and directly manipulate visual inputs during inference.
The system forgoes standard supervised fine-tuning (SFT) for agent training and instead leverages continuous RL in real and simulated environments. This approach enables MindWatcher to achieve robust autonomous planning, execution, and tool-use behaviors, moving beyond the bottleneck of purely parametric knowledge.
Architecture and Working Paradigm
The reasoning process in MindWatcher is formalized as a Markov decision process (MDP), with agent actions consisting of unified thought and tool_call segments. The model serializes these segments within an autoregressive loop, emitting thought spans and <tool_call>…</tool_call> blocks in sequence. The action space combines both cognitive reasoning and physical tool executions, enabling complex multimodal CoT trajectories and precise visual operations during inference.
Figure 1: Paradigm: MindWatcher alternates between reasoning and tool invocation within multimodal CoT, guided by RL, leveraging a local high-quality retrieval corpus.
The agent processes each input by iteratively planning, triggering relevant tool calls, and updating its internal state with the resultant observations. This interleaving supports highly granular perception, such as region-level visual cropping, targeted multimodal retrieval, and adaptive environment interaction.
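A minimal sketch of this interleaved loop is given below. The interfaces (`policy.generate`, the `tools` registry, a JSON tool-call payload) are hypothetical assumptions for illustration; the actual MindWatcher serialization format and APIs are not reproduced here.

```python
import json
import re
from dataclasses import dataclass, field

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.S)

@dataclass
class AgentState:
    # Serialized trajectory: task, thought/tool_call segments, and tool observations.
    context: list = field(default_factory=list)

def parse_tool_call(segment: str):
    """Extract a JSON tool call from a <tool_call>...</tool_call> block, if present (assumed format)."""
    match = TOOL_CALL_RE.search(segment)
    return json.loads(match.group(1)) if match else None

def run_episode(policy, tools, task: str, max_steps: int = 16):
    """Interleave model reasoning with tool execution until a final answer is produced."""
    state = AgentState(context=[task])
    for _ in range(max_steps):
        segment = policy.generate(state.context)            # hypothetical: emits a thought + optional <tool_call>
        state.context.append(segment)
        call = parse_tool_call(segment)
        if call is None:                                    # no tool call -> treat the segment as the final answer
            return segment
        observation = tools[call["name"]](**call["args"])   # e.g. crop a region, query the local corpus, run code
        state.context.append(str(observation))              # the observation conditions the next reasoning step
    return None                                             # step budget exhausted
```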
RL Training: Algorithms and Reward Design
MindWatcher is trained exclusively via RL, employing a modified Group Relative Policy Optimization (GRPO) algorithm. Standard GRPO is extended with step-wise normalization to provide balanced optimization across both short and long episodes in the reasoning trajectory. Two normalization mechanisms are utilized (a minimal sketch follows the list):
- Action-Step Normalization: Each trajectory, irrespective of its length or tool-call complexity, receives equal weight.
- Token-Length Normalization: Loss is averaged per action segment, preventing dominance by lengthy tool-call episodes.
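The sketch below shows one way the two normalizations could combine into a per-trajectory loss. The full GRPO objective (group-relative advantage computation, clipping, KL regularization) is omitted for brevity, and the weighting details are assumptions rather than the paper's exact formulation.

```python
import torch

def grpo_step_normalized_loss(logps, advantages, segment_ids):
    """
    Sketch of a GRPO-style loss for one sampled trajectory with the two normalizations above.
    logps:       (T,) per-token log-probabilities under the current policy
    advantages:  (T,) group-relative advantages broadcast to tokens
    segment_ids: (T,) index of the action segment (thought or tool_call) each token belongs to
    Assumption: clipping and KL terms from full GRPO are left out.
    """
    segment_losses = []
    for seg in segment_ids.unique():
        mask = segment_ids == seg
        # Token-length normalization: average within each action segment,
        # so long tool-call segments do not dominate short thoughts.
        segment_losses.append(-(logps[mask] * advantages[mask]).mean())
    # Action-step normalization: average over segments, so every trajectory
    # contributes equally regardless of how many tool calls it contains.
    return torch.stack(segment_losses).mean()
```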
A hybrid reward is used, explicitly penalizing format violations and hallucinated tool calls, and rewarding outcome accuracy as judged by model-based assessment. The reward signal combines outcome correctness, schema adherence, and tool-call reality, thus shaping agentic precision in both syntax and factuality.
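A sketch of how such a hybrid reward could be composed follows; the weights, the exact penalty terms, and the model-based judge interface are illustrative assumptions, not the paper's recipe.

```python
def hybrid_reward(trajectory, final_answer, reference, judge,
                  w_outcome=1.0, w_format=0.5, w_halluc=0.5):
    """Combine outcome correctness, schema adherence, and tool-call reality into one scalar.
    'judge' is a hypothetical model-based scorer; the weights are illustrative assumptions."""
    outcome = judge.score(final_answer, reference)  # e.g. 1.0 if judged correct, else 0.0
    format_penalty = sum(1 for seg in trajectory if not seg.well_formed)            # malformed segments
    halluc_penalty = sum(1 for seg in trajectory
                         if seg.is_tool_call and not seg.tool_exists)               # calls to nonexistent tools
    return w_outcome * outcome - w_format * format_penalty - w_halluc * halluc_penalty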
MindWatcher’s toolset spans five modalities (an illustrative dispatch sketch follows the list):
- Region cropping/zooming (with image grounding)
- Object grounding and visual search (via a local corpus)
- External text retrieval (web search)
- Webpage content extraction (structured semantic scraping)
- Local code execution (Python sandbox)
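The five tools can be viewed as a single dispatch table keyed by tool name, which is what the `tools` argument of the earlier loop sketch would receive. The signatures and names below are illustrative assumptions, not MindWatcher's actual API, and the bodies are placeholders.

```python
# Illustrative tool registry; bodies are placeholders, not MindWatcher's implementation.
def crop_and_zoom(image, box):            # region cropping/zooming with image grounding
    x0, y0, x1, y1 = box
    return image.crop((x0, y0, x1, y1))   # PIL-style crop, assumed for illustration

def visual_search(query, corpus):         # object grounding / visual search over the local corpus
    return corpus.nearest(query, k=5)

def web_search(query):                    # external text retrieval
    raise NotImplementedError("backed by an external search API")

def extract_webpage(url):                 # structured semantic scraping
    raise NotImplementedError("backed by a structured scraper")

def run_python(code):                     # local code execution
    raise NotImplementedError("executed inside an isolated sandbox")

TOOLS = {
    "crop": crop_and_zoom,
    "visual_search": visual_search,
    "web_search": web_search,
    "extract_webpage": extract_webpage,
    "run_python": run_python,
}
```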
To mitigate latency and reliability limitations of external APIs, MindWatcher incorporates a locally curated multimodal retrieval library with >300k images and 50k entities across eight major categories. Domain expert curation ensures >99% precision for downstream object and knowledge recognition.
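One way such a local library could be queried is sketched below, assuming an embedding-based nearest-neighbor index; the actual retrieval backend is not specified here.

```python
import numpy as np

class LocalMultimodalCorpus:
    """Toy embedding index over curated (image embedding, entity) pairs; purely illustrative."""
    def __init__(self, embeddings: np.ndarray, entities: list[str]):
        self.embeddings = embeddings      # assumed L2-normalized, shape (N, D)
        self.entities = entities          # entity label per image

    def nearest(self, query: np.ndarray, k: int = 5):
        """Return the top-k (entity, similarity) pairs for an L2-normalized query embedding."""
        scores = self.embeddings @ query  # cosine similarity for normalized vectors
        top = np.argsort(-scores)[:k]
        return [(self.entities[i], float(scores[i])) for i in top]
```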
Data Pipeline and Benchmark Construction
Training data includes both online and offline sample generation, featuring:
- Automated multimodal QA construction pipelines (image-text pair generation, difficulty stratification via tool-invocation complexity metrics; see the sketch after this list)
- Human-in-the-loop verification for uniqueness and temporal stability
- Domain-specific ingestion (sports news), optimized for objective, cross-modal fact extraction
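A hypothetical illustration of difficulty stratification by tool-invocation complexity; the thresholds, tier names, and sample schema are invented for the sketch.

```python
def difficulty_tier(sample: dict) -> str:
    """Bucket a QA sample by how many (and which) tool calls its reference trajectory needs.
    The thresholds and tier names are illustrative assumptions."""
    calls = sample["reference_tool_calls"]     # e.g. ["crop", "visual_search", "web_search"]
    n = len(calls)
    multimodal = any(c in {"crop", "visual_search"} for c in calls)
    if n <= 1:
        return "easy"
    if n <= 3 and not multimodal:
        return "medium"
    return "hard"                              # many steps or cross-modal tool use
```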
MindWatcher-Evaluate Benchmark (MWE-Bench) is designed to robustly assess agentic tool-use capabilities across six categories, ensuring zero data leakage from the training corpus.
Figure 2: MindWatcher performance on MWE-Bench, demonstrating SOTA accuracy across multiple categories.
Experimental Findings and Analysis
Empirical evaluation reveals:
- MindWatcher-32B attains overall SOTA on MWE-Bench (75.35), significantly outperforming closed-source commercial agents such as Gemini 2.5 Flash and GPT-5 mini, especially in vehicle, animal, plant, and person domains.
- Small-scale distilled models (2B/3B/4B) exhibit competitive, and sometimes superior, performance thanks to agentic tool use, challenging the prevailing notion that effective TIR requires large-parameter models when proper agent training is applied.
- Benchmark results on open-source multimodal and pure-text QA tasks (MMSearch, SimpleVQA, WebWalkerQA) further corroborate MindWatcher’s generality.

Figure 3: MindWatcher vs GPT-5 mini: Analysis of tool-invocation behaviors and performance decay in long-horizon reasoning.
Tool capacity is found to be a critical determinant: the choice of external search engine induces significant performance variance, often outweighing differences due to model scale or algorithmic improvements. This underscores the deeply coupled relationship between agentic capacity and tool infrastructure.
Genetic Inheritance in Agentic RL
Detailed behavioral analysis reveals that while MindWatcher’s RL enhances decision triggers and tool invocation proficiency, foundational cognitive constraints of the underlying LLM remain. There is a "Genetic Inheritance" effect: the decay slope in task accuracy with increasing reasoning steps mirrors that of the base model, illustrating that RL policy optimization is bounded by the intrinsic capabilities of the foundation model.
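One way to quantify this comparison is to fit a linear decay of accuracy against reasoning-step count for both the RL agent and its base model and compare the slopes; the linear fit is an assumption made for illustration, not the paper's analysis procedure.

```python
import numpy as np

def accuracy_decay_slope(steps, accuracies):
    """Least-squares slope of accuracy vs. number of reasoning steps.
    Similar slopes for the RL agent and its base model indicate inherited decay behavior."""
    slope, _intercept = np.polyfit(np.asarray(steps), np.asarray(accuracies), deg=1)
    return slope
```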


Figure 4: Tool-use behavior comparison: MindWatcher-2B vs Qwen3-VL 2B Thinking demonstrates inherited accuracy trends despite agentic training.
Figure 5: Step-wise synchronous sampling infrastructure for agentic RL enabling efficient parallel batch inference and asynchronous tool invocation.
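A minimal asyncio sketch of the pattern the caption describes, assuming a batched synchronous inference step followed by concurrent tool execution; the interfaces (`policy.generate_batch`, the `tools` registry, `parse_tool_call` from the earlier loop sketch) are assumptions.

```python
import asyncio

async def rollout_step(policy, tools, contexts):
    """One step for a batch of trajectories: batched (synchronous) inference,
    then tool calls executed concurrently. Interfaces are hypothetical."""
    segments = policy.generate_batch(contexts)        # single batched forward pass over all trajectories

    async def maybe_call(segment):
        call = parse_tool_call(segment)               # as defined in the earlier loop sketch
        if call is None:
            return None
        # Run the (blocking) tool off the event loop so slow tools don't stall other trajectories.
        return await asyncio.to_thread(tools[call["name"]], **call["args"])

    observations = await asyncio.gather(*(maybe_call(s) for s in segments))
    return segments, observations
```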
Implications and Future Directions
MindWatcher empirically demonstrates that agentic RL with curated multimodal tool platforms enables smaller models to mitigate parametric knowledge gaps and match or exceed state-of-the-art performance on demanding cross-modal QA and reasoning tasks. However, benchmark validity is increasingly coupled with the world-knowledge distribution and tool ecosystem, complicating fine-grained assessment of intrinsic agentic reasoning capabilities.
The persistence of inherited performance limits across both RL and SFT regimes highlights the need for future research into foundation-model architecture, memory augmentation, and more robust RL methodologies to transcend current cognitive ceilings. Continued evolution of local, high-precision retrieval infrastructures and further integration of agentic multimodal primitives will be pivotal for scaling practical TIR agents.
Conclusion
MindWatcher introduces an RL-trained agentic framework for multimodal tool-integrated reasoning, demonstrating SOTA performance through dynamic planning and interleaved tool-use paradigms. Analysis uncovers both practical advances and inherent limitations associated with foundation model constraints and environmental coupling, informing subsequent research in autonomous agentic intelligence and multimodal decision making (2512.23412).