- The paper introduces VisTA, a novel reinforcement learning framework that autonomously selects effective visual tools based on task outcomes.
- It employs the GRPO algorithm to optimize tool combinations, significantly improving visual reasoning on benchmarks like ChartQA and Geometry3K.
- Results show that adaptive tool selection narrows the gap toward a pseudo-upper bound on tool-augmented performance, highlighting the RL agent's ability to generalize across diverse visual tasks.
Introduction
The "VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection" introduces an RL framework that equips visual agents with the capability to autonomously explore, select, and integrate tools from a diverse library based on their empirical performance. This framework contrasts with traditional approaches that either involve training-free prompting or extensive fine-tuning to integrate tools, both of which limit tool diversity and exploration. VisTA leverages the GRPO algorithm to enable agents to discover and refine sophisticated tool-selection strategies without explicit supervision.
Figure 1: Overview of VisTA. (Left) Our method trains an agent to autonomously discover effective combinations of visual tools without human supervision. (Right) By decoupling the agent from the reasoner, the learned policy can be seamlessly integrated with a wide range of reasoning models.
Methodology
VisTA utilizes reinforcement learning to train an autonomous agent capable of selecting the most effective tools from a library suited to solve complex visual reasoning tasks. This RL framework is designed to maximize the reasoning model's performance by adaptively selecting tool combinations based on task outcomes as feedback signals.
Figure 2: Policy Optimization. Given a user query, the agent selects tools from a pre-defined set of external tools. The tools are applied to the image, and their outputs and the query are fed to a frozen reasoner model. Both the Direct Path (query + image) and the Tool-Augmented Path (query + tools + image) are evaluated to compute a reward signal, which is used to update the agent's tool-selection policy.
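A minimal sketch of this pipeline and its outcome-based reward is shown below. The interfaces (`agent.select_tools`, `reasoner.answer`, the `tools` mapping) are hypothetical placeholders, not the paper's actual API; the reward values are one plausible instantiation of "reward improvements, penalize regressions".

```python
# Hypothetical interfaces: `agent`, `reasoner`, and `tools` are placeholders,
# not names from the paper.
def compute_reward(agent, reasoner, tools, query, image, ground_truth):
    """Reward the agent when its tool selection helps the frozen reasoner."""
    # Direct Path: the frozen reasoner answers from the query and raw image alone.
    direct_answer = reasoner.answer(query, image)
    direct_correct = (direct_answer == ground_truth)

    # Tool-Augmented Path: the agent picks tools, the tools are applied to the
    # image, and their outputs are passed to the same frozen reasoner.
    selected = agent.select_tools(query, image)           # e.g. ["ocr", "chart_parser"]
    tool_outputs = [tools[name](image) for name in selected]
    augmented_answer = reasoner.answer(query, image, tool_outputs)
    augmented_correct = (augmented_answer == ground_truth)

    # Encourage selections that improve the outcome, penalize ones that hurt it.
    if augmented_correct and not direct_correct:
        return 1.0   # tools turned a failure into a success
    if direct_correct and not augmented_correct:
        return -1.0  # tools broke an answer the reasoner already had
    return 0.0       # no change in outcome
```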
Given a visual-language query consisting of a question q and an image I, the agent observes this input as a state s = (q, I) and learns a policy π_ϕ(t | s) that dictates the selection of a sequence of tools for that particular query. The reward structure encourages tool selections that improve the reasoning model's outcome and penalizes selections that degrade it.
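The policy is optimized with GRPO. The sketch below shows a simplified group-relative update in PyTorch: it keeps the core idea of normalizing rewards within a group of samples for the same query, but omits the PPO-style clipping and KL regularization that full GRPO typically includes. `policy.sample` and `reward_fn` are assumed placeholder interfaces.

```python
import torch

def grpo_step(policy, optimizer, query_state, reward_fn, group_size=8):
    """One simplified GRPO-style update for the tool-selection policy."""
    log_probs, rewards = [], []
    for _ in range(group_size):
        # Sample a tool selection for the same query and score it via the reasoner.
        tool_ids, log_prob = policy.sample(query_state)
        log_probs.append(log_prob)
        rewards.append(reward_fn(tool_ids))

    rewards = torch.tensor(rewards, dtype=torch.float32)
    # Group-relative advantage: normalize rewards within the sampled group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Policy-gradient loss weighted by the group-relative advantages.
    loss = -(advantages * torch.stack(log_probs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```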
Experimental Results
VisTA demonstrates substantial performance improvements over baseline methods on benchmarks such as ChartQA and Geometry3K. The RL-based approach also improves generalization to out-of-distribution examples, enhancing adaptive tool utilization.
Figure 3: Comparison of ChartQA accuracy across individual tools (T0–T8), the no-tool baseline (No), our RL-based selection policy (Ours), and a pseudo-upper bound (Upper).
Figure 3 illustrates the efficacy of VisTA's selection strategy, outperforming individual tool performance and narrowing the gap towards a pseudo-upper bound.
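The summary does not spell out how the pseudo-upper bound is defined; a common choice, assumed here, is an oracle over individual tools in which a query counts as solved if at least one single tool leads the reasoner to the correct answer. Under that assumption, it could be computed as:

```python
def pseudo_upper_bound(per_tool_correct):
    """Assumed oracle definition of the pseudo-upper bound.

    per_tool_correct: list with one dict per query, mapping tool name -> bool
    (whether the reasoner answered correctly when given that tool's output).
    """
    solved = [any(results.values()) for results in per_tool_correct]
    return sum(solved) / len(solved)
```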
The VisTA framework demonstrates the potential of dynamic tool selection to adapt to the varying complexity of visual reasoning tasks.
Figure 4: Pearson correlation between tool usage frequency and individual tool performance.
Figure 5: Tool selection frequency across our RL-trained agent, QwenVL-7B, and GPT-4o. Our method strongly favors effective tools (Tools 1 and 2) and avoids less useful ones, while QwenVL-7B shows a uniform distribution and GPT-4o selects broadly without clear alignment to tool performance.
The analysis shows a strong alignment between tool-selection frequency and tool utility, indicating effective optimization by the RL agent, as depicted in Figures 4 and 5.
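For illustration, the alignment in Figure 4 can be quantified as a Pearson correlation between per-tool usage frequency and per-tool accuracy. The numbers below are toy values for the nine tools T0–T8, not the paper's measurements.

```python
import numpy as np

# Toy values for tools T0..T8 (illustrative only, not results from the paper).
usage_frequency = np.array([0.02, 0.45, 0.38, 0.03, 0.04, 0.02, 0.02, 0.02, 0.02])
tool_accuracy   = np.array([0.41, 0.68, 0.64, 0.45, 0.47, 0.40, 0.43, 0.42, 0.44])

# Pearson correlation between how often each tool is selected and how well it performs.
r = np.corrcoef(usage_frequency, tool_accuracy)[0, 1]
print(f"Pearson r between usage frequency and tool accuracy: {r:.2f}")
```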
Conclusion
The VisTA framework presents a novel approach for enhancing visual reasoning by dynamically and autonomously selecting tools through reinforcement learning. Its ability to generalize across tasks and to improve reasoning performance without retraining the underlying reasoning models positions VisTA as a promising direction for future research and applications in AI. This work has important implications for the development of adaptive, modular reasoning systems capable of handling complex, real-world visual tasks.