
Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning (2504.21561v3)

Published 30 Apr 2025 in cs.CV

Abstract: Multimodal agents, which integrate a controller (e.g., a vision LLM) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents, both supervised fine-tuning and reinforcement learning, depend on extensive human-annotated task-answer pairs and tool trajectories. However, for complex multimodal tasks, such annotations are prohibitively expensive or impractical to obtain. In this paper, we propose an iterative tool usage exploration method for multimodal agents without any pre-collected data, namely SPORT, via step-wise preference optimization to refine the trajectories of tool usage. Our method enables multimodal agents to autonomously discover effective tool usage strategies through self-exploration and optimization, eliminating the bottleneck of human annotation. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning. We first synthesize multimodal tasks using LLMs. Then, we introduce a novel trajectory exploration scheme, where step sampling and step verification are executed alternately to solve synthesized tasks. In step sampling, the agent tries different tools and obtains corresponding results. In step verification, we employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent. By interacting with real environments, the SPORT agent gradually evolves into a more refined and capable system. Evaluation on the GTA and GAIA benchmarks shows that the SPORT agent achieves 6.41% and 3.64% improvements, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.

Summary

Iterative Tool Usage Exploration for Multimodal Agents: An Expert Analysis

The paper "Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning" introduces an approach to enhancing the learning and adaptability of multimodal agents through a framework called SPORT (step-wise preference optimization to refine trajectories of tool usage). The research addresses a persistent challenge in multimodal agent development: the need for vast amounts of expert-annotated data to fine-tune agents for new environments, which is both resource-intensive and potentially biased.

Methodological Advances

The SPORT framework represents a significant methodological advance by integrating online self-exploration into the agent training process. The framework comprises four iterative components: task synthesis, step sampling, step verification, and preference tuning. This iterative cycle allows agents to autonomously generate tasks and explore potential solutions without the need for pre-collected expert data. By utilizing LLMs to synthesize tasks and deploying a step-wise verification process, the framework effectively gathers step-level preference data, which is then used to refine the agent's policy.
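The four-component cycle described above can be sketched as a self-contained loop. This is an illustrative skeleton only: all function names, data shapes, and the stub implementations are assumptions made for clarity, not the paper's actual code.

```python
# Illustrative sketch of one SPORT iteration: task synthesis -> step
# sampling -> step verification -> preference tuning. Every name and
# stub here is a placeholder standing in for the paper's components.
import random

def synthesize_tasks(n):
    """Stand-in for LLM-driven multimodal task synthesis."""
    return [f"task-{i}" for i in range(n)]

def sample_steps(task, policy, k=3):
    """Stand-in for the controller sampling k candidate tool-call steps."""
    return [f"{task}/candidate-{j}" for j in range(k)]

def verify_steps(candidates):
    """Stand-in for the AI-feedback verifier: scores candidates and
    returns a (chosen, rejected) step-level preference pair."""
    ranked = sorted(candidates, key=lambda _: random.random())
    return ranked[0], ranked[-1]

def preference_tune(policy, preference_data):
    """Stand-in for a step-wise preference update of the controller."""
    policy["updates"] += len(preference_data)
    return policy

def sport_iteration(policy, n_tasks=4):
    preference_data = []
    for task in synthesize_tasks(n_tasks):
        candidates = sample_steps(task, policy)      # step sampling
        chosen, rejected = verify_steps(candidates)  # step verification
        preference_data.append((chosen, rejected))
    return preference_tune(policy, preference_data)  # preference tuning

policy = {"updates": 0}
for _ in range(3):  # the agent refines itself round by round
    policy = sport_iteration(policy)
```

The key structural point the sketch captures is that sampling and verification alternate per step, so the preference data is collected at step granularity before each tuning pass.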

A distinctive feature of this approach is its preference optimization methodology, which differs from traditional supervised fine-tuning (SFT) and reinforcement learning (RL) by leveraging step-wise rather than trajectory-level feedback, increasing the granularity of the training signal. The SPORT framework thus stands out by optimizing agents through direct interaction with the environment, enhancing both generalization and efficiency.
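To make the step-wise preference signal concrete, the following is a minimal sketch of a DPO-style loss applied per step instead of per trajectory. It assumes the controller and a frozen reference model expose log-probabilities for each candidate step; this follows the standard direct-preference-optimization formulation, which may differ in detail from the paper's exact objective.

```python
# Minimal step-wise preference (DPO-style) loss sketch. The inputs are
# log-probabilities of a verifier-chosen and a verifier-rejected step
# under the policy and a frozen reference model (all values assumed).
import math

def step_dpo_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * implicit reward margin), per step."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# One verified step pair: the chosen tool call became relatively more
# likely under the policy than under the reference model, so the loss
# is small; a reversed margin would make it large.
loss = step_dpo_loss(-1.0, -3.0, -2.0, -2.5, beta=0.1)
```

Because the loss is computed on individual steps, a single synthesized task can yield many preference pairs, which is what makes the verifier's step-level feedback more data-efficient than a single trajectory-level reward.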

Experimental Validation

The research is validated using two benchmarks: GTA and GAIA, which are designed to test multimodal reasoning and tool usage capabilities of AI agents across a variety of contexts. The SPORT Agent demonstrated improvements of 6.41% on the GTA benchmark and 3.64% on the GAIA benchmark over existing methods, indicating a superior ability to generalize across complex tasks without human supervision. These improvements point to the efficacy of the SPORT methodology in enhancing both decision-making and tool-utilization capabilities.

Implications and Future Directions

The implications of this research are twofold. Practically, it offers a scalable solution for training multimodal agents, significantly reducing the dependency on labor-intensive expert data collection. Theoretically, it proposes a robust framework that could redefine how agents interact with and learn from their environments.

Future developments could explore the applicability of the SPORT framework across other domains where multimodal interaction is key. Additionally, refining the step-wise preference optimization could involve integrating more sophisticated AI feedback mechanisms or enhancing the scalability of the methodology to accommodate even more complex task environments.

Conclusion

This paper makes a substantial contribution to the field by proposing an efficient and scalable method for training multimodal agents. By reducing the reliance on expert data and employing a self-exploration framework, the research sets a promising trajectory for future work on autonomous agents, and the empirical results support the approach's potential to improve agent adaptability and capability.