- The paper introduces Agent-X, a new benchmark designed to evaluate deep multimodal reasoning and multi-step decision-making in vision-centric AI agents across diverse real-world environments.
- Agent-X features 828 tasks spanning six agentic environments and provides a fine-grained, step-level evaluation framework to assess reasoning quality, tool usage, and logical coherence.
- Evaluations on Agent-X reveal that state-of-the-art Large Multimodal Models currently achieve less than 50% full-chain success on multi-step vision tasks, highlighting significant limitations in sequential multimodal reasoning and tool use.
Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
The paper introduces Agent-X, a benchmark developed to evaluate the deep multimodal reasoning capabilities of vision-centric agents in real-world settings. The authors identify a gap in existing benchmarks, which largely rely on synthetic environments, single-turn queries, and limited visual modalities, and therefore fail to assess an agent's reasoning quality over multiple steps. Agent-X addresses these shortcomings with a comprehensive testbed of 828 diverse tasks spanning six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning.
The benchmark is designed to challenge agents to integrate tool usage with explicit stepwise decision-making. It features authentic visual contexts, including images, multi-image comparisons, videos, and instructional text, and requires vision-centric agents to carry out coherent, multi-step reasoning. Agent-X also introduces a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step, along with the effectiveness of tool usage throughout the task.
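To make the step-level scoring concrete, here is a minimal sketch of how per-step judgments could be aggregated into trajectory-level scores. The field names (`step_correct`, `coherent`, `tool_effective`) and the aggregation below are illustrative assumptions, not the benchmark's actual schema or metric definitions.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical judgment for one reasoning step; field names are illustrative,
# not the benchmark's actual schema.
@dataclass
class StepJudgment:
    step_correct: bool    # does the step reach a correct intermediate result?
    coherent: bool        # does the step follow logically from the preceding ones?
    tool_effective: bool  # was the chosen tool appropriate and invoked properly?

def score_trajectory(steps: list[StepJudgment]) -> dict[str, float]:
    """Aggregate step-level judgments into trajectory-level scores (assumed aggregation)."""
    return {
        "step_accuracy": mean(s.step_correct for s in steps),
        "coherence": mean(s.coherent for s in steps),
        "tool_use": mean(s.tool_effective for s in steps),
        # Full-chain success: every step must be both correct and coherent.
        "full_chain_success": float(all(s.step_correct and s.coherent for s in steps)),
    }
```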
A key finding is that even the best-performing Large Multimodal Models (LMMs), including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These results expose significant bottlenecks in current LMM reasoning and tool-use capabilities, suggesting that existing models lack the depth needed for sequential multimodal understanding in complex, agentic scenarios.
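A back-of-the-envelope calculation illustrates why full-chain success can fall below 50% even when individual steps look fairly reliable: if step errors are roughly independent, per-step accuracy compounds over the length of the chain. The numbers below are assumed for illustration and are not results from the paper.

```python
# Illustrative only: assumed values, not figures reported in the paper.
per_step_success = 0.90   # hypothetical probability a single reasoning/tool step is correct
steps_in_task = 7         # hypothetical depth of a multi-step vision task

# Assuming roughly independent step errors, success must hold at every step.
full_chain = per_step_success ** steps_in_task
print(f"Estimated full-chain success: {full_chain:.2f}")  # ~0.48, i.e. below 50%
```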
Agent-X is constructed with a semi-automated pipeline in which candidate queries are generated by LMMs and then refined by human experts, which keeps the tasks realistic and correct while allowing the benchmark to scale. The tasks are grounded in real-world contexts and avoid direct tool references or step-by-step instructions, requiring the agent to reason independently, much as it would in naturalistic human interaction. The dataset ships with executable toolchains and a diverse set of real-world tools covering perception, visual operations, math, and artistic tasks.
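As a rough sketch of what such a draft-then-review loop might look like (the function names, prompt, and control flow here are placeholders, not the authors' implementation):

```python
from typing import Callable

# Minimal sketch of a semi-automated curation loop, assuming a draft-then-review
# flow; the prompt and callables are hypothetical, not the paper's pipeline.
def curate_task(visual_context: str,
                draft_with_lmm: Callable[[str], str],
                review_by_expert: Callable[[str], str | None]) -> str | None:
    """Generate a candidate query with an LMM, then have a human expert refine it."""
    # Ask the model for a realistic query that avoids naming tools or enumerating steps.
    prompt = (
        "Write a natural, real-world question about this visual context. "
        "Do not mention specific tools or list solution steps.\n"
        f"Context: {visual_context}"
    )
    candidate = draft_with_lmm(prompt)
    # The expert may edit the query, accept it as-is, or reject it (return None).
    return review_by_expert(candidate)
```

Keeping the human expert as the final gate is what lets a pipeline like this scale query generation while still ensuring that every released task is realistic and correct.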
The comprehensive evaluation on the benchmark underscores the need for more advanced models that can handle complex reasoning and tool usage in authentic environments. These findings offer actionable guidance for future research on improving reasoning capabilities in AI, particularly for vision-centric tasks, and suggest shifting focus from improving model architectures alone to strengthening the integration and decision-making capabilities of AI agents.
In conclusion, Agent-X serves as a pivotal step towards advancing the evaluation of AI systems in multimodal reasoning contexts, shedding light on current limitations and directing future developments in AI research. The benchmark's introduction is timely, given the rapid evolution of AI capabilities and the increasing demand for robust evaluations of these systems in realistic scenarios.