- The paper introduces Agent-X, a new benchmark designed to evaluate deep multimodal reasoning and multi-step decision-making in vision-centric AI agents across diverse real-world environments.
- Agent-X features 828 tasks spanning six agentic environments and provides a fine-grained, step-level evaluation framework to assess reasoning quality, tool usage, and logical coherence.
- Evaluations on Agent-X reveal that state-of-the-art Large Multimodal Models currently achieve less than 50% full-chain success on multi-step vision tasks, highlighting significant limitations in sequential multimodal reasoning and tool use.
Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
The paper introduces Agent-X, a benchmark developed to evaluate the deep multimodal reasoning capabilities of vision-centric agents in real-world settings. The authors identify a gap in existing benchmarks, which largely rely on synthetic environments, single-turn queries, and limited visual modalities, and therefore fail to assess an agent's reasoning quality over multiple steps. Agent-X addresses these shortcomings with a comprehensive testbed of 828 diverse tasks spanning six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning.
The benchmark is designed to challenge agents to integrate tool usage with explicit stepwise decision-making. It features authentic visual contexts, including images, multi-image comparisons, videos, and instructional text, and requires vision-centric agents to carry out coherent, multi-step reasoning. Agent-X also introduces a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step, along with the effectiveness of tool usage throughout the task.
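To make the step-level scoring concrete, here is a minimal sketch of how per-step judgments could be aggregated into trajectory-level scores. The field names (`step_correct`, `coherent`, `tool_effective`) and the aggregation below are illustrative assumptions, not the benchmark's actual schema or metric definitions.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical judgment for one reasoning step; field names are illustrative,
# not the benchmark's actual schema.
@dataclass
class StepJudgment:
    step_correct: bool    # does the step reach a correct intermediate result?
    coherent: bool        # does the step follow logically from the preceding ones?
    tool_effective: bool  # was the chosen tool appropriate and invoked properly?

def score_trajectory(steps: list[StepJudgment]) -> dict[str, float]:
    """Aggregate step-level judgments into trajectory-level scores (assumed aggregation)."""
    return {
        "step_accuracy": mean(s.step_correct for s in steps),
        "coherence": mean(s.coherent for s in steps),
        "tool_use": mean(s.tool_effective for s in steps),
        # Full-chain success: every step must be both correct and coherent.
        "full_chain_success": float(all(s.step_correct and s.coherent for s in steps)),
    }
```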
A key finding is that even the best-performing Large Multimodal Models (LMMs), including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These results expose significant bottlenecks in current LMM reasoning and tool-use capabilities, suggesting that existing models lack the depth needed for sequential multimodal understanding in complex, agentic scenarios.
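A back-of-the-envelope calculation illustrates why full-chain success can fall below 50% even when individual steps look fairly reliable: if step errors are roughly independent, per-step accuracy compounds over the length of the chain. The numbers below are assumed for illustration and are not results from the paper.

```python
# Illustrative only: assumed values, not figures reported in the paper.
per_step_success = 0.90   # hypothetical probability a single reasoning/tool step is correct
steps_in_task = 7         # hypothetical depth of a multi-step vision task

# Assuming roughly independent step errors, success must hold at every step.
full_chain = per_step_success ** steps_in_task
print(f"Estimated full-chain success: {full_chain:.2f}")  # ~0.48, i.e. below 50%
```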
Agent-X is constructed with a semi-automated pipeline in which candidate queries are generated by LMMs and then refined by human experts, which keeps the tasks realistic and correct while allowing the benchmark to scale. The tasks are grounded in real-world contexts and avoid direct tool references or step-by-step instructions, requiring the agent to reason independently, much as it would in naturalistic human interaction. The dataset ships with executable toolchains and a diverse set of real-world tools covering perception, visual operations, math, and artistic tasks.
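As a rough sketch of what such a draft-then-review loop might look like (the function names, prompt, and control flow here are placeholders, not the authors' implementation):

```python
from typing import Callable

# Minimal sketch of a semi-automated curation loop, assuming a draft-then-review
# flow; the prompt and callables are hypothetical, not the paper's pipeline.
def curate_task(visual_context: str,
                draft_with_lmm: Callable[[str], str],
                review_by_expert: Callable[[str], str | None]) -> str | None:
    """Generate a candidate query with an LMM, then have a human expert refine it."""
    # Ask the model for a realistic query that avoids naming tools or enumerating steps.
    prompt = (
        "Write a natural, real-world question about this visual context. "
        "Do not mention specific tools or list solution steps.\n"
        f"Context: {visual_context}"
    )
    candidate = draft_with_lmm(prompt)
    # The expert may edit the query, accept it as-is, or reject it (return None).
    return review_by_expert(candidate)
```

Keeping the human expert as the final gate is what lets a pipeline like this scale query generation while still ensuring that every released task is realistic and correct.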
The comprehensive evaluation on the benchmark underscores the need for more advanced models that can handle complex reasoning and tool usage in authentic environments. These findings offer actionable guidance for future research on improving reasoning capabilities in AI, particularly for vision-centric tasks, and suggest shifting focus from improving model architectures alone to strengthening the integration and decision-making capabilities of AI agents.
In conclusion, Agent-X serves as a pivotal step towards advancing the evaluation of AI systems in multimodal reasoning contexts, shedding light on current limitations and directing future developments in AI research. The benchmark's introduction is timely, given the rapid evolution of AI capabilities and the increasing demand for robust evaluations of these systems in realistic scenarios.