VisualAgentBench: Multimodal Agent Benchmark
- VisualAgentBench is a comprehensive benchmark suite that evaluates large multimodal models as visual agents across interactive environments spanning embodied simulation, GUI manipulation, and visual design tasks.
- It employs a hybrid trajectory approach combining program-based solvers, LMM bootstrapping, and expert human demonstrations to capture detailed agent behaviors.
- The framework quantifies performance with metrics such as task success rate and SSIM-based rendering similarity, demonstrating gains from behavior cloning and fine-tuning on curated trajectories.
VisualAgentBench is a comprehensive evaluation framework targeting the assessment and advancement of large multimodal models (LMMs) in their role as visual foundation agents. The benchmark is conceived to address limitations in prior agentic evaluation methodologies—particularly their inability to challenge agents across realistic visual interaction domains, capture agent development via trajectory learning, and systematically quantify performance in complex, partially observed environments that resemble real-world operational contexts.
1. Definition and Scope of VisualAgentBench
VisualAgentBench is designed as a unified benchmark suite for training and evaluating LMMs as generalist visual agents. The target domains include:
- Embodied tasks: Simulated environments such as OmniGibson for household activities and complex game tasks exemplified by Minecraft.
- Graphical User Interface (GUI) tasks: Mobile app interactions (VAB-Mobile), web browsing scenarios (WebArena-Lite), and other GUI-centric agent activities.
- Visual Design tasks: CSS bug-fixing and web design tasks (VAB-CSS), where the agent must iteratively modify visual layouts and verify outcomes.
The benchmark is not restricted to static datasets but instead immerses agents in interactive environments where they must execute sequential actions, perceive dynamic visual states, and adaptively solve tasks that require intricate vision-language reasoning and decision-making (Liu et al., 12 Aug 2024).
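To make this interaction pattern concrete, the following is a minimal sketch of an agent-environment episode loop under a fixed turn budget. The interface names (`reset`, `step`, `Observation`) are illustrative assumptions and do not correspond to the actual VisualAgentBench API.

```python
# Illustrative agent-environment loop (hypothetical interface, not the actual
# VisualAgentBench API). The agent receives a visual observation each turn,
# proposes an action, and the episode ends on success or when the turn budget
# is exhausted.
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class Observation:
    screenshot: bytes   # rendered visual state (e.g., PNG bytes)
    instruction: str    # natural-language task description


class VisualEnv(Protocol):
    def reset(self, task_id: str) -> Observation: ...
    def step(self, action: str) -> tuple[Observation, bool, dict[str, Any]]: ...


def run_episode(env: VisualEnv, agent, task_id: str, max_turns: int = 30) -> bool:
    """Roll out one task; return True if the environment reports success."""
    obs = env.reset(task_id)
    for _ in range(max_turns):
        action = agent.act(obs)          # LMM maps (screenshot, instruction) -> action
        obs, done, info = env.step(action)
        if done:
            return bool(info.get("success", False))
    return False
```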
2. Hybrid Trajectory Data Construction
A distinguishing methodology in VisualAgentBench is the construction of behavioral trajectory datasets through a hybrid approach:
- Program-based Solvers: Autonomous programs (e.g., via Playwright) are used to generate ground-truth interaction sequences in rule-dominated environments. This method ensures cost efficiency and reproducibility across well-defined scenarios (see the sketch below).
- LMM Agent Bootstrapping: Strong proprietary LMMs (GPT-4, Claude, Gemini, etc.) are prompted to produce agent trajectories via chain-of-thought planning or error-correcting strategies. These trajectories reflect realistic agent behavior in domains with trial-and-error characteristics.
- Human Demonstrations: Manual data annotation by experts supplements program and LMM-generated trajectories, especially for settings where automation is infeasible due to interface ambiguities or non-deterministic task flows.
Each procedure results in high-quality trajectories that provide not just terminal task outcomes, but sequential reasoning data and error recovery, facilitating effective supervised fine-tuning ("behavior cloning") for open LMMs.
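As an illustration of the program-based solver pipeline, the sketch below records a (screenshot, action) trajectory with Playwright's Python API. The URL, selectors, and trajectory schema are hypothetical placeholders; only the pattern of pairing scripted ground-truth actions with the visual state observed before each action follows the description above.

```python
# Minimal sketch of a program-based solver that records a (screenshot, action)
# trajectory with Playwright. Target URL and selectors are hypothetical.
from playwright.sync_api import sync_playwright


def record_trajectory(url: str, actions: list[tuple[str, str]]) -> list[dict]:
    """actions: list of (action_type, css_selector), e.g. ('click', '#submit')."""
    trajectory = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        for kind, selector in actions:
            trajectory.append({
                "screenshot": page.screenshot(),            # observation before acting
                "action": {"type": kind, "target": selector},
            })
            if kind == "click":
                page.click(selector)
            elif kind == "fill":
                # placeholder value for brevity; real solvers pass task-specific data
                page.fill(selector, "example")
        browser.close()
    return trajectory
```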
3. Evaluation Protocols and Metrics
VisualAgentBench employs a success rate (SR) metric as the principal indicator of agent performance:
- Success Rate in Interactive Environments: In OmniGibson and Minecraft, success is defined as achieving all predefined task goals within a fixed turn budget.
- Rendering-based Evaluation: For visual design tasks (e.g., VAB-CSS), SR is determined using structural similarity (SSIM) between the rendered agent output and ground-truth reference images, with a threshold (e.g., SSIM > 0.9) denoting success (see the sketch after this list).
- Generalization and Adaptivity: The benchmark includes fine-tuned open models and compares them against nine proprietary LMM APIs and eight open-source contenders. Supervised fine-tuning on curated behavior trajectories consistently improves SR, with open-source agents closing much of the gap to their proprietary counterparts.
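The rendering-based success criterion can be sketched as follows, assuming scikit-image's `structural_similarity` and same-resolution renders. The 0.9 threshold mirrors the example above; the file paths and helper names are placeholders rather than the benchmark's actual evaluation code.

```python
# Sketch of rendering-based evaluation: a visual design task counts as a
# success if SSIM between the agent's render and the reference render exceeds
# a threshold (0.9, following the example in the text).
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity


def load_rgb(path: str) -> np.ndarray:
    return np.asarray(Image.open(path).convert("RGB"))


def ssim_success(pred_path: str, ref_path: str, threshold: float = 0.9) -> bool:
    # Assumes both renders share the same resolution.
    pred, ref = load_rgb(pred_path), load_rgb(ref_path)
    score = structural_similarity(pred, ref, channel_axis=-1)  # per-channel SSIM, averaged
    return score >= threshold


def success_rate(pairs: list[tuple[str, str]]) -> float:
    """pairs: (agent_render, reference_render) paths, one per test instance."""
    results = [ssim_success(p, r) for p, r in pairs]
    return sum(results) / len(results)
```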
The underlying agent-environment formulation follows a Partially Observable Markov Decision Process (POMDP), notated as $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{R})$, where the observation space $\mathcal{O}$ incorporates visual input modalities. Tables in the reference manuscript detail statistics such as the number of test instances, training trajectories, and average episode length for each environment.
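A generic POMDP formalization consistent with this description is shown below; the symbols are standard notation rather than necessarily the manuscript's exact choices.

```latex
% Task as a POMDP with visual observations (illustrative, standard notation)
\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{R}),
\qquad o_t \in \mathcal{O}\ \text{(screenshots or rendered visual states)}

% The agent acts as a policy over the interaction history, given instruction I:
\pi_\theta\!\left(a_t \mid I,\, o_1, a_1, \ldots, o_t\right)
```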
4. Comparative and Multi-Domain Assessment
VisualAgentBench is distinctive in its multi-environment scope:
| Domain | Example Environment | Metric |
|---|---|---|
| Embodied | OmniGibson, Minecraft | Task Success Rate |
| GUI | VAB-Mobile, WebArena-Lite | Element/Action Match |
| Visual Design | VAB-CSS | SSIM of Render Output |
Evaluation is conducted via both direct interaction (agent-environment episodes) and trajectory replay. The benchmark suite includes both evaluation (test) and fine-tuning trajectories. It provides public code, datasets, and trained models to ensure reproducibility and facilitate further model improvement.
5. Behavior Cloning and Model Improvement
Through behavior cloning, that is, supervised fine-tuning on trajectory data, VisualAgentBench demonstrates measurable agent skill improvements (a minimal training sketch follows this list):
- Performance gains: Fine-tuned open-source models exhibit marked increases in SR compared to zero-shot or prompt-based settings.
- Error recovery sequencing: Training on sequences that include agent error and subsequent correction makes LMM agents more robust and less likely to repeat inefficient actions.
- Model capability gap: Proprietary models achieve moderate success with advanced prompting alone; however, open models can achieve comparable performance when augmented with sufficient trajectory-driven fine-tuning.
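A minimal, text-only sketch of the behavior-cloning objective: standard supervised fine-tuning that masks prompt tokens out of the loss so only the demonstrated action tokens are optimized. It assumes a Hugging Face causal LM interface with a placeholder backbone; the actual VAB training recipe is multimodal, and image inputs are omitted here.

```python
# Behavior cloning as supervised fine-tuning: maximize the likelihood of the
# demonstrated action tokens given the observation prompt, masking the prompt
# out of the loss. Text-only sketch with a placeholder backbone.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder backbone for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def bc_loss(prompt: str, target_action: str) -> torch.Tensor:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target_action, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore prompt tokens in the loss
    out = model(input_ids=input_ids, labels=labels)
    return out.loss                           # cross-entropy over action tokens only
```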
6. Technical Specification and Resources
The benchmark grounds its agent formulation in rigorous mathematical notation: tasks are formalized as POMDPs, and evaluation metrics are defined by explicit formulas (e.g., for SSIM, reward accumulation, and action-sequence comparison).
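For reference, the standard SSIM definition typically used for such rendering comparisons is:

```latex
% Standard SSIM between image windows x and y; c_1, c_2 are small constants
% that stabilize the division.
\mathrm{SSIM}(x, y) =
\frac{(2\mu_x \mu_y + c_1)\,(2\sigma_{xy} + c_2)}
     {(\mu_x^2 + \mu_y^2 + c_1)\,(\sigma_x^2 + \sigma_y^2 + c_2)}
```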
Substantial resources are provided for the community:
- Codebase and dataset: Available on GitHub at https://github.com/THUDM/VisualAgentBench, including training data, test cases, and selected open models.
- Documentation: Standardized templates and instructions ensure ease of integration for further LMM agent experimentation.
7. Future Directions and Impact
VisualAgentBench is designed as a foundation for continuous improvement:
- Reinforcement learning integration: Future work may incorporate RL-based dynamic policy adaptation and longer-horizon planning across environments.
- Expanded diversity: Adding new environments and task types will broaden evaluation to a wider range of complex agent behaviors.
- Agent development scaffold: By establishing a comprehensive training and evaluation framework, VisualAgentBench guides research toward agents capable of real-world deployment in increasingly varied operational settings.
VisualAgentBench establishes a principled, scalable, and reproducible standard for visual foundation agent evaluation. Through its hybrid trajectory methodology, interactive environment scope, and transparent metrics, it enables systematic assessment, targeted agent improvement, and rigorous benchmarking for multimodal intelligent agents in practical, vision-centric applications (Liu et al., 12 Aug 2024).