MMBench-GUI: Cross-Platform GUI Automation Benchmark
- MMBench-GUI is a hierarchical, cross-platform benchmark designed to evaluate GUI automation agents through multi-level challenges in visual grounding, task planning, and collaboration.
- It introduces the novel Efficiency–Quality Area (EQA) metric that jointly rewards task success while penalizing redundant actions to highlight inefficiencies.
- The benchmark tests agents across Windows, macOS, Linux, Android, iOS, and Web, promoting modular architectures and advanced cross-app automation strategies.
MMBench-GUI is a hierarchical, cross-platform benchmark for rigorous evaluation of GUI automation agents on Windows, macOS, Linux, Android, iOS, and Web environments. It is constructed to systematically probe essential agent competencies, ranging from screenshot comprehension to complex multi-application orchestration, and introduces the Efficiency–Quality Area (EQA) metric for assessing the efficiency of online task execution. MMBench-GUI reveals that accurate visual grounding, robust task planning, and cross-platform generalization—with explicit focus on task efficiency—are critical for scalable and practical GUI automation (Wang et al., 25 Jul 2025).
1. Hierarchical Multi-Platform Evaluation Framework
MMBench-GUI is structured into four increasing levels of challenge, each targeting distinct aspects of GUI agent intelligence and robustness:
| Level | Evaluated Capability | Example Requirement |
|---|---|---|
| 1. GUI Content Understanding | Visual and textual information extraction and reasoning | Multiple-choice Q&A about UI screenshots |
| 2. Element Grounding | Precise spatial localization of interactive elements | Returning (x, y) within an annotated bounding box |
| 3. Task Automation | End-to-end intra-app planning and adaptation | Multi-step completion (e.g., navigation, data entry) |
| 4. Task Collaboration | Cross-application multi-stage workflows | Transferring information between apps, context switching |
Lower levels isolate key agent skills, while higher levels compound them. At Level 1, the agent must answer multiple-choice questions about screenshot content; Level 2 demands generation of coordinates corresponding to targeted UI regions; Level 3 tests iterative perception-action cycles inside a single app; Level 4 escalates requirements to cross-app coordination and interdependent subtasks. This explicit stratification exposes bottlenecks in perception, grounding, reasoning, and planning, both in isolation and in combination.
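As a concrete illustration of how the four levels might be represented in an evaluation harness, the sketch below defines a minimal task record. The `TaskLevel` enum, field names, and default step budget are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class TaskLevel(IntEnum):
    # Hypothetical enum mirroring the four MMBench-GUI levels.
    CONTENT_UNDERSTANDING = 1   # multiple-choice Q&A over a screenshot
    ELEMENT_GROUNDING = 2       # predict (x, y) inside an annotated box
    TASK_AUTOMATION = 3         # multi-step, single-app execution
    TASK_COLLABORATION = 4      # cross-app, multi-stage workflow


@dataclass
class GuiTask:
    """Illustrative task record; names are assumptions, not the released format."""
    task_id: str
    level: TaskLevel
    platform: str                                  # e.g., "Windows", "Android", "Web"
    instruction: str                               # natural-language goal for the agent
    max_steps: int = 30                            # per-task step budget (assumed value)
    gold_answer: str | None = None                 # Level 1: correct option
    gold_bbox: tuple[float, float, float, float] | None = None  # Level 2: (x1, y1, x2, y2)
```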
The benchmark is implemented for six platforms—Windows, macOS, Linux, Android, iOS, and Web—enabling assessment of cross-environment generalization.
2. Efficiency–Quality Area (EQA) Metric
Recognizing that previous benchmarks emphasize accuracy at the expense of execution efficiency, MMBench-GUI introduces the Efficiency–Quality Area (EQA) metric:
- For ordered tasks $i = 1, \dots, N$, each task is associated with a binary success $s_i \in \{0, 1\}$ and a step cost $c_i$ (number of actions).
- The cumulative cost $C_k = \sum_{i \le k} c_i$ and cumulative success $S_k = \sum_{i \le k} s_i$ are computed.
- Normalizing by the total allowed steps $C_{\max}$ produces a cumulative time axis $t_k = C_k / C_{\max}$.
- Instantaneous recall is defined as $R(t_k) = S_k / N$ for normalized time $t_k \in [0, 1]$.
- EQA is the area under the recall curve, $\mathrm{EQA} = \int_0^1 R(t)\,dt$: a higher area indicates more tasks completed with fewer steps.
This metric jointly rewards task completion (quality) and penalizes excessive, redundant actions (efficiency), surfacing execution wastefulness not diagnosable via success rate alone. Empirically, all evaluated models complete many tasks with substantial inefficiency, e.g., redundant interface traversals and delayed stopping, resulting in a nontrivial EQA gap (Wang et al., 25 Jul 2025).
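A minimal sketch of how an EQA-style area could be computed from per-task outcomes, following the step-wise definitions above. The function name, argument layout, and the trailing usage example are illustrative assumptions and do not reproduce the released evaluation code.

```python
def eqa_score(successes: list[int], costs: list[int], max_total_steps: int) -> float:
    """Area under the recall-vs-normalized-cost curve (step-wise integral).

    successes[i] is 1 if task i succeeded, else 0; costs[i] is the number of
    actions spent on task i; max_total_steps is the total step budget used to
    normalize the time axis. Illustrative sketch, not the official scorer.
    """
    n_tasks = len(successes)
    if n_tasks == 0:
        return 0.0
    area, recall, prev_t = 0.0, 0.0, 0.0
    cum_cost, cum_success = 0, 0
    for s, c in zip(successes, costs):
        cum_cost += c
        cum_success += s
        t = min(cum_cost / max_total_steps, 1.0)   # normalized time in [0, 1]
        area += recall * (t - prev_t)              # recall held constant until this point
        recall = cum_success / n_tasks             # recall jumps after the task finishes
        prev_t = t
    area += recall * (1.0 - prev_t)                # extend final recall to t = 1
    return area


# Example: three tasks, two solved; the failed second task wasted many steps,
# which lowers the area even though the success rate is unchanged.
print(eqa_score([1, 0, 1], [5, 20, 8], max_total_steps=60))
```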
3. Visual Grounding and Modular Frameworks
Precise visual grounding—the ability to localize and identify widgets or targets from natural language or visual hints—is shown to be a primary determinant of overall agent performance. In Level 2, agents are required to output coordinates that must fall within gold-standard bounding boxes. Errors at this stage propagate into higher-level planning, often producing incorrect actions even when the agent's subsequent reasoning is sound.
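A Level 2 prediction can be scored with a simple point-in-box check, as in the sketch below; the function name and the (x1, y1, x2, y2) box convention are assumptions for illustration, not the benchmark's documented format.

```python
def grounding_hit(pred_xy: tuple[float, float],
                  gold_bbox: tuple[float, float, float, float]) -> bool:
    """Return True if the predicted click point lies inside the gold bounding box.

    pred_xy is the agent's (x, y) output; gold_bbox is (x1, y1, x2, y2) in the
    same coordinate frame (e.g., screenshot pixels). Conventions are assumed.
    """
    x, y = pred_xy
    x1, y1, x2, y2 = gold_bbox
    return x1 <= x <= x2 and y1 <= y <= y2
```

Aggregating such hits over all annotated targets would then give a per-platform grounding accuracy of the kind reported at this level.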
Empirical analysis demonstrates that state-of-the-art LLM-based planners are unable to compensate for grounding errors. Consequently, the authors emphasize modular agent architectures, partitioning perception (grounding) from action selection (planning), and advocate dedicated grounding modules as critical components for scalable, cross-app, cross-platform automation.
4. Task Planning, Memory, and Action Space
Robust task automation, particularly at Levels 3 and 4, requires agents to exhibit:
- Long-context memory to track progress and UI state over long sequences,
- A broad, hierarchical action space (supporting clicks, typing, dragging, window switching, etc.),
- Long-term reasoning to handle deferred goals and adapt under partial observation.
Cross-platform generalization is explicitly tested; layout variation and different input paradigms across operating systems directly challenge agent robustness. Agents with insufficient action spaces or myopic memory consistently underperform, especially in complex workflows requiring context handoff or delayed reward.
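One way to make the "broad and hierarchical action space" requirement concrete is to enumerate an action vocabulary, as in the sketch below. The action names and structure are hypothetical and do not reflect MMBench-GUI's actual action definitions.

```python
from dataclasses import dataclass
from typing import Literal, Union

# Hypothetical action vocabulary for a GUI agent; MMBench-GUI's real action
# space may differ in names, arguments, and granularity.

@dataclass
class Click:
    x: float
    y: float
    button: Literal["left", "right"] = "left"

@dataclass
class TypeText:
    text: str

@dataclass
class Drag:
    start: tuple[float, float]
    end: tuple[float, float]

@dataclass
class SwitchWindow:
    target_app: str                     # cross-app handoff needed at Level 4

@dataclass
class Stop:
    reason: str = "goal reached"        # explicit termination enables early stopping

Action = Union[Click, TypeText, Drag, SwitchWindow, Stop]
```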
5. Task Efficiency and Early Stopping
Current models exhibit pronounced inefficiency: they tend to perform excessive, often redundant, actions even after “solving” a given problem. The primary causes identified are imprecise element localization, incomplete modeling of the environment’s transition logic, and insufficient exploitation of stop criteria. The EQA metric captures these inefficiencies and provides a differentiated signal from task success alone.
Potential remedies, already identified in the MMBench-GUI design, are: improved precision in localization (better grounding), integration of early stopping heuristics, and breakthroughs in sequence-level planning. Modular frameworks with explicit perception–acting separation are highlighted as a promising direction.
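One simple form such an early-stopping heuristic could take is a check that ends an episode once the goal predicate holds or the UI has stopped changing. The predicate, hashing scheme, and patience parameter below are illustrative assumptions, not part of the benchmark.

```python
def should_stop(goal_satisfied: bool,
                recent_state_hashes: list[str],
                patience: int = 3) -> bool:
    """Heuristic early-stopping rule (illustrative sketch).

    Stop when the task's goal predicate is already satisfied, or when the last
    `patience` observations hash to the same value, i.e., the agent's actions
    are no longer changing the UI state.
    """
    if goal_satisfied:
        return True
    if len(recent_state_hashes) >= patience:
        tail = recent_state_hashes[-patience:]
        if len(set(tail)) == 1:
            return True
    return False
```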
6. Public Resources and Implementation
To encourage replicable and extensible research, MMBench-GUI’s full codebase, data, and execution environments are released as open-source artifacts at https://github.com/open-compass/MMBench-GUI (Wang et al., 25 Jul 2025). The codebase provides evaluation scripts, platform-specific setup instructions, representative tasks for each level and platform, and parsers for model outputs. This facilitates unified benchmarking and ablation studies across agent architectures and supports standardized head-to-head agent comparison.
7. Impact and Future Directions
MMBench-GUI establishes a new standard for systematic, multi-platform agent evaluation. By combining fine-grained hierarchical skills, the EQA metric, and coverage across real-world operating environments, it exposes critical gaps in current agent abilities—particularly in visuo-spatial grounding and efficient sequence planning. The benchmark’s findings substantiate the central role of modular, grounding-aware designs, and motivate further attention to efficiency, memory, and adaptive cross-platform behavior. Its public release is intended to catalyze both academic and applied progress toward practical, scalable GUI automation agents.
The framework’s explicit focus on both task efficiency and collaborative, cross-app scenarios creates a substantive foundation for future work in the development and evaluation of general-purpose GUI agents (Wang et al., 25 Jul 2025).