
MAS-Bench: Mobile GUI-Shortcut Benchmark

Updated 15 September 2025
  • MAS-Bench is a unified benchmark that evaluates hybrid GUI and shortcut agents in mobile app environments.
  • It combines flexible GUI interactions with intelligent shortcut generation to improve task efficiency and success rates.
  • The benchmark standardizes evaluation metrics across complex tasks, facilitating insights into agent performance and scalability.

MAS-Bench denotes a unified benchmark designed to evaluate GUI-shortcut hybrid agents, with particular attention to the mobile domain. This framework systematically assesses agent efficiency, robustness, and shortcut-generation capabilities across complex real-world tasks. MAS-Bench defines a hybrid interaction paradigm in which intelligent agents combine flexible GUI operations (click, swipe, type) with shortcut invocation (API calls, deep links, RPA scripts) to accelerate mobile workflows. The benchmark provides a standardized methodology for measuring both agent effectiveness and innovation in shortcut use—a foundational contribution for advancing intelligent agent research on ubiquitous smartphone platforms (Zhao et al., 8 Sep 2025).

1. Benchmark Overview and Motivation

MAS-Bench was created to address deficiencies in the evaluation of mobile GUI agents, which historically rely almost exclusively on primitive GUI actions. Recognizing that the integration of shortcuts (such as APIs, deep links, and RPA macros) can significantly enhance task automation efficiency, MAS-Bench formalizes a hybrid environment.

Key attributes include:

  • 139 complex tasks spanning 11 Android applications (e.g., YouTube, Amazon, Booking.com).
  • Distinct coverage for single-app (92 tasks) and cross-app workflows (47 tasks).
  • A knowledge base of 88 predefined shortcuts, distributed among APIs (identified via documentation and static analysis), deep links (parsed from manifest files), and RPA scripts (manually crafted for frequent subtasks).
  • Each task is, by design, solvable purely via GUI operations, enabling controlled assessment of the efficiency advantage conferred by shortcut use.

MAS-Bench thereby enables rigorous comparison between GUI-only and hybrid agents, capturing both raw success rates and operational efficiency.
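The task and shortcut structure described above can be sketched as simple data types. This is a minimal illustration, not code from the benchmark; all class and field names here are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum

class ShortcutKind(Enum):
    API = "api"              # programmatic endpoint found via docs / static analysis
    DEEP_LINK = "deep_link"  # URI entry point parsed from the app manifest
    RPA_SCRIPT = "rpa"       # manually crafted macro for a frequent subtask

@dataclass
class Shortcut:
    name: str
    kind: ShortcutKind
    target_app: str

@dataclass
class Task:
    task_id: str
    apps: list               # one app for single-app tasks, several for cross-app
    optimal_gui_steps: int   # every task is solvable via GUI actions alone
    shortcuts: list = field(default_factory=list)

    @property
    def is_cross_app(self) -> bool:
        return len(self.apps) > 1

# Tiny illustrative catalog mirroring the benchmark's composition.
kb = [Shortcut("search_product", ShortcutKind.API, "Amazon"),
      Shortcut("open_watch_history", ShortcutKind.DEEP_LINK, "YouTube")]
task = Task("t001", ["Amazon"], optimal_gui_steps=9, shortcuts=kb[:1])
```

The `optimal_gui_steps` field reflects the design constraint that each task has a known GUI-only solution, which is what makes the efficiency comparison controlled.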

2. Evaluation Metrics: Effectiveness and Efficiency

MAS-Bench employs a multidimensional evaluation protocol encompassing the following metrics:

Metric | Purpose | Description
Success Rate (SR) | Effectiveness | Fraction of tasks fully completed
Mean Steps (MS) / Mean Step Ratio (MSR) | Efficiency | MSR = actual steps / optimal steps; measures policy optimality
Mean Execution Time (MET) | Efficiency | Average task completion duration
Mean kToken Cost (MToC) | Cost | Number of prompt tokens consumed
Mean Shortcut Call Count (MSC) | Cost | Shortcut usage frequency
GUI-to-Shortcut Action Ratio (GSAR) | Policy | Proportion of GUI vs. shortcut actions taken

These metrics enable fine-grained analysis of agents' ability not just to complete tasks, but to do so optimally. For example, a lower MSR directly indicates a more efficient behavioral policy.
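The core ratio metrics are straightforward to compute from per-task logs. The following is an illustrative sketch of their definitions, not the benchmark's evaluation code; function names are my own.

```python
def success_rate(outcomes):
    """SR: fraction of tasks fully completed."""
    return sum(outcomes) / len(outcomes)

def mean_step_ratio(actual_steps, optimal_steps):
    """MSR: mean of actual/optimal steps per task; 1.0 is a perfectly efficient policy."""
    ratios = [a / o for a, o in zip(actual_steps, optimal_steps)]
    return sum(ratios) / len(ratios)

def gui_to_shortcut_ratio(n_gui, n_shortcut):
    """GSAR: share of GUI actions among all actions the agent took."""
    return n_gui / (n_gui + n_shortcut)

# Example: 3 tasks, two completed; the hybrid policy stays close to optimal.
sr = success_rate([True, True, False])         # 2/3
msr = mean_step_ratio([10, 6, 14], [8, 6, 7])  # (1.25 + 1.0 + 2.0) / 3
```

A policy that leans on shortcuts typically shows a lower MSR and GSAR together: fewer total steps, and a larger share of them being shortcut invocations.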

3. Hybrid Paradigm and Action Space

The hybrid paradigm underpinning MAS-Bench permits agents to interleave primitive GUI actions (tap, swipe, type, home, back) with shortcut invocations (API calls, deep links, and RPA macros). Notably, every MAS-Bench task admits a GUI-only solution; however, intelligently exploiting shortcuts allows bypassing repetitive or lengthy interaction sequences.
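One way to picture this hybrid action space is as a single dispatch over two action families. This is a schematic sketch under assumed names (`StubEnv`, `step`), not the benchmark's actual interface.

```python
GUI_ACTIONS = {"tap", "swipe", "type", "home", "back"}
SHORTCUT_ACTIONS = {"call_api", "open_deep_link", "run_rpa"}

def step(env, action, payload):
    """Dispatch one hybrid action: a GUI primitive or a shortcut invocation."""
    if action in GUI_ACTIONS:
        return env.gui(action, payload)       # e.g. tap at coordinates, type text
    if action in SHORTCUT_ACTIONS:
        return env.shortcut(action, payload)  # bypasses a repetitive GUI sub-sequence
    raise ValueError(f"unknown action: {action}")

class StubEnv:
    """Hypothetical stand-in for a real device environment."""
    def gui(self, action, payload):
        return ("gui", action)
    def shortcut(self, action, payload):
        return ("shortcut", action)
```

Because both families return through the same `step` interface, an agent policy can interleave them freely within a single trajectory, which is exactly the behavior the benchmark measures.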

Predefined shortcuts stem from static and dynamic analysis:

  • APIs: Programmatic endpoints for direct business logic invocation.
  • Deep Links: URI entry points targeting specific screens or functions.
  • RPA Scripts: Automations for recurring workflow fragments (e.g., product checkout).

Agent-generated shortcuts are supported, evaluated, and compared to baseline performance. Agents may learn shortcuts either by replaying and abstracting action trajectories (macro-level or subtask-level) or by dynamically recording context-specific workflows (dynamic shortcut grounding).
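The trajectory-abstraction idea can be illustrated as turning a recorded action sequence into a parameterized macro. This is a minimal sketch of the concept under assumed names; the paper's agents are more sophisticated.

```python
def abstract_macro(trajectory, params):
    """Turn a successful action trajectory into a reusable macro:
    concrete arguments named in `params` (e.g. a typed search query)
    become parameters filled in at call time."""
    def macro(**kwargs):
        actions = []
        for action, arg in trajectory:
            filled = kwargs.get(arg, arg) if arg in params else arg
            actions.append((action, filled))
        return actions
    return macro

# A recorded 'search' sub-sequence with the query abstracted out.
traj = [("tap", "search_box"), ("type", "query"), ("tap", "submit")]
search = abstract_macro(traj, params={"query"})
```

Calling `search(query="wireless mouse")` replays the sub-sequence with the new query, which is the subtask-level abstraction described above; dynamic grounding would additionally re-resolve targets like `search_box` against the current UI.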

4. Shortcut Generation and Macro Replay

MAS-Bench enables explicit benchmarking of agents' shortcut-generation capabilities. The framework distinguishes:

  • Predefined Shortcuts: Static catalog, verified for correctness.
  • Agent-Generated Shortcuts:
    • Macro-level (entire tasks replayed as callable macros),
    • Subtask-level (frequently-encountered sub-sequence abstraction),
    • Dynamic grounding (real-time adaptation for UI changes).

A table in the paper reports, for example, 39 task-level and 46 subtask-level agent-generated shortcuts, 45 dynamic shortcuts, and an additional set generated using MobileAgent-E procedures.

Agent performance is measured both for shortcut invocation frequency and shortcut success rate (SSR), with SSR = 1.0 for predefined shortcuts and slightly lower for agent-generated ones due to adaptivity constraints.
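Breaking SSR down by shortcut origin makes the predefined-versus-generated gap visible. A minimal sketch, assuming a per-call log of (origin, success) pairs; the function name is illustrative.

```python
from collections import defaultdict

def ssr_by_origin(calls):
    """Per-origin shortcut success rate: predefined shortcuts are verified,
    so they sit near SSR = 1.0; agent-generated ones trail under UI changes."""
    totals, ok = defaultdict(int), defaultdict(int)
    for origin, success in calls:
        totals[origin] += 1
        ok[origin] += success
    return {origin: ok[origin] / totals[origin] for origin in totals}

calls = [("predefined", True), ("predefined", True),
         ("agent_generated", True), ("agent_generated", False)]
# → {"predefined": 1.0, "agent_generated": 0.5}
```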

5. Empirical Results: Efficiency Gains and Robustness

Empirical assessment shows consistent and substantial performance advantages for hybrid agents relative to GUI-only baselines:

  • In single-app tasks, hybrid agents (e.g., MAS-MobileAgent) achieve SR up to 64.1% (Gemini-2.5-Pro model) compared to approximately 44.6% for GUI-only agents.
  • Efficiency: Hybrid agents demonstrate reductions in MSR, MET, and MToC, indicative of optimal workflow execution and reduced computational resources.
  • Cross-app tasks: The inclusion of shortcut capabilities increases SR from 0% (GUI-only) to >23% (Gemini-2.0-Flash) and significantly reduces execution times—an especially notable improvement in scenarios that demand application switching and context preservation.
  • Shortcut Quality: Predefined shortcuts maintain nearly perfect SSR; agent-generated shortcuts require further research for robustness, particularly under dynamic UI conditions.

6. Significance, Limitations, and Future Directions

MAS-Bench sets a new standard for evaluating the convergence of GUI and shortcut paradigms. It encourages agent designers to emphasize intelligent shortcut selection, adaptive macro generation, and comprehensive knowledge integration. Gains in SR and efficiency are especially pronounced for weaker base models, underscoring MAS-Bench's utility in scalable agent development.

Identified challenges include planning errors, behavioral adaptation to unanticipated UI changes, and optimization of shortcut generation algorithms. By methodically documenting failure modes, MAS-Bench supports targeted research into more robust agents.

As mobile automation becomes central to digital productivity, MAS-Bench will underpin rigorous evaluation of agent architectures that combine flexible GUI interaction with efficient shortcut use. The benchmark's extensibility and empirical granularity promise continued relevance for the mobile agent research community.

7. Conclusion

MAS-Bench provides a unified platform for systematically evaluating shortcut-augmented hybrid GUI agents in mobile environments. By cataloging real-world application tasks, integrating a versatile shortcut knowledge base, and defining rigorous evaluation criteria that measure both effectiveness and efficiency, MAS-Bench fills a critical gap in intelligent agent benchmarking. It supports research into both agent-controlled shortcut generation and hybrid policy design, and establishes a foundation for future development of more adaptive, efficient, and robust mobile GUI agents (Zhao et al., 8 Sep 2025).
