macOSWorld: A Multilingual Interactive Benchmark for GUI Agents (2506.04135v2)

Published 4 Jun 2025 in cs.AI

Abstract: Graphical User Interface (GUI) agents show promising capabilities for automating computer-use tasks and facilitating accessibility, but existing interactive benchmarks are mostly English-only, covering web-use or Windows, Linux, and Android environments, but not macOS. macOS is a major OS with distinctive GUI patterns and exclusive applications. To bridge the gaps, we present macOSWorld, the first comprehensive benchmark for evaluating GUI agents on macOS. macOSWorld features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with task instructions and OS interfaces offered in 5 languages (English, Chinese, Arabic, Japanese, and Russian). As GUI agents are shown to be vulnerable to deception attacks, macOSWorld also includes a dedicated safety benchmarking subset. Our evaluation on six GUI agents reveals a dramatic gap: proprietary computer-use agents lead at above 30% success rate, while open-source lightweight research models lag at below 2%, highlighting the need for macOS domain adaptation. Multilingual benchmarks also expose common weaknesses, especially in Arabic, with a 27.5% average degradation compared to English. Results from safety benchmarking also highlight that deception attacks are more general and demand immediate attention. macOSWorld is available at https://github.com/showlab/macosworld.

Summary

The paper introduces macOSWorld, a benchmark evaluating macOS GUI agents via 202 interactive tasks in 30 apps and five languages.
It reveals a stark performance gap where proprietary agents achieve over 30% success versus less than 2% for open-source models, with notable language-specific disparities.
Safety evaluations expose significant vulnerabilities to deception attacks, underscoring the need for better defensive mechanisms in GUI agent design.

Overview of macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

The paper introduces "macOSWorld," a comprehensive benchmark designed to evaluate Graphical User Interface (GUI) agents within the macOS ecosystem. Prior interactive benchmarks largely focus on web use or on Windows, Linux, and Android, leaving macOS without comparable evaluation tools. By targeting macOS, a platform with distinct GUI paradigms and exclusive applications, the authors address three deficiencies in existing benchmarks: no macOS coverage, predominantly English-only tasks, and inadequate attention to safety evaluation, specifically GUI agents' vulnerability to deception attacks.

macOSWorld comprises 202 interactive tasks spanning 30 applications, 28 of which are exclusive to macOS. Task instructions and OS interfaces are offered in five languages (English, Chinese, Arabic, Japanese, and Russian). The benchmark also includes a dedicated safety subset that tests agents against realistic macOS-style deceptive pop-up windows, a form of context deception attack.

Key Findings and Results

Evaluating six GUI agents revealed a wide performance gap. Proprietary computer-use agents led with success rates above 30%, whereas lightweight open-source research models lagged below 2%. The multilingual evaluation also exposed language-specific weaknesses, most strikingly in Arabic, where performance degraded by 27.5% on average compared to English. These gaps underscore the adaptation challenges GUI agents face in the macOS domain and across languages.
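
The success-rate and degradation figures above can be read as simple ratios over per-task outcomes. The following is a minimal sketch, not the authors' evaluation code. It assumes each task yields a binary pass/fail outcome and that degradation is measured relative to the English success rate (this summary does not state whether the 27.5% figure is relative or absolute). The function names and the outcome lists are hypothetical and do not come from the paper.

```python
def success_rate(outcomes):
    """Fraction of tasks solved, given binary outcomes (1 = success, 0 = failure)."""
    if not outcomes:
        raise ValueError("no task outcomes provided")
    return sum(outcomes) / len(outcomes)


def relative_degradation(baseline, other):
    """Drop from `baseline` to `other`, expressed as a fraction of `baseline`.

    Assumption: degradation is relative to the baseline language.
    """
    if baseline == 0:
        raise ValueError("baseline success rate is zero; relative degradation is undefined")
    return (baseline - other) / baseline


# Hypothetical per-task outcomes for a single agent (illustrative only).
outcomes = {
    "English": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "Arabic": [1, 0, 0, 1, 0, 1, 0, 0, 1, 1],
}

rates = {lang: success_rate(results) for lang, results in outcomes.items()}
drop = relative_degradation(rates["English"], rates["Arabic"])

for lang, rate in rates.items():
    print(f"{lang}: {rate:.1%} success")
print(f"Arabic degradation relative to English: {drop:.1%}")
```

An average degradation across agents would then be the mean of `drop` over all evaluated agents, again under the relative-degradation assumption.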

Additionally, the safety tests revealed that deception attacks are a general concern: the two open-source agents tested were both affected, with deception rates of approximately 70%.

Implications and Future Directions

The macOSWorld benchmark has implications for both the development and evaluation of GUI agents. The documented performance gap between proprietary and open-source models suggests that more research is needed to improve open-source models' understanding of macOS-specific navigation patterns and menu structures. The multilingual results point to a need for more robust handling of the text-orientation and layout changes that different languages introduce, notably right-to-left scripts such as Arabic.

The safety results call for the urgent development of defensive mechanisms against context deception attacks, and suggest that GUI agents need stronger security and error-recovery capabilities before they can be deployed broadly and safely in real-world applications.

Future work could include expanding the benchmark to encompass non-binary reward systems, enabling more nuanced evaluations and facilitating reinforcement learning to further enhance GUI agent resilience and proficiency. Additionally, integrating adaptive learning methodologies might be considered to improve the agents’ capabilities across diverse macOS interfaces and multilingual settings.

In conclusion, macOSWorld sets a new standard for comprehensive evaluation tools for GUI agents on macOS, aiming to mitigate current shortcomings in multilingual support and safety considerations, while steering future advancements in adaptive GUI interaction frameworks.
