- The paper introduces macOSWorld, a benchmark evaluating macOS GUI agents via 202 interactive tasks in 30 apps and five languages.
- It reveals a stark performance gap where proprietary agents achieve over 30% success versus less than 2% for open-source models, with notable language-specific disparities.
- Safety evaluations expose significant vulnerabilities to deception attacks, underscoring the need for better defensive mechanisms in GUI agent design.
Overview of macOSWorld: A Multilingual Interactive Benchmark for GUI Agents
The paper introduces "macOSWorld," a benchmark for evaluating Graphical User Interface (GUI) agents within the macOS ecosystem. Prior benchmarks largely focus on environments such as Windows, Linux, and Android, leaving a gap in the evaluation tools available for macOS, a platform with distinct GUI paradigms and exclusive applications. The authors address three deficiencies in existing benchmarks: missing macOS coverage, tasks offered primarily in English, and inadequate attention to safety evaluation, specifically the vulnerability of GUI agents to deception attacks.
macOSWorld comprises 202 interactive tasks spanning 30 applications, 28 of which are exclusive to macOS. Tasks are available in five languages (English, Chinese, Arabic, Japanese, and Russian). The benchmark also includes a dedicated safety subset that tests agents against realistic macOS-style deceptive pop-up windows, a form of context deception attack.
Key Findings and Results
Six GUI agents were evaluated, and their performance diverged sharply. Proprietary computer-using agents achieved success rates above 30%, whereas open-source models stayed below 2%. Performance also varied by language, with the largest degradation in Arabic, where performance fell by 27.5% compared to English. These gaps underscore the adaptation challenges GUI agents face on macOS and across languages.
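As a rough illustration of how figures like these can be derived from per-task outcomes, the following minimal sketch computes per-language success rates and the drop relative to English. The function names, language codes, and sample data are hypothetical, and the summary does not say whether the 27.5% figure is an absolute or a relative drop, so the sketch reports both.

```python
from collections import defaultdict


def success_rates(results):
    """Map each language to its success rate in percent.

    `results` is an iterable of (language, succeeded) pairs,
    one pair per attempted task (hypothetical input format).
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for language, succeeded in results:
        totals[language] += 1
        wins[language] += int(succeeded)
    return {lang: 100.0 * wins[lang] / totals[lang] for lang in totals}


def degradation_vs_english(rates, language):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    base = rates["en"]
    absolute = base - rates[language]
    relative = 100.0 * absolute / base if base else 0.0
    return absolute, relative


# Made-up outcomes for illustration only: 10 English and 10 Arabic attempts.
results = (
    [("en", True)] * 4 + [("en", False)] * 6
    + [("ar", True)] * 3 + [("ar", False)] * 7
)
rates = success_rates(results)
abs_drop, rel_drop = degradation_vs_english(rates, "ar")
print(rates, abs_drop, rel_drop)
```

Reporting both numbers avoids ambiguity: with a low English baseline, a small absolute drop can still be a large relative one.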
The safety tests also showed that deception attacks are a general concern. The two open-source agents tested proved especially vulnerable, with deception rates of approximately 70%.
Implications and Future Directions
The macOSWorld benchmark has implications for both the development and the evaluation of GUI agents. The documented performance gap between proprietary and open-source models suggests that more research is needed to improve open-source models' understanding of macOS-specific navigation patterns and menu structures. The multilingual results point to a need for more robust handling of the text orientation and layout changes that different languages introduce, notably right-to-left scripts such as Arabic.
The safety results call for urgent development of defensive mechanisms against context deception attacks. GUI agents need stronger security and error-recovery capabilities before they can be deployed broadly and safely in real-world applications.
Future work could include expanding the benchmark to encompass non-binary reward systems, enabling more nuanced evaluations and facilitating reinforcement learning to further enhance GUI agent resilience and proficiency. Additionally, integrating adaptive learning methodologies might be considered to improve the agents’ capabilities across diverse macOS interfaces and multilingual settings.
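To make the idea of a non-binary reward concrete, here is a minimal sketch contrasting an all-or-nothing score with a graded one. It assumes each task can be broken into a list of boolean checkpoints; that decomposition, and both function names, are hypothetical and not taken from the paper.

```python
def binary_reward(checks):
    """All-or-nothing: 1.0 only if every checkpoint passed."""
    return 1.0 if checks and all(checks) else 0.0


def graded_reward(checks):
    """Partial credit: fraction of checkpoints that passed."""
    return sum(checks) / len(checks) if checks else 0.0


# Hypothetical task with four checkpoints, one of which failed.
checks = [True, True, False, True]
print(binary_reward(checks))  # 0.0
print(graded_reward(checks))  # 0.75
```

A graded signal of this kind distinguishes an agent that nearly completes a task from one that fails immediately, which is the denser feedback reinforcement learning typically benefits from.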
In conclusion, macOSWorld provides a comprehensive evaluation tool for GUI agents on macOS. It addresses current shortcomings in multilingual support and safety evaluation, and it offers a basis for future work on adaptive GUI interaction frameworks.