Evaluating Multi-Modal OS Agents at Scale
The paper presents a comprehensive evaluation of multi-modal operating system (OS) agents, specifically within the Windows environment. Large multi-modal models such as GPT-4V and GPT-4o have demonstrated significant potential for enhancing human productivity through planning and reasoning across many domains. Evaluating these agents in real-world scenarios, however, remains challenging: existing benchmarks rarely offer realistic, reproducible OS environments, and multi-step evaluations are slow to run at scale.
Benchmark Design
The authors introduce WindowsAgentArena (WAA), a scalable and reproducible environment designed to evaluate agent performance within a real Windows OS. This environment allows agents to interact with a wide range of applications and tools, similar to how human users would. The benchmark builds upon the OSWorld framework, adapting it to create over 150 diverse tasks that require planning, screen understanding, and tool usage.
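For illustration, a WAA-style task pairs a natural-language instruction with setup steps and a state-based evaluator. The sketch below is hypothetical: the field names are illustrative, not the exact WAA schema, but they convey the shape of an OSWorld-derived task definition:

```python
# Hypothetical task definition in the spirit of WAA's OSWorld-derived
# format; field names and values are illustrative, not the exact schema.
task = {
    "id": "notepad-save-as-01",
    "instruction": "Open Notepad, type 'hello', and save it as hello.txt "
                   "on the Desktop.",
    "config": [  # setup steps executed before the agent starts
        {"type": "launch", "parameters": {"command": "notepad.exe"}},
    ],
    "evaluator": {  # success is judged on the final OS state
        "func": "file_content_matches",
        "expected": {"path": "C:/Users/agent/Desktop/hello.txt",
                     "content": "hello"},
    },
}
```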
The benchmark can be parallelized on Azure, allowing a full evaluation run to complete in as little as 20 minutes, significantly faster than running the multi-step tasks sequentially. The tasks are designed to reflect typical user workflows in Windows, including document editing, web browsing, system tasks, coding, and media consumption.
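A minimal sketch of how such parallel evaluation might be orchestrated, assuming a hypothetical run_task_in_vm helper that provisions an isolated Windows VM, runs the agent loop for one task, and returns the evaluator's verdict:

```python
# Illustrative orchestration of parallel benchmark runs; VM provisioning
# on Azure is abstracted behind a hypothetical helper.
from concurrent.futures import ThreadPoolExecutor

def run_task_in_vm(task_id: str) -> bool:
    # Hypothetical helper: provision an isolated Windows VM, run the
    # agent loop for one task, and return the evaluator's verdict.
    return False  # placeholder so the sketch runs end-to-end

task_ids = [f"task-{i:03d}" for i in range(150)]  # placeholder IDs
with ThreadPoolExecutor(max_workers=40) as pool:
    results = list(pool.map(run_task_in_vm, task_ids))
print(f"success rate: {sum(results) / len(results):.1%}")
```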
Multi-Modal Agent: Navi
To demonstrate the capabilities of WAA, the authors introduce Navi, a new multi-modal agent. Navi can understand and navigate the Windows environment autonomously, achieving a success rate of 19.5% compared to 74.5% for human users. Navi also shows strong performance on the Mind2Web benchmark, highlighting its versatility. The agent uses a combination of chain-of-thought prompting and various methods for screen representation, including UIA tree parsing, OCR, icon and image detection, and OmniParser.
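To make the screen-representation idea concrete, here is a minimal sketch of how elements from the different parsers might be fused into a Set-of-Marks prompt; the Element format and function names are assumptions, not the paper's exact pipeline:

```python
# Sketch of fusing screen-parser outputs (UIA tree, OCR, icon detection)
# into a Set-of-Marks prompt; the data format here is an assumption.
from dataclasses import dataclass

@dataclass
class Element:
    mark_id: int   # numeric tag overlaid on the screenshot
    source: str    # "uia", "ocr", or "icon_detector"
    text: str      # accessible name or recognized text
    bbox: tuple    # (left, top, right, bottom) in pixels

def build_som_prompt(elements: list[Element], instruction: str) -> str:
    """Serialize marked elements so the model can refer to them by ID."""
    lines = [f"[{e.mark_id}] ({e.source}) '{e.text}' at {e.bbox}"
             for e in elements]
    return (
        f"Task: {instruction}\n"
        "Interactive elements (by mark ID):\n" + "\n".join(lines) + "\n"
        "Think step by step, then output one action, e.g. click(5)."
    )
```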
Key Contributions
- WindowsAgentArena (WAA): A diverse and scalable benchmark environment tailored for Windows OS.
- Multi-Modal Agent (Navi): A new agent that achieves notable performance across various benchmarks.
- Open-Source Contributions: The code, benchmark, and models are made available to facilitate further research and development.
Methodology
Task Definition and Evaluation
Tasks in WAA are formalized as partially observable Markov decision processes (POMDPs), with well-defined state spaces, observation spaces, and reward functions. The agent's actions are evaluated based on changes in the OS state, ensuring that task completion is the primary criterion. The observation space includes various elements like foreground and background window titles, clipboard content, and screen representations (UIA tree, DOM tree, OCR, etc.).
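For reference, the standard POMDP formalization reads as follows; the concrete spaces are the WAA-specific ones listed above, and the binary reward shown here is a simplification (evaluators could in principle award partial credit):

```latex
\text{Each task is a POMDP } (\mathcal{S}, \mathcal{O}, \mathcal{A}, T, R):\\
T \colon \mathcal{S} \times \mathcal{A} \to \mathcal{S}
  \quad \text{(executing an action transitions the OS state)}\\
o_t = \phi(s_t) \in \mathcal{O}
  \quad \text{(observation: window titles, clipboard, UIA tree, screenshot, \dots)}\\
R \colon \mathcal{S} \to \{0, 1\}
  \quad \text{(reward: 1 iff the check on the final OS state passes)}
```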
Action Space
The action space consists of free-form Python code execution and function wrappers for GUI, keyboard, and OS-related tasks. The Computer class actions allow precise interactions with the OS, such as moving the mouse, clicking, typing, and managing clipboard content.
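As a rough illustration of this action space, the sketch below wraps common GUI primitives behind a Computer-like class using the real pyautogui and pyperclip libraries; the method names are assumptions rather than the exact WAA API:

```python
# Illustrative action wrapper in the spirit of WAA's Computer class;
# method names are hypothetical, not the exact WAA API.
import pyautogui   # GUI automation: mouse and keyboard
import pyperclip   # clipboard access

class Computer:
    def move_mouse(self, x: int, y: int) -> None:
        """Move the cursor to absolute screen coordinates."""
        pyautogui.moveTo(x, y)

    def click(self, x: int, y: int, button: str = "left") -> None:
        """Click at (x, y) with the given mouse button."""
        pyautogui.click(x, y, button=button)

    def type_text(self, text: str) -> None:
        """Type a string via synthetic keystrokes."""
        pyautogui.write(text, interval=0.02)

    def hotkey(self, *keys: str) -> None:
        """Press a key combination, e.g. hotkey('ctrl', 's')."""
        pyautogui.hotkey(*keys)

    def set_clipboard(self, text: str) -> None:
        pyperclip.copy(text)

    def get_clipboard(self) -> str:
        return pyperclip.paste()

# The agent emits free-form Python that calls these primitives, e.g.:
# computer.click(412, 305); computer.type_text("report.docx")
```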
Results and Analysis
Human Evaluation
Human performance on WAA tasks was measured, resulting in a 74.5% success rate. The evaluation covered tasks across different domains, including document editing, web browsing, and system tasks. On average, tasks took around 8.1 steps to complete, with VLC Player tasks being the most challenging.
Agent Performance
Navi's performance was evaluated across several configurations, showing the importance of accurate Set-of-Marks (SoM) annotations and precise visual-language alignment. The best configuration combined UIA-derived markers with pixel-based detection models, reaching the 19.5% success rate. Clear performance gaps were observed between backbone models, with stronger models such as GPT-4V-1106 performing best.
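Set-of-Marks annotation itself is straightforward to sketch: numbered boxes are drawn over detected elements so the model can refer to them by ID. A minimal Pillow example follows; coordinates here are placeholders, whereas in WAA they would come from the UIA or pixel-based parsers:

```python
# Sketch of rendering Set-of-Marks annotations onto a screenshot with
# Pillow; boxes would come from the UIA tree or pixel-based detectors.
from PIL import Image, ImageDraw

def draw_marks(screenshot: Image.Image,
               boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numbered box around each detected element."""
    annotated = screenshot.copy()
    draw = ImageDraw.Draw(annotated)
    for i, (l, t, r, b) in enumerate(boxes, start=1):
        draw.rectangle((l, t, r, b), outline="red", width=2)
        draw.text((l + 2, t + 2), str(i), fill="red")
    return annotated

# Usage (path and coordinates are placeholders):
# img = draw_marks(Image.open("screen.png"), [(40, 60, 180, 90)])
# img.save("screen_som.png")
```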
Implications and Future Directions
The research highlights several key areas for future development:
- Human-in-the-Loop Systems: Incorporating human feedback to enhance agent performance.
- Specialized Sub-Systems: Developing specialized agents for domain-specific tasks.
- Reinforcement Learning: Leveraging RL for training agents using generated data.
- Predefined Action Libraries: Using action libraries for higher execution precision.
- Safety and Security: Ensuring ethical and responsible AI use.
Conclusion
The introduction of WAA and the development of Navi provide a robust framework for evaluating multi-modal OS agents in a realistic Windows environment. The benchmark's scalability and the agent's results offer concrete insight into both the promise and the current limitations of autonomous computer-control agents, and the open-source release of the code, benchmark, and models is intended to accelerate research and foster collaboration in the field.