Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Published 12 Sep 2024 in cs.AI (arXiv:2409.08264v2)

Abstract: LLMs show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena.

Webpage: https://microsoft.github.io/WindowsAgentArena
Code: https://github.com/microsoft/WindowsAgentArena

Summary

  • The paper introduces WindowsAgentArena, a reproducible benchmark that simulates real Windows workflows for evaluating multi-modal OS agents.
  • It presents Navi, a multi-modal agent using chain-of-thought prompting and advanced screen representation techniques, achieving a 19.5% success rate.
  • The study emphasizes open-source tools and outlines future directions like reinforcement learning and human-in-the-loop methods to enhance agent performance.

Evaluating Multi-Modal OS Agents at Scale

The paper presents a comprehensive evaluation of multi-modal operating system (OS) agents within the Windows environment. Large multi-modal models such as GPT-4V and GPT-4o have demonstrated significant potential for enhancing human productivity through planning and reasoning across multiple domains. However, evaluating these agents in realistic scenarios remains challenging: existing benchmarks cover only narrow modalities or domains, and multi-step evaluations are slow to run.

Benchmark Design

The authors introduce WindowsAgentArena (WAA), a scalable and reproducible environment designed to evaluate agent performance within a real Windows OS. This environment allows agents to interact with a wide range of applications and tools, similar to how human users would. The benchmark builds upon the OSWorld framework, adapting it to create over 150 diverse tasks that require planning, screen understanding, and tool usage.
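
To make the task format concrete, here is a hypothetical task definition in the spirit of the OSWorld-style configs that WAA adapts (all field names and values below are illustrative assumptions, not taken from the WAA repository):

```python
# Hypothetical WAA-style task definition (illustrative only).
example_task = {
    "id": "notepad-save-txt",
    "instruction": "Open Notepad, type 'hello world', and save it to C:\\Users\\user\\hello.txt",
    "config": [
        # Setup steps executed before the agent takes control.
        {"type": "launch", "parameters": {"command": "notepad.exe"}},
    ],
    "evaluator": {
        # Post-execution check against the OS state: the file must exist
        # and contain the expected text.
        "func": "check_file_contents",
        "expected": {"path": "C:\\Users\\user\\hello.txt", "content": "hello world"},
    },
}
```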

The benchmark can be parallelized in Azure, allowing a full evaluation in as little as 20 minutes rather than the days typical of serial runs. The tasks are designed to reflect typical user workflows in Windows, including document editing, web browsing, system tasks, coding, and media consumption.
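
To see why parallelization compresses the wall-clock time this much, a back-of-envelope sketch (the worker count and per-task time below are illustrative assumptions, not figures from the paper):

```python
import math

num_tasks = 154        # WAA ships "150+" tasks; exact count assumed here
workers = 40           # hypothetical number of parallel Azure VMs
minutes_per_task = 5   # hypothetical average episode duration

# Each VM processes its share of tasks serially; VMs run in parallel.
wall_clock = math.ceil(num_tasks / workers) * minutes_per_task
print(f"~{wall_clock} minutes wall-clock")  # ~20 minutes under these assumptions
```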

Multi-Modal Agent: Navi

To demonstrate the capabilities of WAA, the authors introduce Navi, a new multi-modal agent. Navi understands and navigates the Windows environment autonomously, achieving a 19.5% success rate compared to 74.5% for unassisted human users. Navi also shows strong performance on the web-based Mind2Web benchmark, highlighting its versatility. The agent combines chain-of-thought prompting with several screen-representation methods, including UIA tree parsing, OCR, icon and image detection, and OmniParser, which are fused into a Set-of-Marks (SoM) view of the screen.
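
A minimal sketch of the Set-of-Marks idea (the types and function names below are illustrative, not Navi's actual code): detected UI elements are assigned numeric marks that are overlaid on the screenshot and listed in the prompt, so the model can refer to click targets by ID rather than by raw pixel coordinates.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    # Bounding box in screen pixels plus a short description, as might come
    # from UIA parsing, OCR, or an icon/image detector such as OmniParser.
    left: int
    top: int
    width: int
    height: int
    description: str

def build_set_of_marks(elements: list[UIElement]) -> tuple[dict[int, UIElement], str]:
    """Assign a numeric mark to each element and render a textual listing.

    The model then acts by mark ID (e.g. "click element 3") and the harness
    resolves that ID back to pixel coordinates.
    """
    marks = dict(enumerate(elements))
    listing = "\n".join(
        f"[{i}] {el.description} @ ({el.left},{el.top},{el.width}x{el.height})"
        for i, el in marks.items()
    )
    return marks, listing

def mark_center(el: UIElement) -> tuple[int, int]:
    # Pixel coordinates a click action would target for a chosen mark.
    return el.left + el.width // 2, el.top + el.height // 2
```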

Key Contributions

  1. WindowsAgentArena (WAA): A diverse and scalable benchmark environment tailored for Windows OS.
  2. Multi-Modal Agent (Navi): A new agent that achieves notable performance across various benchmarks.
  3. Open-Source Contributions: The code, benchmark, and models are made available to facilitate further research and development.

Methodology

Task Definition and Evaluation

Tasks in WAA are formalized as partially observable Markov decision processes (POMDPs), with well-defined state spaces, observation spaces, and reward functions. The agent's actions are evaluated based on changes in the OS state, ensuring that task completion is the primary criterion. The observation space includes various elements like foreground and background window titles, clipboard content, and screen representations (UIA tree, DOM tree, OCR, etc.).
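
In notation, a sketch consistent with this setup (the symbols are ours, and the binary reward is a simplification, since some evaluators may grade partial credit):

```latex
% A task as a POMDP: state space S, observation space O, action space A,
% OS transition function T, and a sparse reward computed from the
% post-execution OS state by a task-specific evaluation script.
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{O}, \mathcal{A}, T, R), \qquad
  T : \mathcal{S} \times \mathcal{A} \to \mathcal{S}, \qquad
  R : \mathcal{S} \to \{0, 1\}.
\]
```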

Action Space

The action space consists of free-form Python code execution and function wrappers for GUI, keyboard, and OS-related tasks. The Computer class actions allow precise interactions with the OS, such as moving the mouse, clicking, typing, and managing clipboard content.
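
A minimal sketch of what such wrappers might look like, built on the real pyautogui and pyperclip libraries (the class and method names are illustrative; WAA's actual Computer API may differ):

```python
import pyautogui   # mouse and keyboard control
import pyperclip   # clipboard access

class Computer:
    """Illustrative function wrappers over GUI, keyboard, and clipboard actions."""

    def move_mouse(self, x: int, y: int) -> None:
        pyautogui.moveTo(x, y)

    def click(self, x: int, y: int, button: str = "left") -> None:
        pyautogui.click(x, y, button=button)

    def type_text(self, text: str) -> None:
        pyautogui.write(text, interval=0.05)

    def hotkey(self, *keys: str) -> None:
        # e.g. computer.hotkey("ctrl", "s") to save the active document
        pyautogui.hotkey(*keys)

    def set_clipboard(self, text: str) -> None:
        pyperclip.copy(text)

    def get_clipboard(self) -> str:
        return pyperclip.paste()

# An agent step is then free-form Python emitted by the model, for example:
#   computer = Computer()
#   computer.click(412, 305)
#   computer.type_text("quarterly report")
#   computer.hotkey("enter")
```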

Results and Analysis

Human Evaluation

Human performance on WAA tasks was measured, resulting in a 74.5% success rate. The evaluation covered tasks across different domains, including document editing, web browsing, and system tasks. On average, tasks took around 8.1 steps to complete, with VLC Player tasks being the most challenging.

Agent Performance

Navi's performance was evaluated across various configurations, underscoring the importance of an accurate Set-of-Marks (SoM) and precise visual-language alignment. The best configuration combined UIA markers with pixel-based models, achieving the 19.5% success rate. Performance gaps were observed between models, with more capable models such as GPT-4V-1106 performing better.

Implications and Future Directions

The research highlights several key areas for future development:

  • Human-in-the-Loop Systems: Incorporating human feedback to enhance agent performance.
  • Specialized Sub-Systems: Developing specialized agents for domain-specific tasks.
  • Reinforcement Learning: Leveraging RL for training agents using generated data.
  • Predefined Action Libraries: Using action libraries for higher execution precision.
  • Safety and Security: Ensuring ethical and responsible AI use.

Conclusion

The introduction of WAA and the development of Navi provide a robust framework for evaluating multi-modal OS agents. The benchmark's scalability and the agent's performance offer significant insights into the potential and challenges of autonomous computer control agents. The open-source contributions aim to accelerate research and development in the field, fostering innovation and collaboration.

Overall, the paper offers a structured approach to evaluating multi-modal agents in a realistic OS environment, addressing the speed and coverage limitations of prior benchmarks and providing valuable insights for future research.
