Evaluating Multi-Modal OS Agents at Scale
The paper presents a comprehensive evaluation of multi-modal operating system (OS) agents, specifically within the Windows environment. Large multi-modal models such as GPT-4V and GPT-4o have demonstrated significant potential for enhancing human productivity through planning and reasoning across many domains. Evaluating these agents in real-world scenarios, however, remains challenging: existing benchmarks rarely offer realistic, reproducible OS environments, and multi-step evaluations are slow to run at scale.
Benchmark Design
The authors introduce WindowsAgentArena (WAA), a scalable and reproducible environment designed to evaluate agent performance within a real Windows OS. This environment allows agents to interact with a wide range of applications and tools, similar to how human users would. The benchmark builds upon the OSWorld framework, adapting it to create over 150 diverse tasks that require planning, screen understanding, and tool usage.
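For illustration, a WAA-style task pairs a natural-language instruction with setup steps and a state-based evaluator. The sketch below is hypothetical: the field names are illustrative, not the exact WAA schema, but they convey the shape of an OSWorld-derived task definition:

```python
# Hypothetical task definition in the spirit of WAA's OSWorld-derived
# format; field names and values are illustrative, not the exact schema.
task = {
    "id": "notepad-save-as-01",
    "instruction": "Open Notepad, type 'hello', and save it as hello.txt "
                   "on the Desktop.",
    "config": [  # setup steps executed before the agent starts
        {"type": "launch", "parameters": {"command": "notepad.exe"}},
    ],
    "evaluator": {  # success is judged on the final OS state
        "func": "file_content_matches",
        "expected": {"path": "C:/Users/agent/Desktop/hello.txt",
                     "content": "hello"},
    },
}
```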
The benchmark can be parallelized on Azure, allowing a full evaluation run to complete in as little as 20 minutes, significantly faster than running the multi-step tasks sequentially. The tasks are designed to reflect typical user workflows in Windows, including document editing, web browsing, system tasks, coding, and media consumption.
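A minimal sketch of how such parallel evaluation might be orchestrated, assuming a hypothetical run_task_in_vm helper that provisions an isolated Windows VM, runs the agent loop for one task, and returns the evaluator's verdict:

```python
# Illustrative orchestration of parallel benchmark runs; VM provisioning
# on Azure is abstracted behind a hypothetical helper.
from concurrent.futures import ThreadPoolExecutor

def run_task_in_vm(task_id: str) -> bool:
    # Hypothetical helper: provision an isolated Windows VM, run the
    # agent loop for one task, and return the evaluator's verdict.
    return False  # placeholder so the sketch runs end-to-end

task_ids = [f"task-{i:03d}" for i in range(150)]  # placeholder IDs
with ThreadPoolExecutor(max_workers=40) as pool:
    results = list(pool.map(run_task_in_vm, task_ids))
print(f"success rate: {sum(results) / len(results):.1%}")
```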
Multi-Modal Agent: Navi
To demonstrate the capabilities of WAA, the authors introduce Navi, a new multi-modal agent. Navi can understand and navigate the Windows environment autonomously, achieving a success rate of 19.5% compared to 74.5% for human users. Navi also shows strong performance on the Mind2Web benchmark, highlighting its versatility. The agent uses a combination of chain-of-thought prompting and various methods for screen representation, including UIA tree parsing, OCR, icon and image detection, and OmniParser.
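To make the screen-representation idea concrete, here is a minimal sketch of how elements from the different parsers might be fused into a Set-of-Marks prompt; the Element format and function names are assumptions, not the paper's exact pipeline:

```python
# Sketch of fusing screen-parser outputs (UIA tree, OCR, icon detection)
# into a Set-of-Marks prompt; the data format here is an assumption.
from dataclasses import dataclass

@dataclass
class Element:
    mark_id: int   # numeric tag overlaid on the screenshot
    source: str    # "uia", "ocr", or "icon_detector"
    text: str      # accessible name or recognized text
    bbox: tuple    # (left, top, right, bottom) in pixels

def build_som_prompt(elements: list[Element], instruction: str) -> str:
    """Serialize marked elements so the model can refer to them by ID."""
    lines = [f"[{e.mark_id}] ({e.source}) '{e.text}' at {e.bbox}"
             for e in elements]
    return (
        f"Task: {instruction}\n"
        "Interactive elements (by mark ID):\n" + "\n".join(lines) + "\n"
        "Think step by step, then output one action, e.g. click(5)."
    )
```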
Key Contributions
- WindowsAgentArena (WAA): A diverse and scalable benchmark environment tailored for Windows OS.
- Multi-Modal Agent (Navi): A new agent that achieves notable performance across various benchmarks.
- Open-Source Contributions: The code, benchmark, and models are made available to facilitate further research and development.
Methodology
Task Definition and Evaluation
Tasks in WAA are formalized as partially observable Markov decision processes (POMDPs), with well-defined state spaces, observation spaces, and reward functions. The agent's actions are evaluated based on changes in the OS state, ensuring that task completion is the primary criterion. The observation space includes various elements like foreground and background window titles, clipboard content, and screen representations (UIA tree, DOM tree, OCR, etc.).
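For reference, the standard POMDP formalization reads as follows; the concrete spaces are the WAA-specific ones listed above, and the binary reward shown here is a simplification (evaluators could in principle award partial credit):

```latex
\text{Each task is a POMDP } (\mathcal{S}, \mathcal{O}, \mathcal{A}, T, R):\\
T \colon \mathcal{S} \times \mathcal{A} \to \mathcal{S}
  \quad \text{(executing an action transitions the OS state)}\\
o_t = \phi(s_t) \in \mathcal{O}
  \quad \text{(observation: window titles, clipboard, UIA tree, screenshot, \dots)}\\
R \colon \mathcal{S} \to \{0, 1\}
  \quad \text{(reward: 1 iff the check on the final OS state passes)}
```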
Action Space
The action space consists of free-form Python code execution and function wrappers for GUI, keyboard, and OS-related tasks. The Computer class actions allow precise interactions with the OS, such as moving the mouse, clicking, typing, and managing clipboard content.
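As a rough illustration of this action space, the sketch below wraps common GUI primitives behind a Computer-like class using the real pyautogui and pyperclip libraries; the method names are assumptions rather than the exact WAA API:

```python
# Illustrative action wrapper in the spirit of WAA's Computer class;
# method names are hypothetical, not the exact WAA API.
import pyautogui   # GUI automation: mouse and keyboard
import pyperclip   # clipboard access

class Computer:
    def move_mouse(self, x: int, y: int) -> None:
        """Move the cursor to absolute screen coordinates."""
        pyautogui.moveTo(x, y)

    def click(self, x: int, y: int, button: str = "left") -> None:
        """Click at (x, y) with the given mouse button."""
        pyautogui.click(x, y, button=button)

    def type_text(self, text: str) -> None:
        """Type a string via synthetic keystrokes."""
        pyautogui.write(text, interval=0.02)

    def hotkey(self, *keys: str) -> None:
        """Press a key combination, e.g. hotkey('ctrl', 's')."""
        pyautogui.hotkey(*keys)

    def set_clipboard(self, text: str) -> None:
        pyperclip.copy(text)

    def get_clipboard(self) -> str:
        return pyperclip.paste()

# The agent emits free-form Python that calls these primitives, e.g.:
# computer.click(412, 305); computer.type_text("report.docx")
```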
Results and Analysis
Human Evaluation
Human performance on WAA tasks was measured, resulting in a 74.5% success rate. The evaluation covered tasks across different domains, including document editing, web browsing, and system tasks. On average, tasks took around 8.1 steps to complete, with VLC Player tasks being the most challenging.
Agent Performance
Navi's performance was evaluated across several configurations, showing the importance of accurate Set-of-Marks (SoM) annotations and precise visual-language alignment. The best configuration combined UIA-derived markers with pixel-based detection models, reaching the 19.5% success rate. Clear performance gaps were observed between backbone models, with stronger models such as GPT-4V-1106 performing best.
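Set-of-Marks annotation itself is straightforward to sketch: numbered boxes are drawn over detected elements so the model can refer to them by ID. A minimal Pillow example follows; coordinates here are placeholders, whereas in WAA they would come from the UIA or pixel-based parsers:

```python
# Sketch of rendering Set-of-Marks annotations onto a screenshot with
# Pillow; boxes would come from the UIA tree or pixel-based detectors.
from PIL import Image, ImageDraw

def draw_marks(screenshot: Image.Image,
               boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numbered box around each detected element."""
    annotated = screenshot.copy()
    draw = ImageDraw.Draw(annotated)
    for i, (l, t, r, b) in enumerate(boxes, start=1):
        draw.rectangle((l, t, r, b), outline="red", width=2)
        draw.text((l + 2, t + 2), str(i), fill="red")
    return annotated

# Usage (path and coordinates are placeholders):
# img = draw_marks(Image.open("screen.png"), [(40, 60, 180, 90)])
# img.save("screen_som.png")
```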
Implications and Future Directions
The research highlights several key areas for future development:
- Human-in-the-Loop Systems: Incorporating human feedback to enhance agent performance.
- Specialized Sub-Systems: Developing specialized agents for domain-specific tasks.
- Reinforcement Learning: Leveraging RL for training agents using generated data.
- Predefined Action Libraries: Using action libraries for higher execution precision.
- Safety and Security: Ensuring ethical and responsible AI use.
Conclusion
The introduction of WAA and the development of Navi provide a robust framework for evaluating multi-modal OS agents in a realistic Windows environment. The benchmark's scalability and the agent's results offer concrete insight into both the promise and the current limitations of autonomous computer-control agents, and the open-source release of the code, benchmark, and models is intended to accelerate research and foster collaboration in the field.