OSWorld and WindowsAgentArena Benchmark

Updated 22 October 2025
  • OSWorld and WindowsAgentArena are unified, execution-based benchmarking platforms that evaluate multimodal digital agents across Ubuntu and Windows environments.
  • They employ detailed VM snapshots, custom task scripts, and execution-based reward functions to simulate realistic multi-application scenarios.
  • These platforms reveal significant performance gaps between state-of-the-art agents and humans, guiding future research in grounding, operational knowledge, and cross-OS generalization.

OSWorld is a unified, execution-based benchmarking environment for evaluating multimodal digital agents in realistic computer settings, supporting task setup and execution-based evaluation across Ubuntu, Windows, and (in principle) macOS. WindowsAgentArena is an extension and specialization of this framework, offering a large-scale, reproducible benchmark for multi-modal agent evaluation and control specifically in the Windows operating system. Together, these platforms represent the modern standard for systematic agent assessment in real-world, open-ended, multi-application computer environments.

1. Design and Objectives

OSWorld is engineered as the first scalable, fully executable real computer environment for benchmarking multimodal agents (LLM-, VLM-, and tool-augmented systems) in open-ended computing tasks. Each agent operates as an autonomous user, receiving partial observations (screenshots, accessibility trees, or terminal output) and interacting with the environment via raw mouse and keyboard actions. The core of OSWorld is a partially observable Markov decision process (POMDP) defined by the tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where each step involves an observation $o_t \in \mathcal{O}$, an action $a_t \in \mathcal{A}$, a state transition governed by $\mathcal{T}$, and an execution-based reward $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$. Task execution is performed in virtualized OS environments (primarily Ubuntu and Windows), each initialized in a reproducible state with custom file, application, and UI configurations.

WindowsAgentArena directly adapts the OSWorld framework to the Windows ecosystem (file paths, PowerShell scripting, UI Automation trees, and app suite), expanding task diversity and bridging the reproducibility and evaluation requirements unique to Windows-based workflows (Xie et al., 11 Apr 2024, Bonatti et al., 12 Sep 2024).

2. Benchmarks, Tasks, and Execution Protocols

The OSWorld benchmark comprises 369 Ubuntu tasks and at least 43 Windows tasks, with each example reflecting real-world user scenarios such as document editing, email handling, data manipulation, and workflows that span multiple applications (e.g., transferring data from email to a spreadsheet via browser and desktop software). Each task is specified by the following components (an illustrative specification follows the list):

  • A machine state configuration (VM snapshot, file staging, app launch, and UI preparation).
  • A detailed execution-based evaluation script tailored to the desired end state. These scripts can check file modification, accessibility tree properties, cookies, or application-specific outcomes.
  • A natural language instruction that encodes the user intent.
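
To make the format concrete, the following is a hypothetical task record written as a Python dict. The field names only loosely mirror the JSON schemas the benchmarks actually use, and the paths, URL, and values are placeholders.

```python
# Hypothetical task record; field names loosely mirror the benchmarks' schemas,
# and the instruction, paths, URL, and expected values are all placeholders.
example_task = {
    "id": "thunderbird-to-calc-001",
    "instruction": "Copy the invoice total from the latest email into budget.ods.",
    "config": [  # machine-state setup applied after restoring the VM snapshot
        {"type": "download", "path": "~/Desktop/budget.ods", "url": "<staging URL>"},
        {"type": "launch", "command": ["libreoffice", "--calc", "~/Desktop/budget.ods"]},
    ],
    "evaluator": {  # execution-based check run against the final VM state
        "func": "compare_spreadsheet_cells",
        "expected": {"sheet": "Sheet1", "cell": "C10", "value": 1280.5},
    },
}
```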

WindowsAgentArena extends this paradigm, supplying over 150 Windows-centric multi-step tasks. Task specifications and environment setup mirror the OSWorld methodology—initial states via VM management, scenario-specific evaluation logic, and agent observations encompassing screenshots, UIA accessibility trees, DOMs for web content, and composite "Set-of-Marks" features extracted through OCR and icon detection (Bonatti et al., 12 Sep 2024).
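
The Set-of-Marks idea itself is straightforward to sketch: detected UI elements are overlaid with numeric marks so the agent can refer to them by ID rather than by raw pixel coordinates. The snippet below is a minimal illustration, assuming bounding boxes are already supplied by an upstream OCR/icon-detection step; it is not the benchmark's own implementation.

```python
from PIL import Image, ImageDraw

def annotate_set_of_marks(screenshot_path, boxes):
    """Overlay numeric marks on detected UI elements (simplified illustration).

    `boxes` is assumed to arrive from upstream OCR / icon detection as
    (left, top, right, bottom) tuples; the real pipeline is more elaborate.
    """
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    marks = {}  # mark id -> clickable center point, used to ground agent actions
    for i, (left, top, right, bottom) in enumerate(boxes):
        draw.rectangle([left, top, right, bottom], outline="red", width=2)
        draw.text((left + 2, top + 2), str(i), fill="red")
        marks[i] = ((left + right) // 2, (top + bottom) // 2)
    return img, marks
```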

Task execution proceeds in an interactive loop: the agent observes its current environment, chooses and applies an action, observes the new state and reward, and continues until a termination token or the step budget is reached. This execution-based, black-box evaluation enforces reproducibility, platform-agnostic benchmarking, and robust multi-step assessment.
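
A minimal sketch of this loop is shown below, with `env` and `agent` standing in for the benchmark's environment wrapper and an LLM/VLM policy; the real OSWorld and WindowsAgentArena interfaces differ in their details.

```python
MAX_STEPS = 15  # step budget; actual budgets vary across benchmark configurations

def run_episode(env, agent, instruction, max_steps=MAX_STEPS):
    """Minimal sketch of the observe-act loop described above.

    `env` and `agent` are stand-ins for the benchmark's environment wrapper and
    an LLM/VLM policy; the real OSWorld / WindowsAgentArena interfaces differ.
    """
    obs = env.reset()                          # restore the VM snapshot, stage files/apps
    for _ in range(max_steps):
        action = agent.act(instruction, obs)   # e.g. a pyautogui command string
        if action in ("DONE", "FAIL"):         # termination tokens
            break
        obs = env.step(action)                 # execute in the VM, capture new observation
    return env.evaluate()                      # execution-based reward in [0, 1]
```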

3. Core Challenges, Evaluation, and Observed Agent Performance

A critical finding from OSWorld and WindowsAgentArena is the pronounced gap in agent performance relative to humans, illuminating fundamental challenges in grounding, operational knowledge, and sequential planning.

  • On OSWorld, even advanced agents like GPT-4, Claude, or Mixtral achieve only ~12–12.5% task success; humans complete >72.3% (Xie et al., 11 Apr 2024).
  • In WindowsAgentArena, the state-of-the-art Navi agent achieves 19.5% (vs. 74.5% human), and the best agents across models rarely exceed 20–22%.
  • Analysis of agent trials identifies three principal deficiencies:
    • GUI Grounding: Agents frequently miscalculate click coordinates, fail to recognize dynamic UI elements, and cannot reliably map visual or a11y inputs to correct actions.
    • Operational Knowledge: Agents lack comprehensive understanding of application workflows, keyboard shortcuts, and error recovery, especially across multi-app or context-switching tasks.
    • Scalability to High-Resolution Observations: 1920×1080 screenshots and multi-thousand-token a11y trees overwhelm context and model capacity.
  • These findings are reinforced by detailed custom evaluation logs and by instance-specific reward scripts that confirm genuine task completion—alerting researchers to frequent failures in interface grounding, step ordering, and error handling.

4. Infrastructure and Platform Differences

OSWorld is architected for OS-agnostic evaluation, featuring rich APIs for both Ubuntu and Windows environments:

| Capability | OSWorld (Ubuntu) | OSWorld-Windows / WindowsAgentArena |
|---|---|---|
| Task Setup | VM snapshots, pyautogui | VM snapshots, pywinauto, UIA |
| File and App I/O | UNIX filesystem, native | Windows filesystem, PowerShell, native |
| Observations | Screenshot, a11y tree | Screenshot, UIA tree, DOM, Set-of-Marks |
| Evaluation | Per-task Python scripts | Per-task Python scripts |
| Domain/Application Scope | ~8 domains, broad | 150+ Windows tasks, Windows-specific |

WindowsAgentArena introduces higher task diversity and exploits cloud infrastructure (Azure ML, Dockerized VM containers) to parallelize cold-start task evaluation, dropping full-benchmark runtime to as little as 20 minutes (Bonatti et al., 12 Sep 2024).
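
A rough illustration of that parallelization pattern is sketched below; the container image name and CLI flags are hypothetical, since the real client ships its own launcher scripts and Azure ML job definitions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical image name and CLI flags; the real WindowsAgentArena client uses
# its own launcher scripts and Azure ML job definitions rather than this call.
IMAGE = "winarena-eval:latest"

def run_task_slice(worker_id, task_file):
    """Run one container that evaluates a slice of the benchmark tasks."""
    return subprocess.run(
        ["docker", "run", "--rm", IMAGE,
         "--worker-id", str(worker_id), "--tasks", task_file],
        capture_output=True, text=True,
    )

# Fan task slices out across parallel workers, each cold-starting its own VM container.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_task_slice, range(8),
                            [f"slice_{i}.json" for i in range(8)]))
```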

Notably, while OSWorld supports cross-OS generalization, comprehensive Windows testing (as in WindowsAgentArena) requires specialized adaptation of task initializers, PowerShell scripting, and nuanced extraction of accessibility information through Windows-native tools (e.g., pywinauto, UIA).
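
As a small illustration of Windows-native accessibility extraction, the sketch below walks a window's UIA subtree with pywinauto; the benchmark's own observation code is considerably richer (bounding boxes, control patterns, filtering, and serialization into the agent prompt).

```python
from pywinauto import Desktop

def dump_uia_tree(window_title, max_depth=3):
    """Print control class and text for a window's UIA subtree (illustrative only)."""
    win = Desktop(backend="uia").window(title=window_title).wrapper_object()

    def walk(elem, depth=0):
        if depth > max_depth:
            return
        print("  " * depth + f"{elem.friendly_class_name()}: {elem.window_text()!r}")
        for child in elem.children():
            walk(child, depth + 1)

    walk(win)

# Example (assumes Notepad is open): dump_uia_tree("Untitled - Notepad")
```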

5. Implications for Multimodal Agent Design and Generalization

OSWorld and WindowsAgentArena experiments expose the limitations of current state-of-the-art multimodal agents and establish research directions for next-generation generalist agents:

  • Spatial and Semantic Grounding: Agents must integrate screenshot vision, a11y trees, DOMs, and inference from tool-specific contexts to reliably ground actions—simple click or text-matching routines are insufficient.
  • Execution-based Interactive Learning: Execution-driven feedback (rather than demonstration or synthetic stepwise datasets) enables agents to practice long-horizon planning, attempt rollbacks, experiment with subgoal decomposition, and learn from intermediate failures.
  • Cross-OS Robustness: Performance correlations (e.g., Pearson r ~0.7 between Ubuntu/Windows tasks in OSWorld) suggest moderate transfer but highlight the need for explicit cross-platform generalization strategies (e.g., interface abstraction, memory-augmented planning); a minimal computation sketch follows this list.
  • Reproducible, Reconfigurable Evaluation: The use of VM snapshots, fine-grained per-example script-based reward functions, and explicit action logging enables rigorous, comparable evaluation across research teams and hardware stacks—contrasting with earlier synthetic or browser-only benchmarks.
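
For reference, the cross-OS correlation mentioned above is simply a Pearson coefficient over matched per-task (or per-domain) success rates. The numbers below are invented purely to show the computation.

```python
import numpy as np

# Invented per-category success rates for matched tasks on each OS; real values
# come from the benchmarks' published per-domain result tables.
ubuntu_success  = np.array([0.15, 0.08, 0.22, 0.05, 0.30])
windows_success = np.array([0.12, 0.10, 0.25, 0.04, 0.28])

r = np.corrcoef(ubuntu_success, windows_success)[0, 1]
print(f"Pearson r between Ubuntu and Windows success rates: {r:.2f}")
```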

6. Comparative Outcomes and Future Directions

While OSWorld provides the reference implementation of the execution-based, multi-OS multitask real-computer agent benchmark, WindowsAgentArena accelerates the research cycle by:

  • Scaling evaluation reliably in large parallel clusters, catalyzing data generation and RL signal enrichment via cloud deployment.
  • Introducing a task subset focused on Windows-specific workflows, facilitating systematic ablation analysis, cross-OS error studies, and domain transfer research.

The persistent performance gap (human >72%, best models frequently <22%) and platform transferability issues direct future work toward:

  • Improved grounding using richer multimodal fusion of screenshots, structured accessibility, and domain-specific priors;
  • Advanced exploration- and memory-augmented architectures capable of multi-stage, self-corrective planning;
  • Standardized benchmarks with dynamic initial states and reproducible validation suites, inspired by OSWorld/WindowsAgentArena and under continual extension by successors such as WorldGUI (Zhao et al., 12 Feb 2025).

7. Technical Formulation and Mathematical Underpinnings

A canonical OSWorld/WindowsAgentArena task is defined as a POMDP instantiated by

$$o_t \in \mathcal{O} = \{\text{screenshot}, \text{a11y tree}, \text{terminal output}\}, \qquad a_t \in \mathcal{A} = \{\text{click}, \text{type}, \text{drag}, \ldots\}$$

$$\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$$

where the reward $\mathcal{R}$ is computed by per-example scripts extracting latent signals (file diffs, UI trees, background process completion) in the VM. This POMDP formalism secures agent generality, enabling reproducible, multi-stage task evaluation and providing a shared language for comparing agent and human trajectories across both OSWorld and WindowsAgentArena.
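
As an illustration of how such a per-example reward script might look (this is not an actual benchmark evaluator; the file name, cell, and expected value are hypothetical):

```python
from pathlib import Path
from openpyxl import load_workbook  # assumes a spreadsheet-centric task for illustration

def evaluate_spreadsheet_task(vm_shared_dir: str) -> float:
    """Illustrative per-example reward script (not an actual benchmark evaluator).

    Checks that the agent produced the expected file and wrote the expected
    value into a target cell, returning a score in [0, 1].
    """
    target = Path(vm_shared_dir).expanduser() / "report.xlsx"  # hypothetical output file
    if not target.exists():
        return 0.0                                   # no output produced at all
    sheet = load_workbook(target).active
    return 1.0 if sheet["B2"].value == 1280.5 else 0.5  # partial credit: file exists, wrong value
```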


OSWorld and WindowsAgentArena collectively underpin the current benchmark standard for multimodal, real-computer agent research. Through realistic task design, execution-based evaluation, and rigorous cross-OS protocol alignment, these platforms expose and quantify critical gaps in contemporary agent capability, driving innovation toward robust, human-level computer use agents.
