OSWorld Benchmark Evaluation
- OSWorld Benchmark is a suite of interactive evaluation protocols using real VMs, supporting diverse computer tasks across multiple OS platforms.
- It features execution-based, automated measurement of agent performance on multimodal tasks, ensuring reproducible and detailed assessments.
- The benchmark drives advances in hierarchical planning, reinforcement learning, and safety analysis, guiding the development of robust digital agents.
The OSWorld Benchmark is a suite of real-computer-based evaluation protocols and task environments for assessing generalist digital agents, especially those leveraging large language models (LLMs) or vision-language models (VLMs), on open-ended, multimodal computer-use tasks. Unlike prior benchmarks that operate in confined or simulated domains, OSWorld provides a scalable, interactive virtual machine environment, supporting execution-based evaluation and automated, reproducible measurement of agent performance on arbitrary desktop and web applications. The benchmark has catalyzed advances in agent architectures, hierarchical planning, reinforcement learning on real UIs, safety assessment, and temporal efficiency analysis.
1. Problem Scope and Motivation
OSWorld was created to address critical deficiencies in previous agent benchmarks: the lack of interactive, executable environments and the limited diversity or scalability of task coverage (Xie et al., 11 Apr 2024). Existing benchmarks often use pre-recorded demonstrations, synthetic or web-only settings, or restrict tasks to isolated UI actions, thus failing to reflect the full complexity and heterogeneity of human-computer interaction. OSWorld’s central aim is to enable rigorous, reproducible evaluation of agents on arbitrary computer tasks, encompassing system utilities, web workflows, office suites, multimedia operations, coding, and multi-application integration, across multiple operating systems (primarily Ubuntu, with experimental support for Windows and macOS).
This expanded scope reflects the need for agents to solve real-world challenges such as app interoperability, GUI grounding, multi-modal observation/action, and long-horizon reasoning—in effect, to advance the state of multimodal generalist agents capable of automating authentic human-computer activity.
2. Task Design, Environment, and Evaluation Protocol
At its core, OSWorld provides a scalable, virtualized environment where agents interact with a real desktop via a Partially Observable Markov Decision Process (POMDP) abstraction (Xie et al., 11 Apr 2024). Each task is precisely defined by:
- Initial state setup: Detailed VM snapshot including OS, pre-opened applications, and pre-populated files to replicate realistic starting conditions.
- Observation space: High-resolution screenshots (e.g., 1920×1080), accessibility trees (XML-formatted structural representations), natural language instructions, and terminal logs.
- Action space: Formalized control via pyautogui-style keyboard and mouse primitives, plus special tokens (WAIT, FAIL, DONE).
- Execution-based evaluation: For each of the 369 tasks (plus Windows variants), a custom script inspects the resulting state (output artifacts, application settings, web cookies, etc.) to determine task success, enabling reproducible assessment of functional completion that does not depend on a single reference action trajectory.
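To make these conventions concrete, the following minimal sketch shows the rough shape of a task definition and of the pyautogui-style action strings an agent emits. The field names, task id, file paths, and evaluator hook are illustrative assumptions, not the exact OSWorld schema.

```python
# Illustrative sketch only: field names are assumptions, not the exact OSWorld
# task schema. It shows the shape of a task definition (initial state,
# instruction, evaluator hook) and of the pyautogui-style action strings an
# agent emits, including the special control tokens.

example_task = {
    "id": "libreoffice_calc_sum_column",           # hypothetical task id
    "instruction": "Sum column B and write the result into cell B11.",
    "snapshot": "ubuntu_libreoffice_calc.vmdk",    # VM snapshot to restore
    "config": [                                    # initial-state setup steps
        {"type": "open_file", "path": "/home/user/budget.ods"},
    ],
    "evaluator": {                                 # execution-based check
        "func": "compare_cell_value",
        "expected": {"cell": "B11", "value": 4270},
    },
}

# Agent actions are pyautogui-style primitives serialized as code strings,
# plus special tokens for flow control.
actions = [
    "pyautogui.click(x=812, y=430)",
    "pyautogui.typewrite('=SUM(B1:B10)', interval=0.05)",
    "pyautogui.press('enter')",
    "WAIT",   # ask the environment to idle before the next screenshot
    "DONE",   # agent believes the task is complete
]

for a in actions:
    print(a)
```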
Task coverage is intentionally broad, spanning OS-level actions (installing, configuring, cleaning), productivity applications (LibreOffice Writer/Calc/Impress), browser tasks (Chrome web automation, cookie management), multimedia usage (VLC), email (Thunderbird), code editing (VS Code), and complex workflows that require data movement and action coordination across applications.
The open source release includes VM images, configurations, baseline agent implementations, and all evaluation scripts—enabling direct replication and community extension.
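For a sense of how execution-based grading works, here is a minimal evaluator sketch in the spirit of OSWorld's per-task scripts (an illustrative example, not one of the released evaluation functions): it grades functional completion from the resulting state rather than from the agent's action sequence, so any correct strategy passes.

```python
# A minimal sketch of an execution-based evaluator in the spirit of OSWorld's
# per-task scripts (not one of the actual released functions). It checks the
# resulting state of the VM, not the agent's actions.

from pathlib import Path

def evaluate_export_pdf(result_dir: str) -> float:
    """Return 1.0 if the agent produced a non-empty PDF in result_dir, else 0.0."""
    pdf_files = list(Path(result_dir).glob("*.pdf"))
    if pdf_files and pdf_files[0].stat().st_size > 0:
        return 1.0
    return 0.0

if __name__ == "__main__":
    print(evaluate_export_pdf("/home/user/Documents"))  # hypothetical path
```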
3. Agent Architectures and Baseline Evaluation
Multiple classes of LLM/VLM-based agents have been benchmarked using OSWorld (Xie et al., 11 Apr 2024, Agashe et al., 10 Oct 2024, Lei et al., 16 May 2025, He et al., 20 May 2025, Lu et al., 22 May 2025, Zhang et al., 28 May 2025):
- Baselines: Direct LLM or VLM prompting (e.g., GPT-4V, Gemini-ProV, Claude-3 Opus, Mixtral, CogAgent) achieves only modest success. For instance, while humans succeed on 72.36% of tasks, leading agents attain only ~12.24% (GPT-4V), with pronounced deficits in GUI grounding and multi-app workflow reasoning.
- Hierarchical and modular agents: Agent S leverages an experience-augmented hierarchical planning framework, combining external web knowledge, episodic and narrative memory, and an Agent-Computer Interface (ACI) with explicit multimodal inputs (screenshots and a11y trees), achieving a new state-of-the-art success rate of 20.58% (GPT-4o) (Agashe et al., 10 Oct 2024). InfantAgent-Next employs a modular, multi-tool pipeline with iterative visual grounding and context-tagged memory, achieving 7.27%, outperforming comparable monolithic approaches (Lei et al., 16 May 2025).
- Efficient training and RL-based methods: PC Agent-E demonstrates that careful data augmentation (thought reconstruction, trajectory boosting from a few hundred expert demonstrations) enables remarkable training efficiency, improving OSWorld success from 11.1% to 14.9% (He et al., 20 May 2025). ARPO applies end-to-end reinforcement learning (through augmented Group Relative Policy Optimization with replay buffer and task selection), reaching 29.9% on major OSWorld variants (Lu et al., 22 May 2025).
- Plug-and-play modules: UI-Evol introduces knowledge evolution: by reconstructing objective action sequences post-execution (via change detection in consecutive screenshots) and critiquing initial web knowledge with chain-of-thought LLM reasoning, it increases both OSWorld success (from 19.5% to 22.0%) and reliability (lower behavioral std deviation) (Zhang et al., 28 May 2025).
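As a sketch of the group-relative baseline at the core of GRPO-style training (which ARPO augments with a replay buffer and task selection, both omitted here), each rollout's sparse task reward is normalized against the statistics of its rollout group:

```python
# Minimal sketch of a group-relative advantage computation, assuming each
# rollout of the same task receives a scalar task-success reward. ARPO's
# augmentations (replay buffer, task selection) are not shown.

from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its group's mean and std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts of the same task, only one succeeded.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```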
These experiments collectively demonstrate that success on OSWorld remains challenging for current AI architectures, with error modalities including failed visual grounding, misaligned action planning, and brittle workflow decomposition.
4. Safety, Efficiency, and Auxiliary Benchmarks
OSWorld has driven the development of new benchmarks and analysis protocols targeting agent safety, efficiency, and robustness (Kuntz et al., 17 Jun 2025, Abhyankar et al., 19 Jun 2025):
- OS-Harm: Focuses on agent safety across deliberate misuse, prompt injection attacks, and model misbehavior. It introduces 150 tasks grouped into deliberate user misuse (e.g., harassment, disinformation), prompt injection (including code, email, site-based, and notification attacks), and open-ended misbehavior. OS-Harm integrates an LLM-based automated judge, achieving high agreement with human annotation (F1 ~0.76–0.79), and finds that all tested models (e.g., Claude 3.7 Sonnet) exhibit concerning rates of harmful compliance, especially to direct misuse and static prompt injections.
- OSWorld-Human: Addresses agent efficiency by providing human-generated minimal trajectories (“gold paths”) for each task. Evaluation of 16 state-of-the-art agents against these gold paths reveals that even the best agents take 1.4–2.7 times more steps than necessary, with dominant latency incurred in planning/reflection LLM calls (75–94% of wall time), and later steps further slowed by cumulative prompt expansion. The Weighted Efficiency Score (WES) is introduced to jointly measure task completion and resource usage: $\mathrm{WES} = s \cdot (h/a) - (1 - s) \cdot (a/a_{\max})$, where $h$ is the minimal (human) number of steps, $a$ is the number of agent steps, $a_{\max}$ is the maximum step budget, and $s \in \{0, 1\}$ is the success indicator.
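A small sketch computing the per-task WES exactly as stated above (note that this follows the variable definitions given in this section, not code from the OSWorld-Human release):

```python
# Per-task WES as defined above: reward step-efficient successes, penalize
# failures by the fraction of the step budget consumed. Variable names are
# illustrative; the exact formulation is taken from the text above.

def weighted_efficiency_score(success: bool, human_steps: int,
                              agent_steps: int, max_steps: int) -> float:
    """Jointly score task completion and step efficiency for one task."""
    if success:
        return human_steps / agent_steps
    return -(agent_steps / max_steps)

print(weighted_efficiency_score(True, human_steps=6, agent_steps=14, max_steps=50))
print(weighted_efficiency_score(False, human_steps=6, agent_steps=50, max_steps=50))
```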
This multifaceted benchmarking highlights that not only is accuracy a limiting factor for current agents, but latency, resource use, and safety vulnerabilities remain unresolved bottlenecks to deployment.
5. Methodological Innovations and Technical Architecture
The technical rigor of OSWorld is anchored by (1) its use of executable, real-VM task setups, (2) high-fidelity multimodal representations (joint screenshots, accessibility trees, and logs), and (3) execution-based evaluation scripts unique to each task, supporting both deterministic grading and tolerance for multiple correct agent strategies (Xie et al., 11 Apr 2024).
- POMDP formalism: Tasks are modeled as partially observable Markov decision processes, with agents mapping sequences of multimodal observations to executable UI actions and receiving a reward upon successful completion.
- Task construction and annotation: Each of the 369 tasks and 134 evaluation functions was hand-crafted and cross-checked by domain experts (~1800 person-hours), ensuring diversity and realism (including single-app, multi-app, and open-domain workflows).
- Automated evaluation: Custom scripts query app state, file outputs, and system configs to confirm completion. For safety and nuanced evaluation (e.g., OS-Harm), LLM-powered semantic judges are used to automate compliance inspection and are shown to match human judgments at high consistency.
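Schematically, the interaction loop looks like the following sketch; the class and method names (env.reset/step/evaluate, agent.predict) are illustrative stand-ins rather than the exact OSWorld API.

```python
# A schematic sketch of the POMDP-style loop described above. Method names
# are illustrative stand-ins, not the exact OSWorld API.

def run_episode(env, agent, max_steps: int = 15) -> float:
    """Roll out one task: map multimodal observations to UI actions until DONE/FAIL."""
    obs = env.reset()                      # screenshot + a11y tree + instruction
    for _ in range(max_steps):
        action = agent.predict(obs)        # e.g. "pyautogui.click(x=..., y=...)"
        if action in ("DONE", "FAIL"):     # special tokens end the episode
            break
        obs = env.step(action)             # execute in the VM, observe new state
    return env.evaluate()                  # execution-based 0/1 reward
```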
A key methodological insight is that this architecture allows for the fair comparison of agent strategies (not limited to prescriptive step lists), supports generalization analysis across OS platforms, and exposes error distribution at fine granularity (per-step, per-task, per-domain).
6. Impact, Extensions, and Future Directions
OSWorld has rapidly become the standard for the evaluation of digital agents in real computer environments, underpinning dozens of architectural innovations (Xie et al., 11 Apr 2024, Agashe et al., 10 Oct 2024, Lei et al., 16 May 2025, He et al., 20 May 2025, Lu et al., 22 May 2025, Zhang et al., 28 May 2025, Kuntz et al., 17 Jun 2025, Abhyankar et al., 19 Jun 2025). Its impacts include:
- Comparative agent benchmarking: Facilitates fair and reproducible assessment of agent capabilities in open-ended computer use, underpinning scientific progress by exposing true error modalities.
- Catalyst for related testbeds: Motivated broader safety (OS-Harm), temporal/efficiency (OSWorld-Human), knowledge refinement (UI-Evol), and formal RL (ARPO) benchmarks/extensions.
- Guidance for agent design: Error/latency analysis (e.g., GUI grounding bottlenecks, planning latency expansion) guides future research on multimodal perception, hierarchical plan compression, memory/context management, efficiency via prompt engineering, and RL with delayed feedback.
- Interoperability and open science: The open-source release of code, data, VM images, and scripts supports community adoption, cross-institution replication, and extension; documentation at https://os-world.github.io.
Extensions focus on improved cross-OS support, granular error reporting, refinement of LLM observer/controller loops, learning reward models for more efficient RL, and direct safety alignment (e.g., I/O sandboxing, oversight, prompt-injection defense).
A plausible implication is that further progress may require advancing not only agent model architectures but also meta-learning and trajectory compression to address the efficiency and latency gap with human-level operation identified by OSWorld-Human (Abhyankar et al., 19 Jun 2025).
7. Summary Table: Core Attributes of OSWorld
| Dimension | Characteristic | Source/Role |
|---|---|---|
| Environment | Real VM (Ubuntu, Windows, macOS), full-stack OS, interactive | (Xie et al., 11 Apr 2024) |
| Task coverage | 369 standard tasks; web/office/media/dev apps; multi-app workflows | (Xie et al., 11 Apr 2024) |
| Observation | High-res screenshots, accessibility trees, terminal logs | (Xie et al., 11 Apr 2024) |
| Action space | pyautogui-style keyboard/mouse, structured commands, special tokens | (Xie et al., 11 Apr 2024) |
| Evaluation | Execution-based scripts per task; LLM-based judges for safety | (Xie et al., 11 Apr 2024, Kuntz et al., 17 Jun 2025) |
| Success metrics | Task completion, safety (OS-Harm), efficiency (WES, OSWorld-Human) | (Xie et al., 11 Apr 2024, Kuntz et al., 17 Jun 2025, Abhyankar et al., 19 Jun 2025) |
| Open source | Code, data, VMs, models; extensible and community-driven | https://os-world.github.io |
OSWorld thus sets a rigorous standard for the holistic evaluation of real-world computer use agents, bridging prior gaps in authenticity, reproducibility, diversity, safety, and temporal efficiency, while offering a robust platform for the continued advancement of digital agent research.