OSWorld: Benchmark for Multimodal Agents
- OSWorld is a scalable, execution-driven benchmark that evaluates autonomous agents performing open-ended desktop tasks using human-like interactions.
- It incorporates diverse tasks across Ubuntu, Windows, and macOS, mimicking complex digital workflows with realistic input streams and in-progress states.
- Execution-based evaluation metrics reveal significant gaps in agent performance, highlighting challenges in GUI grounding, operational knowledge, and long-horizon planning.
OSWorld is a comprehensive and execution-driven benchmarking suite for evaluating the capabilities of multimodal autonomous agents in realistic computer environments. It establishes a unifying platform for assessing open-ended desktop tasks conducted via raw human-like interactions—mouse and keyboard control—across multiple operating systems. By synthesizing both the complexity and diversity of real-world digital workflows, and combining them with robust, execution-based evaluation, OSWorld has become a de facto benchmark for agentic research in general computer use.
1. Definition, Scope, and Motivation
OSWorld is defined as the first scalable, real computer environment and benchmark for multimodal agents capable of performing open-ended computer tasks with minimal human intervention. The environment supports full desktop control in Ubuntu (with extensions to Windows and macOS) using realistic input streams (e.g., pyautogui-driven mouse and keyboard events), permits initialization from rich “in-progress” states, and provides multimodal observations (screenshots, accessibility (a11y) trees, and terminal outputs). It accommodates a wide spectrum of tasks, from file I/O and browser automation to complex multi-application workflows.
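The interaction pattern is a standard observe-act loop: the agent receives a multimodal observation, emits a raw pyautogui command, and the environment executes it inside the VM. The following is a minimal sketch of that loop; the class and method names are illustrative assumptions, not the exact OSWorld API.

```python
# Illustrative observe-act loop for an OSWorld-style desktop environment.
# Class and method names are assumptions for exposition, not the exact API.

class DesktopEnvSketch:
    """Stand-in for a VM-backed desktop environment."""

    def reset(self, task_config: dict) -> dict:
        # Restore the VM snapshot, apply the task's setup steps, and return
        # the first multimodal observation.
        return {"screenshot": b"", "a11y_tree": "<root/>", "terminal": ""}

    def step(self, action: str) -> tuple[dict, bool]:
        # Execute a raw pyautogui command string inside the VM and return
        # the next observation plus a done flag.
        return {"screenshot": b"", "a11y_tree": "<root/>", "terminal": ""}, False


def run_episode(env: DesktopEnvSketch, agent, task_config: dict, max_steps: int = 15) -> None:
    obs = env.reset(task_config)
    for _ in range(max_steps):
        # The agent maps screenshot + a11y tree to an executable action,
        # e.g. "pyautogui.click(312, 540)" or "pyautogui.write('report')".
        action = agent.predict(obs)
        if action == "DONE":
            break
        obs, done = env.step(action)
        if done:
            break
```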
The central motivation for OSWorld’s design is its recognition of the insufficiency of prior benchmarks, which generally restricted agents to static, simulated, or domain-specific applications (such as web-only or mobile UI simulators). These earlier efforts impeded investigation into agent scalability and multimodality due to their limited operational context and lack of execution-grounded evaluation.
2. Benchmark Construction and Evaluation Protocol
Task Suite and Configuration
The main OSWorld benchmark comprises 369 hand-annotated tasks on Ubuntu and an additional set of 43 tasks for Windows. Task diversity includes but is not limited to:
- Single-application operations (LibreOffice, VLC, Thunderbird, Chrome, VS Code, GIMP)
- Integrated multi-app workflows (cross-application pipelines)
- OS-level administration and file management
- A small proportion (≈8%) of infeasible scenarios, testing an agent's capacity for recognizing unachievable requests
Each task is initialized from a reproducible state using a hybrid VM configuration that combines base OS snapshots, file manipulations, and scripted GUI actions or commands (e.g., resizing windows, launching applications). This process ensures agents are tested on realistic partial-completion scenarios, mirroring how humans encounter tasks mid-flow.
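For concreteness, a task definition of this kind can be pictured as a snapshot reference plus setup steps plus an evaluator specification. The sketch below expresses such a configuration as a Python dict; the field names approximate the idea and are not the exact OSWorld schema.

```python
# Hypothetical task configuration (illustrative field names, not the exact
# OSWorld schema): restore a snapshot, script the "in-progress" state, and
# declare how the outcome is checked.
task_config = {
    "id": "chrome-delete-cookies-example",
    "snapshot": "ubuntu_chrome_base",  # base VM image to restore
    "instruction": "Delete all cookies stored for amazon.com in Chrome.",
    "setup": [
        {"type": "launch", "command": ["google-chrome"]},
        {"type": "execute", "command": ["python3", "seed_cookies.py"]},
    ],
    "evaluator": {
        "getter": "get_cookie_data",
        "func": "is_cookie_deleted",
        "rule": {"type": "domains", "domains": [".amazon.com"]},
    },
}
```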
Execution-Based Evaluation
Each OSWorld task includes a custom script that executes after agent completion to validate the actual outputs using a suite of 134 unique evaluation functions. These functions access ground-truth data through getter utilities (e.g., retrieving file contents, a11y tree states, browser cookies) and are combined with logical evaluators that flexibly support alternative correct paths. The following sample illustrates their use:
```python
cookie_data = get_cookie_data(env)
rule = {"type": "domains", "domains": [".amazon.com"]}
assert is_cookie_deleted(cookie_data, rule)
```
This execution-based approach, as opposed to static log or action-step comparison, delivers reliable and reproducible outcome metrics, and supports evaluation of generalization in agentic solution space.
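To make the notion of alternative correct paths concrete, the sketch below pairs a getter with a logical evaluator that accepts either of two valid outcomes. The helper functions are hypothetical stand-ins for OSWorld's getter and evaluator utilities.

```python
# Sketch of an execution-based check that accepts alternative correct
# solutions. The helpers are hypothetical stand-ins for OSWorld's getter
# and evaluator utilities.

def get_file_text(env, path: str) -> str:
    # Getter: pull the file's contents out of the VM after the agent finishes.
    return env.read_file(path)

def match_any(value: str, patterns: list[str]) -> bool:
    # Logical evaluator: succeed if the result matches any accepted form.
    return any(p in value for p in patterns)

def evaluate(env) -> float:
    report = get_file_text(env, "/home/user/report.txt")
    # Either phrasing counts as a correct completion of the task.
    return 1.0 if match_any(report, ["Total: 42", "Sum = 42"]) else 0.0
```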
3. Performance Analysis of State-of-the-Art Agents
Task Success Rates and Comparative Metrics
Extensive OSWorld benchmarks show a profound agent–human gap:
- Humans: 72.36% average success rate
- Best-performing agent baseline (GPT-4-based): 12.24% success rate, with other LLM/VLM baselines (e.g., Mixtral, CogAgent) scoring lower
Performance varies with the observation space: combining screenshots with filtered a11y trees for grounding yields modest improvements for some architectures, but agent performance remains far below the human baseline across all task categories.
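One reason the filtered a11y-tree input helps is that the raw tree is large and mostly irrelevant; a typical pipeline prunes it to visible, interactive elements and linearizes them for the prompt. The sketch below illustrates that idea under assumed node attributes and does not follow any specific agent implementation.

```python
# Minimal sketch: prune an accessibility tree (serialized as XML) to visible,
# interactive nodes and linearize it for a language-model prompt. The node
# roles and attribute names are assumptions for illustration only.
from xml.etree import ElementTree as ET

INTERACTIVE_ROLES = {"push-button", "menu-item", "text", "check-box", "link"}

def linearize_a11y_tree(xml_text: str, max_nodes: int = 50) -> str:
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.iter():
        role = node.get("role", "")
        name = (node.get("name") or "").strip()
        visible = node.get("visible", "true") == "true"
        if role in INTERACTIVE_ROLES and name and visible:
            # Keep screen coordinates so the agent can ground its clicks.
            lines.append(f"{role} '{name}' at ({node.get('x')}, {node.get('y')})")
        if len(lines) >= max_nodes:
            break
    return "\n".join(lines)
```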
Specific Failure Modes
Three principal challenges are consistently observed:
- GUI Grounding: Agents mispredict click coordinates, make repeated misclicks, and fail to robustly map high-level instructions to precise GUI actions—especially in cluttered, dynamic interfaces.
- Operational Knowledge: Agents lack robust command of application semantics, resulting in unproductive trial-and-error (e.g., using the wrong menu or toggling incorrect settings).
- Long-Horizon Planning: Success rates sharply drop for workflows spanning more steps or multiple software tools, reflecting difficulties in memory management, context tracking, and subgoal decomposition.
The library of flexible, multi-modal evaluation scripts also exposes systematic errors not observable with simpler benchmarks, such as failures to recover from environment noise (unexpected dialogs, application state anomalies).
4. Insights and Influence on Agent Research
OSWorld has foundational implications for agent research:
- Execution-based evaluation reveals that high-performing vision-language and instruction-following LLMs, while impressive on static web or language tasks, lack key abilities for digital assistance: grounding, strategic error recovery, and robust long-horizon reasoning.
- The analysis supports a shift toward architectures integrating richer visual perception (screenshots + a11y trees), explicit GUI grounding modules, and memory or reflection mechanisms.
- By enabling in-depth introspection (systematic exposure of misclicks, step reversals, and planning errors), OSWorld’s design motivates advances in grounding precision, operational knowledge representation, and interactive corrective feedback during runtime.
Innovations such as experience-augmented hierarchical planning (Agashe et al., 10 Oct 2024), multi-agent collaboration (Jia et al., 24 Oct 2024), and hybrid GUI+code control (Song et al., 5 Aug 2025) have all leveraged OSWorld for evaluation, directly linking benchmark structure to methodological progress.
5. Benchmark Extensions and Related Initiatives
OSWorld has catalyzed a series of related efforts:
- OSWorld-G: A detailed GUI grounding suite with 564 annotated samples across text matching, element recognition, layout understanding, and fine-grained manipulation. It is paired with the Jedi dataset for large-scale grounding model training (Xie et al., 19 May 2025).
- OSWorld-Human: A manually curated gold-standard set of human action trajectories for every OSWorld task, establishing efficiency baselines and enabling evaluation of agent temporal performance—revealing that top agents take 1.4–2.7× more steps than necessary (Abhyankar et al., 19 Jun 2025).
- OS-Harm: A safety benchmark adding 150 tasks that test deliberate misuse, prompt injection, and model misbehavior in the OSWorld environment, along with an automated LLM judge for scoring safety and compliance (Kuntz et al., 17 Jun 2025).
These extensions collectively enable research on GUI grounding, behavioral efficiency, safety compliance, and meta-level evaluation (introspection via narratives or chain-of-thought).
6. Access, Resources, and Community Use
All core OSWorld resources are publicly available:
- Full codebase and documentation
- Virtual machine images and task setup scripts
- Evaluation functions and baseline agent code
The benchmark can be cloned, extended, or adapted to new operating systems and applications. Researchers are encouraged to contribute task additions, alternative evaluations, and agent implementations. Community adoption is evidenced by wide usage in published works on agentic scaling (Gonzalez-Pumariega et al., 2 Oct 2025), hierarchical planning (Agashe et al., 10 Oct 2024), multi-agent orchestration (Jia et al., 24 Oct 2024), GUI grounding (Xie et al., 19 May 2025), reinforcement learning (Lu et al., 22 May 2025), and more.
7. Significance and Outlook
OSWorld has established itself as a canonical testbed for developing and measuring progress in multimodal, generalist AI agents for open-ended desktop use. By pairing execution-based, reproducible, and rigorous evaluation with a wide, application-centric task spectrum, it foregrounds the next research frontiers: robust grounding, memory augmentation, dynamic feedback, and safety. Its continued evolution and extensions (e.g., in safety, efficiency, and generalization) ensure that it will remain central to the study and deployment of autonomous computer-use agents for the foreseeable future.