OSWorld: Multimodal Agent Benchmark
- OSWorld is a large-scale, execution-based research environment designed for evaluating multimodal agents on real desktop workflows.
- It integrates VM-based simulation, multimodal observation, and precise action execution to support diverse GUI and CLI tasks.
- The platform offers rigorous, reproducible metrics and extensive task suites, enabling detailed analysis of agent performance and efficiency.
OSWorld is a large-scale, execution-based research environment and benchmark designed for the evaluation and development of multimodal computer-use agents—autonomous systems tasked with completing real desktop-computing workflows via live GUI and CLI interaction. OSWorld exposes agents to diverse and realistic desktop tasks on actual operating systems, quantifying success through programmatic, execution-based validation. The environment supports scalable, reproducible research on open-ended computer control, encompassing vision-language grounding, sequential planning, action execution, tool invocation, and efficiency measurement on tasks that require substantive understanding of desktop applications and OS workflows (Xie et al., 11 Apr 2024, Abhyankar et al., 19 Jun 2025, Andreux et al., 22 Oct 2025).
1. System Architecture and Environment Formalism
OSWorld instantiates a fully executable system architecture comprising the following core components:
- Coordinator/Environment Server: Manages task configuration and orchestrates VM-based desktop instances for each agent episode.
- VM-Based Simulator: Executes a real OS (Ubuntu, Windows, or macOS) in virtualized environments (e.g., QEMU, VirtualBox) with automated scripting for state resets, file I/O, and application launch (Xie et al., 11 Apr 2024).
- Multimodal Observation Interface: Delivers full-resolution screenshots (RGB, up to 1920×1080) and, optionally, accessibility trees (A11y) or Set-of-Marks overlays. No underlying DOM or widget trees are assumed at inference time in the standard protocol (Abhyankar et al., 19 Jun 2025, Li et al., 28 Sep 2025).
- Action Execution Layer: Accepts and issues atomic GUI actions—mouse pointer events, keyboard/shortcut inputs, scrolls, drags, and, for command-line tasks, direct CLI text input. Actions are executed inside the VM via automation APIs such as pyautogui or xdotool, with precise low-level control.
- Post-Processing & Evaluation: After agent termination or step-exhaustion, the environment collects artifacts (files, logs, a11y trees) and executes custom Python scripts for exact, reproducible task validation (Xie et al., 11 Apr 2024).
Formally, the environment is modeled as a finite-horizon Markov decision process with visual state space $\mathcal{S}$ (screenshots), atomic action space $\mathcal{A}$, and a sparse terminal reward $r$ assigned by programmatic validators (Li et al., 28 Sep 2025, Yan et al., 17 Dec 2025).
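The components above compose into a simple reset–act–step–evaluate loop per episode. The sketch below is a minimal, Gym-style illustration under assumed interface names (`reset`, `step`, `evaluate`) and an assumed observation layout; it is not the exact OSWorld API.

```python
# Minimal sketch of one OSWorld episode. Class/method names and the observation
# dictionary are assumptions modeled on the architecture described above.

MAX_STEPS = 15  # per-episode step cap (illustrative value)

def run_episode(env, agent, task_config):
    obs = env.reset(task_config)              # revert VM snapshot, apply initial files/windows
    for _ in range(MAX_STEPS):
        action = agent.act(
            screenshot=obs["screenshot"],      # full-resolution RGB frame
            instruction=task_config["instruction"],
        )
        obs, done, info = env.step(action)     # executed inside the VM (e.g., via pyautogui)
        if done:                               # agent emitted a terminal action or the task ended
            break
    # Execution-based validation: a task-specific Python script inspects files,
    # logs, and application state in the VM and returns the (sparse) reward.
    return env.evaluate()
```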
2. Task Distribution, Domains, and Dataset Design
The benchmark task suite comprises 369 real-world computer-use scenarios:
- Domains and Applications: Coverage spans Chromium (browser), LibreOffice (Writer, Calc, Impress), GIMP (image editing), VS Code (IDE), Thunderbird (email), VLC (media player), OS-level command-line and file tasks, and multi-application workflows (Xie et al., 11 Apr 2024, Abhyankar et al., 19 Jun 2025).
- Task Specification: Each task presents a natural-language goal and an optional "source" with explicit procedural steps. States are initialized with files, documents, or window arrangements to simulate realistic mid-workflow entry points.
- Categories:
- Office/app productivity (document, spreadsheet, presentation editing)
- System/file management (navigation, renaming, scripting)
- Cross-application pipelines (e.g., extract data, insert into spreadsheet, email as attachment)
- GUI and CLI interaction modes; Windows and macOS variants exist but Ubuntu is primary (Yan et al., 17 Dec 2025).
- Data Provenance: Tasks are sourced from user Q&A, forums, documentation, prior benchmarks (e.g., NL2Bash), and include infeasible and adversarial cases to stress generalization (Xie et al., 11 Apr 2024).
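A task specification of the kind described above can be pictured as a small declarative record pairing an instruction with setup actions and a validator. The sketch below is hypothetical; field names and values are illustrative and do not reproduce OSWorld's actual configuration schema.

```python
# Hypothetical task configuration in the spirit of OSWorld's execution-based tasks.
task_config = {
    "id": "libreoffice_calc_0042",
    "instruction": "Sum the 'Revenue' column in report.xlsx and write the "
                   "total into cell B10.",
    "setup": [
        {"type": "download_file", "path": "~/Desktop/report.xlsx"},
        {"type": "launch", "command": "libreoffice --calc ~/Desktop/report.xlsx"},
    ],
    "evaluator": {
        "func": "check_cell_value",   # validator script run after the episode
        "args": {"file": "~/Desktop/report.xlsx", "cell": "B10", "expected": 48250},
    },
}
```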
In the OSWorld-G variant, 564 samples with full bounding box annotation support the isolated study of GUI grounding across 32 UI element types and five interaction classes ("Text Matching", "Element Recognition", "Layout Understanding", "Fine-grained Manipulation", "Refusal") (Xie et al., 19 May 2025, Yang et al., 8 Jul 2025).
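Grounding accuracy on such bounding-box-annotated samples is commonly scored by testing whether the predicted click point falls inside the target element's box. The helper below is a minimal sketch; the function name and tuple ordering are assumptions, not the OSWorld-G evaluation code.

```python
def grounding_hit(pred_xy, bbox):
    """Return True if a predicted click (x, y) lands inside the target
    element's bounding box given as (x_min, y_min, x_max, y_max)."""
    x, y = pred_xy
    x_min, y_min, x_max, y_max = bbox
    return x_min <= x <= x_max and y_min <= y <= y_max

def grounding_accuracy(samples):
    """samples: list of (pred_xy, bbox) pairs for annotated UI elements."""
    return sum(grounding_hit(p, b) for p, b in samples) / len(samples)
```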
3. State, Action, and Observation Spaces
- State and Observations: At each timestep $t$, the agent receives $o_t = (s_t, g, h_t)$, where $s_t$ is the screenshot, $g$ is the task instruction, and $h_t$ is the action/observation history (textualized and, for some models, including prior screenshot embeddings) (Li et al., 28 Sep 2025, Yan et al., 17 Dec 2025).
- Action Space: Atomic GUI primitives (mapped to automation-library calls inside the VM; see the dispatcher sketch after this list) include:
  - Mouse: click(x, y), double_click(x, y), drag(x1, y1 → x2, y2)
  - Keyboard: type(str), hotkey(combo), input_text(...)
  - Gestures: scroll(dir), move_to(x, y), wait()
  - Window/app control: awake() (Yan et al., 17 Dec 2025)
  - Meta-actions: finished(result), terminate(status)
- For OSWorld-MCP, higher-level tool invocation via the Model Context Protocol (MCP) is also available, exposing JSON-RPC calls to OS/application APIs (Jia et al., 28 Oct 2025).
- Dynamics: Each action executes in the live OS VM, deterministically transitions the visible state, and renders a new screenshot for the subsequent timestep. In GUI tasks, actions are irreversible; speculative look-ahead is infeasible (Yang et al., 8 Jul 2025).
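Inside the VM, the atomic primitives listed above typically reduce to calls into an automation library such as pyautogui. The dispatcher below is a minimal sketch; the action dictionary format is an assumption, though the pyautogui calls themselves are standard.

```python
import pyautogui

def execute(action):
    """Map an atomic GUI action to pyautogui calls inside the VM.
    The action dict layout here is illustrative, not OSWorld's exact schema."""
    kind = action["type"]
    if kind == "click":
        pyautogui.click(action["x"], action["y"])
    elif kind == "double_click":
        pyautogui.doubleClick(action["x"], action["y"])
    elif kind == "drag":
        pyautogui.moveTo(action["x1"], action["y1"])
        pyautogui.dragTo(action["x2"], action["y2"], duration=0.5)
    elif kind == "type":
        pyautogui.typewrite(action["text"], interval=0.02)
    elif kind == "hotkey":
        pyautogui.hotkey(*action["keys"])    # e.g., ["ctrl", "s"]
    elif kind == "scroll":
        pyautogui.scroll(action["amount"])   # positive = up, negative = down
    elif kind == "wait":
        pass                                  # no-op; environment simply advances
```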
4. Evaluation Protocols and Efficiency Metrics
- Task Success: The canonical metric is binary task success, defined by execution-based validation scripts that inspect artifacts and OS/application state. Fractional rewards may be granted if supported by evaluation code.
- Step and Temporal Efficiency: OSWorld-Human (Abhyankar et al., 19 Jun 2025) introduces:
- Step Efficiency Ratio: $\mathrm{SER} = N_{\text{agent}} / N_{\text{human}}$, the number of agent actions divided by the length of the human reference trajectory; $\mathrm{SER} > 1$ indicates a less efficient action sequence than the human oracle.
- Weighted Efficiency Score (WES): weights task success by step efficiency relative to the human trajectory, rewarding efficient, correct completion and penalizing protracted failures.
- Latency Analysis: end-to-end latency decomposes as $T_{\text{episode}} = \sum_t \tau_t$, where $\tau_t$ is the wall-clock time of step $t$. Per-step latency grows approximately linearly with context length at later steps, largely due to LLM context prefill.
- Human Baseline: Human agents succeed on ~72.36% of tasks with median completion times of ~112s/task, setting a de facto reference standard (Xie et al., 11 Apr 2024, Andreux et al., 22 Oct 2025).
- Additional Metrics:
- Pass@k: Fraction of tasks for which at least one of $k$ runs succeeds (a computation sketch follows this list).
- Tool Invocation Rate (TIR) (Jia et al., 28 Oct 2025): Fraction of MCP-augmented tasks in which tool invocations are used appropriately.
- Average Completion Steps (ACS): Average number of atomic actions required for successful completion.
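These metrics reduce to simple computations over per-task logs. The sketch below assumes each task record carries agent/human step counts and per-run success flags; the data layout is illustrative.

```python
def step_efficiency_ratio(agent_steps, human_steps):
    """SER > 1 means the agent needed more actions than the human oracle."""
    return agent_steps / human_steps

def pass_at_k(per_task_runs, k):
    """Empirical pass@k: fraction of tasks with >= 1 success among the first k runs.
    per_task_runs maps task_id -> list of booleans (one per run)."""
    hits = sum(any(runs[:k]) for runs in per_task_runs.values())
    return hits / len(per_task_runs)
```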
| Model | OSWorld Success Rate (GUI-only, %) | OSWorld-MCP Success Rate (GUI+MCP, %) | Pass@5/Pass@10 (%) |
|---|---|---|---|
| Human | 72.4 | – | – |
| Surfer 2 | 60.1 | – | 72.0 / 77.0 |
| Step-GUI-8B | 48.5 | – | – |
| DART-GUI-7B | 42.1 (30-step cap) | – | – |
| GTA1-7B | 45.2 | – | – |
| GUI-Owl-7B | 29.4 | – | – |
| MobileAgent-v3 | 37.7 | – | – |
| Best MCP | – | 35.3 (Claude 4 Sonnet) | – |
No model in the open GUI-only regime matches the single-run human baseline; with pass@5–10 retries, state-of-the-art approaches closely match or surpass human-level success rates (Andreux et al., 22 Oct 2025, Yan et al., 17 Dec 2025, Yang et al., 8 Jul 2025).
5. Foundation Models, Agent Architectures, and Learning Algorithms
- Vision-Language Planning and Grounding: Modern agents (e.g., Surfer 2, Mobile-Agent-v3, GUI-Owl, DART-GUI) operate via fully visual observation (raw pixel input, no a11y tree at inference), with separate planning and grounding modules (Andreux et al., 22 Oct 2025, Ye et al., 21 Aug 2025). Chain-of-thought planning is often explicitly modeled, decoupled from low-level action grounding (Andreux et al., 22 Oct 2025).
- Hierarchical Context and Memory: Persistent internal state tracks goals, past actions, and environmental notes. This context amortizes exploration and reduces redundant actions over long horizons (Andreux et al., 22 Oct 2025).
- Self-Verification: Validator modules assess predicted effects against validator scripts and prompt replanning on misprediction to reduce error cascades (Andreux et al., 22 Oct 2025).
- Reinforcement Learning Frameworks: Efficient multi-turn RL underlies recent breakthroughs:
- GRPO/TRPO/ARPO: Group Relative Policy Optimization and variants use clipped PPO-style objectives and (optionally) replay buffers to stabilize multi-task and long-horizon agent learning (Lu et al., 22 May 2025, Li et al., 28 Sep 2025, Ye et al., 21 Aug 2025); a minimal sketch of the group-relative objective follows this list.
- Reward Structure: Training incorporates sparse terminal rewards, dense geometric/semantic rewards for action accuracy, and soft LLM-judge capability signals (Yan et al., 17 Dec 2025).
- Asynchronous Training Pipelines: Distributed clusters of VMs and rollout workers enable high-throughput sample generation and online policy updates (Li et al., 28 Sep 2025, Ye et al., 21 Aug 2025).
- Agent Efficiency Constraints: Step caps, action syntax checking, and per-trajectory validation enforce policy tractability and safety.
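As referenced above, the group-relative objective at the core of GRPO-style updates can be written compactly. The sketch below standardizes terminal rewards within a rollout group and applies a clipped PPO-style surrogate; variable names and per-trajectory granularity are simplifying assumptions (real pipelines typically operate per token or per step).

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Group-relative policy optimization sketch.
    rewards:  (G,) terminal rewards for G rollouts of the same task
    logp_new: (G,) summed log-probs of sampled actions under the current policy
    logp_old: (G,) summed log-probs under the behavior (rollout) policy"""
    # Group-relative advantage: standardize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    # Clipped PPO-style surrogate objective (negated for gradient descent).
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```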
6. GUI Grounding, Data Synthesis, and Tool Augmentation
- OSWorld-G and GUI Grounding: To address grounding bottlenecks, OSWorld-G provides finely annotated examples across diverse UI elements and capabilities. Specialized models (e.g., Jedi-trained Qwen2.5-VL) demonstrate 22-point success rate gains, highlighting grounding as the principal failure mode for LLM agents (Xie et al., 19 May 2025, Yang et al., 8 Jul 2025).
- Synthetic Data Generation: Large-scale synthetic grounding datasets (Jedi: 4 million samples) are generated from icon mining, code-and-render pipelines, and layout exports, enabling robust compositional generalization and transfer to unseen UIs (Xie et al., 19 May 2025).
- MCP Tool Integration: OSWorld-MCP establishes a standardized protocol for agent-driven tool invocation, dramatically reducing necessary action steps on tool-beneficial tasks. Automated code generation and manual curation yield 158 high-quality tools across 7 desktop domains. Effective use of MCP tools increases task accuracy by up to 18 percentage points in weak baseline agents (Jia et al., 28 Oct 2025).
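MCP tool invocation is carried over JSON-RPC 2.0. The request below is a minimal sketch of what an MCP-augmented agent might emit; the tool name and argument schema are hypothetical and do not correspond to OSWorld-MCP's actual tool catalog.

```python
import json

# Hypothetical MCP tool call as a JSON-RPC 2.0 request. The "tools/call" method
# follows Model Context Protocol convention; the tool name and arguments are
# illustrative only.
request = {
    "jsonrpc": "2.0",
    "id": 17,
    "method": "tools/call",
    "params": {
        "name": "libreoffice_calc.set_cell",
        "arguments": {"file": "~/Desktop/report.xlsx", "cell": "B10", "value": 48250},
    },
}
print(json.dumps(request, indent=2))  # sent to the MCP server over stdio or HTTP
```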
7. Limitations, Outstanding Issues, and Future Directions
- Agent Efficiency: Current-generation agents remain 40–170% less step-efficient than humans, with LLM-based planning and reflection accounting for up to 94% of latency. Per-step execution times grow due to prompt-length inflation, imposing a practical ceiling on real-world deployment (Abhyankar et al., 19 Jun 2025).
- Long-Horizon and Multi-App Workflows: Agents are challenged by tasks exceeding 10–100 GUI operations, requiring persistent memory, adaptive recovery, and robust exploration.
- Grounding Failures: Residual grounding errors, especially in compositional, fine-grained, or visually ambiguous UIs, are persistent across architectures. Combining multiple grounding data perspectives and architectural innovations remains necessary (Xie et al., 19 May 2025).
- Tool Selection and Reasoning in OSWorld-MCP: Even when exposed to MCP tools, agents frequently fail to recognize applicability or select among similar tool affordances. Tool invocation rates remain low (≤36.3%), suggesting a need for improved retrieval, composition, and intention recognition (Jia et al., 28 Oct 2025).
- Evaluation and Benchmarking: OSWorld explicitly enforces zero-shot, execution-based evaluation. Extensions are ongoing into mobile, web, and specialized science/computing domains, with open-source resources available for reproducibility across research groups (Yan et al., 17 Dec 2025, Ye et al., 21 Aug 2025).
OSWorld, through its unified real-OS execution environment, diverse task distribution, and support for both atomic and tool-level agent operation, has become the de facto standard for rigorous multimodal agent research in the computer-use domain. It enables measurement and analysis of generalist agents under controlled, reproducible conditions, in some settings on par with or exceeding human performance. Continued progress in GUI grounding, tool reasoning, and sample-efficient learning will determine the attainable trajectory toward practical, robust computer-use assistants (Xie et al., 11 Apr 2024, Abhyankar et al., 19 Jun 2025, Andreux et al., 22 Oct 2025, Xie et al., 19 May 2025, Jia et al., 28 Oct 2025, Yan et al., 17 Dec 2025).