OSWorld-Human: Human Efficiency Benchmark
- OSWorld-Human is a benchmark that defines human-like efficiency by using precise, annotated minimal trajectories to evaluate computer-use agents on both task success and action planning.
- The framework is built on a meticulous two-pass manual annotation process, grouping actions to reflect rapid human performance and establish a clear efficiency reference.
- It highlights that modern agents, despite high accuracy, incur significant latency due to extensive planning calls, emphasizing the need for design strategies that balance success with temporal efficiency.
OSWorld-Human is a benchmark and evaluation framework grounded in the OSWorld environment, specifically developed to assess the efficiency and human-likeness of computer-use agents. Unlike prior benchmarks that emphasized only the eventual success or accuracy of task completion, OSWorld-Human introduces human-annotated gold-standard trajectories for each computer-use task, enabling rigorous temporal and action-level efficiency evaluation. This paradigm shift exposes not only whether agents can solve a given task but also how closely their action planning, step throughput, and overall temporal efficiency approach those of experienced human users.
1. Rationale and Scope
The motivation for OSWorld-Human originates from the observation that contemporary generative AI agents, even those achieving high accuracy on OSWorld, exhibit excessive end-to-end latency, often taking tens of minutes to complete tasks that require humans only a few minutes. The primary finding is that large model calls for planning and reflection dominate agent latency, and as task execution progresses, each additional step tends to incur compounding delays—later steps can take up to three times longer than early ones.
OSWorld-Human addresses this by providing a precise human-efficiency reference point: for every OSWorld task (n=369), a human-minimal trajectory is meticulously annotated. This enables time- and step-efficiency measurements, guiding future system designs toward not only correctness but deployable performance.
2. Construction and Annotation Methodology
The OSWorld-Human dataset consists of human-determined minimal trajectories mapped onto the OSWorld task suite. Creation of these reference trajectories involves:
- Manual annotation by computer science graduate students working in two independent passes, with subsequent cross-validation for consensus.
- Reference to user-submitted instructions and gold files where available, complemented by direct task execution within the OSWorld virtual machine (VM) for verification.
- Formalized grouping of actions: in addition to single-action trajectories (in which every discrete action is counted), “grouped-action trajectories” aggregate consecutive actions that a human could perform under an unchanged contextual observation. For example, entering text into a field and pressing Enter are treated as a single grouped action when performed without a new observation, reflecting the compositionality and minimal context-switching of expert users (a minimal sketch of this grouping rule appears at the end of this section).
This explicit ground-truthing process ensures that each step in the trajectory is both necessary and sufficient for task completion, establishing a rigorous lower bound for agent evaluation.
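To make the grouping rule concrete, here is a minimal, hypothetical sketch; the `Action` type and its `needs_new_obs` flag are illustrative stand-ins rather than the paper's annotation schema. Consecutive actions that require no fresh look at the screen collapse into a single grouped action:

```python
# A hypothetical sketch of the grouped-action rule: consecutive actions that a
# human could issue without needing a fresh observation collapse into one group.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # e.g. "click", "type", "hotkey"
    target: str          # element or text involved
    needs_new_obs: bool  # True if the actor must look at the screen again first

def group_actions(trajectory: list[Action]) -> list[list[Action]]:
    """Collapse consecutive actions performed under an unchanged observation."""
    groups: list[list[Action]] = []
    for action in trajectory:
        # Start a new group when the action requires re-observing the screen
        # (or when there is no open group yet); otherwise extend the last group.
        if not groups or action.needs_new_obs:
            groups.append([action])
        else:
            groups[-1].append(action)
    return groups

# Example mirroring the text: click a field, type, and press Enter with no new
# observation in between -- counted as a single grouped action.
steps = [
    Action("click", "search box", needs_new_obs=True),
    Action("type", "quarterly report", needs_new_obs=False),
    Action("hotkey", "Enter", needs_new_obs=False),
]
assert len(group_actions(steps)) == 1
```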
3. Efficiency Evaluation and Benchmarking Results
The benchmark is applied to 16 leading computer-use agents whose performance had previously been compared only via traditional OSWorld accuracy metrics. OSWorld-Human evaluates not only success rates but also action- and plan-level efficiency.
Key findings include:
- Even the top-performing agents on OSWorld require, on average, 1.4 to 2.7 times as many steps as the human trajectory to solve a given task.
- Efficiency is quantified using the Weighted Efficiency Score (WES), which penalizes agents for using more steps than the human baseline and deducts further for failed completions.
- Comparative studies reveal that agents with similar or higher success rates may differ dramatically in step efficiency. For example, Table 1 in the paper shows that in LibreOffice Calc, the human trajectory averages 13.6 steps (single-action) or 5.9 steps (grouped-action), a level of conciseness that no agent yet matches.
Average human steps per trajectory, by application:
| Application | Single-Action (avg. human steps) | Grouped-Action (avg. human steps) |
|---|---|---|
| OS | 4.9 | 3.8 |
| Thunderbird | 9.6 | 8.8 |
| VS Code | 6.3 | 5.1 |
| LibreOffice Writer | 9.0 | 6.1 |
| VLC | 6.3 | 4.8 |
| GIMP | 4.6 | 3.2 |
| LibreOffice Impress | 8.5 | 4.5 |
| Chrome | 6.8 | 5.0 |
| LibreOffice Calc | 13.6 | 5.9 |
The Weighted Efficiency Score (WES) is used for cross-system evaluation. Let $s_h$ denote the expected (human) step count, $s_a$ the agent's step count, and $s_{\max}$ a predefined maximum step cutoff:
- For successes: $\mathrm{WES} = \dfrac{s_h}{\max(s_h, s_a)}$, so an agent that matches the human trajectory scores 1 and loses credit proportionally as it takes additional steps.
- For failures: $\mathrm{WES} = -\dfrac{s_a}{s_{\max}}$, deducting more the longer the agent runs before failing.
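Under this reading of the metric, a per-task scorer is straightforward; the sketch below uses the notation above and is an assumption about the exact functional form rather than a transcription of the paper's definition:

```python
def weighted_efficiency_score(success: bool, s_h: float, s_a: float, s_max: float) -> float:
    """Per-task WES: full credit only when the agent is at least as concise as the
    human reference; failed runs are penalized in proportion to the steps spent."""
    if success:
        return s_h / max(s_h, s_a)   # equals 1.0 when s_a <= s_h, shrinks as s_a grows
    return -s_a / s_max              # more negative the longer the failed run

# Example: an agent solving a 6-step (human) task in 9 steps scores 6/9 ~= 0.67.
print(weighted_efficiency_score(True, s_h=6, s_a=9, s_max=30))
```

One natural aggregation is the mean of this per-task score over all tasks, which rewards both completing tasks and completing them in as few steps as the human reference.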
4. Temporal Performance and Latency Analysis
The investigation establishes that most of the latency in agent operation stems from large model calls used in stepwise planning and trajectory reflection. At each step, the system’s prompt accumulates the entire prior action history (for context conditioning), causing the token length—and therefore prefill time—of LLM calls to increase monotonically over the trajectory. Later actions thus suffer disproportionately higher latency.
- Empirical timing data indicate that LLM planning/reflection contribute between 75% and 94% of the total latency for most tasks.
- For complex applications like LibreOffice, generation of accessibility (a11y) trees and longer multimodal prompts further inflate per-step delays, with some a11y trees taking 3 to 26 seconds to produce.
Consequently, the excessive number of steps not only impairs efficiency but also magnifies per-step computational cost, making such agents impractical for deployment in time-sensitive environments.
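The mechanism described above can be illustrated with a toy example (the prompt template and actions are hypothetical, not OSWorld code): each planning call re-sends the entire action history, so the prompt the model must prefill grows monotonically with the step index.

```python
# Toy illustration (hypothetical prompt template, not OSWorld code) of how a
# full-history planning prompt grows with every step of the trajectory.
def build_planning_prompt(task: str, history: list[str], observation: str) -> str:
    """Naive prompt: the instruction, every prior step, and the current screen."""
    lines = [f"Task: {task}"]
    lines += [f"Step {i + 1}: {act}" for i, act in enumerate(history)]
    lines.append(f"Current observation: {observation}")
    lines.append("Next action:")
    return "\n".join(lines)

history: list[str] = []
for step in range(1, 6):
    prompt = build_planning_prompt("rename the spreadsheet", history, f"<screen #{step}>")
    # The prompt length (a proxy for prefill tokens) increases monotonically,
    # so later planning calls pay a higher per-step latency.
    print(step, len(prompt))
    history.append(f"click(element_{step})")  # hypothetical action appended to history
```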
5. Implications for Agent Design and Future Research
OSWorld-Human analysis directs future research toward optimizing not just the accuracy, but also the efficiency of computer-use agents. Recommendations include:
- Reducing the number of context tokens sent per LLM inference (shorter prompts and histories) to minimize per-step prefill latency.
- Grouping multiple elementary actions into batches that can be executed with a single planning/reflection pass (mirroring the grouped-action human trajectories).
- Streamlining the planning and reflection pipeline to delay or minimize the need for full-history context, reducing wall-clock runtime.
- Using the public OSWorld-Human gold trajectories and the WES metric as target baselines for both model evaluation and reward shaping in RL or imitation-learning settings (see the sketch following this list).
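As one illustration of the last recommendation, the per-task WES sketched in Section 3 can serve as a shaped reward when scoring agent rollouts against the gold trajectories; the episode records and step cutoff below are hypothetical:

```python
# Hypothetical use of per-task WES as a reward signal when scoring rollouts
# against the human-minimal ("gold") step counts; episode records are made up.
def weighted_efficiency_score(success: bool, s_h: float, s_a: float, s_max: float) -> float:
    # Same per-task WES as sketched in Section 3.
    return s_h / max(s_h, s_a) if success else -s_a / s_max

episodes = [
    {"task": "rename_sheet", "success": True,  "agent_steps": 9,  "gold_steps": 6},
    {"task": "crop_image",   "success": False, "agent_steps": 15, "gold_steps": 4},
]
MAX_STEPS = 30  # illustrative cutoff standing in for s_max

rewards = [
    weighted_efficiency_score(e["success"], e["gold_steps"], e["agent_steps"], MAX_STEPS)
    for e in episodes
]
print(rewards)  # [0.666..., -0.5]
```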
A direct implication is that agent development strategies should explicitly balance success rate and temporal efficiency, as high-accuracy agents with excessive planning/replanning quickly become unusable in real-world applications.
6. Broader Context and Significance
OSWorld-Human reorients the field towards practical, deployable intelligent systems by making efficiency explicit in agent evaluation. Its dual emphasis on action optimality and temporal responsiveness exposes the limitations of viewing task completion as a binary outcome and establishes new, more stringent criteria for progress.
Human-level efficiency, rather than maximal model accuracy, is identified as the main unresolved challenge for next-generation computer-use agents. The benchmark provides the necessary infrastructure and metrics for quantifying and ultimately bridging the efficiency gap between AI agents and skilled human users.
7. Conclusions
By integrating human-minimal, validated trajectories and an efficiency-centric evaluation methodology, OSWorld-Human constitutes a foundational advance in benchmarking human-like interaction and efficiency for computer-use agents. This orientation is central for the development of agents that are competent not just in “what” they achieve but “how” they achieve it—setting the research agenda for efficient, contextually aware automation in GUI-based computing environments (Abhyankar et al., 19 Jun 2025).