- The paper introduces a benchmark that reveals LLM-based planning and reflection as the primary source of latency, accounting for up to 94% of total task delay.
- It demonstrates that computer-use agents require 1.4× to 2.7× more steps than human trajectories, highlighting significant inefficiencies.
- The study establishes the OSWorld-Human dataset and a new Weighted Efficiency Score to better evaluate agents for real-world applications.
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
OSWorld-Human introduces a pivotal framework for evaluating and advancing computer-use agents (CUAs) by systematically benchmarking not just their accuracy but, for the first time, their temporal efficiency. This work addresses a critical gap in the assessment of CUAs, whose usability in practical deployments remains severely limited by high end-to-end task latency, despite notable advances in completion rates across established benchmarks.
Core Contributions
The paper offers several concrete contributions:
- Latency Analysis of State-of-the-Art CUAs: Through an in-depth breakdown using Agent S2 as a representative agentic system, the authors find that 75%–94% of agent task latency arises from step-wise LLM-based planning and reflection; action execution and environment perception are negligible by comparison.
- Empirical Demonstration of Step Inefficiency: CUAs consistently take between 1.4× and 2.7× more steps than necessary, as determined by human trajectories, highlighting a fundamental inefficiency in current agent designs even among the highest-performing models.
- Construction of OSWorld-Human Dataset: The OSWorld-Human resource provides manually curated, minimal human trajectories (both per-action and grouped-action) for all 369 tasks in the OSWorld benchmark. This dataset serves as a new empirical gold standard for evaluating efficiency.
- Introduction of Weighted Efficiency Score (WES): The WES metric integrates both task success and step efficiency, providing a more nuanced view of agent performance that explicitly penalizes both over-long solutions and inefficient failures (see the sketch after this list).
- Systematic Evaluation Across 16 Agents: The evaluation covers a spectrum of systems spanning various perception modalities (screenshot, accessibility tree, set-of-marks) and LLMs, revealing a substantial decline in efficiency-adjusted performance compared to original success rates.
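To make the metric concrete, here is a minimal sketch of one plausible WES computation, assuming WES+ credits each successful task by the ratio of human steps to agent steps (averaged over all tasks) and WES− charges each failure in proportion to the steps it consumed relative to the human baseline. The exact definitions and normalization are given in the paper; the dataclass, function names, and numbers below are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    success: bool     # did the agent complete the task?
    agent_steps: int  # steps the agent actually took
    human_steps: int  # minimal human trajectory length (from OSWorld-Human)

def weighted_efficiency_scores(results: list[TaskResult]) -> tuple[float, float]:
    """Hypothetical WES+/WES- computation; see the paper for exact definitions.

    WES+ weights each success by human_steps / agent_steps, so a
    human-efficient success scores 1.0 and a step-heavy one scores less.
    WES- accumulates a negative penalty for failures, scaled by how many
    steps they burned relative to the human baseline.
    """
    n = len(results)
    wes_plus = sum(r.human_steps / r.agent_steps for r in results if r.success) / n
    wes_minus = -sum(r.agent_steps / r.human_steps for r in results if not r.success) / n
    return wes_plus, wes_minus

results = [
    TaskResult(success=True, agent_steps=6, human_steps=6),    # weight 1.0
    TaskResult(success=True, agent_steps=15, human_steps=6),   # 2.5x human steps -> 0.4
    TaskResult(success=False, agent_steps=20, human_steps=8),  # costly failure
]
print(weighted_efficiency_scores(results))  # (approx. 0.47, approx. -0.83)
```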
Key Technical Insights
- Planning and Reflection as Latency Bottlenecks:
For agentic systems that plan and reflect incrementally via LLM calls, both the context length of each call and the number of calls grow with trajectory length, so per-step latency increases as a task progresses. This iterative architecture, common across leading systems, induces compounding delays (see the sketch after this list).
- Promoting Action Grouping:
Human trajectories often combine several actions based on a single observation. The authors formalize this through grouped-action representations, showing that batching actions (when feasible) reduces LLM invocations and therefore aggregate latency, a promising direction for architectural improvements (illustrated in the sketch after this list).
- Modality Trade-offs:
Adding structured UI representations (e.g., accessibility trees) can decrease step count in some tasks, but often at the cost of increased prompt length and total latency, particularly in visually complex applications. The utility of these modalities is highly application- and task-dependent.
- Success Rate Alone Is Not Deployability:
Pure success rate is an inadequate proxy for deployability. The marked drop in WES+ relative to raw success rates underscores the necessity of considering efficiency for any agent intended for real-world use.
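Both the bottleneck and the remedy can be illustrated with a toy cost model. The sketch below uses hypothetical token counts (none of these numbers come from the paper) to contrast a single-action loop, which re-sends the full, growing history on every LLM call, with a grouped-action loop that commits to several actions per observation. Under the assumption that prefill cost scales with prompt length, grouping reduces both the number of calls and the near-quadratic growth of total prefill tokens.

```python
def single_action_cost(num_steps: int, base_tokens: int = 2000,
                       tokens_per_step: int = 300) -> tuple[int, int]:
    """One LLM call per action; every call re-sends the growing history."""
    calls = num_steps
    prefill_tokens = sum(base_tokens + i * tokens_per_step for i in range(num_steps))
    return calls, prefill_tokens

def grouped_action_cost(num_steps: int, group_size: int = 3,
                        base_tokens: int = 2000,
                        tokens_per_step: int = 300) -> tuple[int, int]:
    """One LLM call per observation, committing to group_size actions at once."""
    calls = -(-num_steps // group_size)  # ceiling division
    prefill_tokens = sum(base_tokens + g * group_size * tokens_per_step
                         for g in range(calls))
    return calls, prefill_tokens

# A 24-step trajectory under this toy model: grouping three actions per
# observation cuts LLM calls 3x and total prefill tokens by roughly 3x.
print(single_action_cost(24))   # (24, 130800)
print(grouped_action_cost(24))  # (8, 41200)
```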
Notable Results
The following table summarizes the central empirical results:
| Agent/Modality | OSWorld Success Rate (%) | Single-Action WES+ (%) | Grouped-Action WES+ (%) | WES− |
|---|---|---|---|---|
| UI-TARS-1.5 (100) | 42.5 | 23.7 | 14.3 | -0.22 |
| Agent S2 + Gemini 2.5 (50) | 41.4 | 28.2 | 17.4 | -0.26 |
| Agent S2 + Claude 3.7 (50) | 34.5 | 20.0 | 11.4 | -0.42 |
Even the most successful agents, when scored for efficiency against human trajectories, achieve less than half their reported accuracy, leaving significant room for improvement in real-world usability.
Implications and Future Directions
Practical Deployment:
For CUAs to enable true real-time digital assistance or accessibility, both step count and per-step computation must be minimized. The findings here motivate algorithmic improvements in planning (e.g., more aggressive action grouping) and prompt/LLM optimization (e.g., succinct memory retrieval, sparse history usage).
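As one example of the prompt-side optimizations mentioned above, a sparse-history scheme might keep only the most recent steps verbatim and collapse older ones into a short summary. The sketch below is a hypothetical illustration of that idea, not a technique from the paper; the function and parameter names are invented.

```python
def condensed_history(history: list[str], keep_last: int = 3) -> list[str]:
    """Keep the last `keep_last` steps verbatim; summarize the rest.

    A real system would generate the summary with a cheap model or
    templated extraction; here a placeholder line stands in for it.
    """
    if len(history) <= keep_last:
        return history
    summary = f"[{len(history) - keep_last} earlier steps condensed]"
    return [summary] + history[-keep_last:]

# Prompt length now grows with keep_last, not with trajectory length.
steps = [f"step {i}: clicked element {i}" for i in range(10)]
print(condensed_history(steps))
```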
Architectural Optimizations:
Promoting architectures that can reason about and commit to multiple local steps per observation—potentially by leveraging explicit UI state modeling, higher-level schemas, or reinforcement learning to learn efficient macro-actions—could yield substantial reductions in both latency and API cost.
Dataset and Benchmark Utility:
OSWorld-Human’s rigorous human-annotated trajectories offer a valuable testbed for benchmarking next-generation agent designs both for efficiency and for human-likeness in digital task execution. The WES metric, by capturing both task success and step efficiency, is a strong candidate for future leaderboard adoption and for comprehensive evaluation in production settings.
Potential Research Directions:
- End-to-end optimization of agentic pipelines for latency under success constraints;
- Meta-learning or search-based strategies for dynamic action grouping;
- Application of retrieval-augmented or condensed-prompt strategies to minimize LLM prefill cost;
- Investigation of specialized, smaller, lower-latency models for certain planning/reflection subtasks.
Conclusion
By exposing and quantifying the fundamental inefficiencies of contemporary agentic CUAs, OSWorld-Human sets the stage for efficiency-centric research and evaluation. Its methodologies, datasets, and metrics provide a clear path toward agents that are not only competent but also deployable in interactive and time-sensitive domains. The release of OSWorld-Human and accompanying analysis is poised to shape both algorithmic design and benchmarking practice in the evolving space of computer-use AI agents.