OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents (2505.03570v1)

Published 6 May 2025 in cs.AI

Abstract: In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks in increasing levels of complexity, from basic precision clicking to multistep, multiapplication tests requiring dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the benchmark test cases to ensure that the SOTA (State of the Art) agents (at the time of publication) do not achieve results higher than 50%, while the average white collar worker can perform all these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism that has an average error rate less than 2%. Therefore, this benchmark presents solid ground for fully automated measuring of progress, capabilities and the effectiveness of GUI-navigation AI agents over the short and medium-term horizon. The source code of the benchmark is available at https://github.com/agentsea/osuniverse.

Summary

  • The paper introduces OSUniverse, a new comprehensive benchmark designed to evaluate multimodal AI agents navigating complex desktop GUI tasks, highlighting the significant gap between human and current SOTA AI performance.
  • OSUniverse features varied task complexities, support for diverse agent architectures, extensive coverage of real-world scenarios including multi-application workflows, and a robust automated validation system.
  • Experimental results using the benchmark show that even leading GUI navigation agents currently achieve success rates below 50%, indicating substantial opportunity for improvement in agent architectures and model training.

OSUniverse: A Comprehensive Benchmark for GUI-Navigation AI Agents

The paper "OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents" authored by Mariya Davydova, Daniel Jeffries, Patrick Barker, Arturo Márquez Flores, and Sinéad Ryan, introduces OSUniverse, a novel benchmark devised for evaluating advanced GUI-navigation AI agents. The benchmark focuses on complex, multimodal desktop-oriented tasks, with a strong emphasis on layers of complexity, ease of extensibility, comprehensive test coverage, and automated validation. OSUniverse aims to fill the existing gap in benchmarks designed to evaluate GUI-navigation agents' capacity for real-world task performance.

Benchmark Design and Structure

OSUniverse categorizes tasks into complexity levels ranging from simple precision clicking to intricate multi-step, multi-application challenges that demand dexterity, precision, and clear reasoning. The benchmark is calibrated so that state-of-the-art (SOTA) agents do not exceed a 50% success rate, whereas an average white-collar worker can complete every task with perfect accuracy. This calibration exposes the substantial gap between human and current AI performance and underscores the difficulty of GUI-navigation tasks.
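This summary does not spell out how a test case is represented, but a tiered task could plausibly be modeled as below. This is a minimal illustrative sketch in Python: apart from the Silver and Gold category names mentioned later in this summary, every tier name, field, and value here is an assumption rather than the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    """Complexity tiers. Silver and Gold appear among the paper's
    category names; the others here are illustrative placeholders."""
    BRONZE = "bronze"   # e.g. basic precision clicking
    SILVER = "silver"   # multi-step, single-application tasks
    GOLD = "gold"       # multi-step, multi-application workflows


@dataclass
class Task:
    """Hypothetical descriptor for one benchmark test case."""
    task_id: str
    tier: Tier
    prompt: str                                    # instruction given to the agent
    apps: list[str] = field(default_factory=list)  # applications involved
    max_steps: int = 50                            # step budget before the run fails


example = Task(
    task_id="calc-001",
    tier=Tier.BRONZE,
    prompt="Open the calculator and compute 17 * 23.",
    apps=["calculator"],
)
```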

The benchmark can be scored manually or evaluated automatically; the automated validation mechanism has an average error rate below 2%, which attests to the robustness and reliability of the benchmark's scoring system.
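The summary does not describe the validator's internals, only its sub-2% average error rate. One common design for non-deterministic GUI validation, sketched here purely as an assumption, is a multimodal judge that inspects the final screen; `judge`, its `ask` method, and all other names below are hypothetical, not OSUniverse's actual API.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    passed: bool
    rationale: str


def validate_run(final_screenshot: bytes, expected_outcome: str,
                 judge) -> ValidationResult:
    """Hypothetical automated check: ask a multimodal judge model
    whether the final screen satisfies the task's expected outcome.

    `judge` is a stand-in for any multimodal model client; its
    interface is assumed, not taken from the OSUniverse code."""
    verdict = judge.ask(
        image=final_screenshot,
        question=(
            "Does this screen show that the following outcome was "
            f"achieved? Outcome: {expected_outcome}. "
            "Answer PASS or FAIL, then give one sentence of reasoning."
        ),
    )
    passed = verdict.strip().upper().startswith("PASS")
    return ValidationResult(passed=passed, rationale=verdict)
```

A judge-based check like this is inherently probabilistic, which would be consistent with the paper reporting a small but nonzero error rate rather than exact deterministic matching.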

Comparison with Existing Benchmarks

OSUniverse addresses several shortcomings of existing benchmarks such as WebShop, Mind2Web, and OSWorld, which focus primarily on browser-based activities and often fail to capture the nuanced difficulty of real-world GUI interactions. OSWorld, for instance, while notable for its difficulty, suffers from a challenging setup, a restriction to ReACT-style agents, reliance on deterministic validation, and vague task prompts.

OSUniverse tackles these issues by providing:

  • Flexibility: It supports varied environment configurations, action spaces, agent architectures, models, runtimes, and validation processes; a sketch of what such a pluggable agent interface might look like follows this list.
  • Comprehensive Coverage: It incorporates tasks that require reasoning, visual perception, multi-app workflows, and procedural knowledge, thus mirroring the complexity encountered in real-world scenarios.
  • Versatility: The benchmark is open to diverse agent models and architectures, encouraging innovation and allowing researchers to explore different frameworks.
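To make the flexibility claim concrete, here is a minimal sketch of a pluggable agent interface; the class, methods, and environment API below are invented for illustration and should not be read as OSUniverse's real interfaces.

```python
from abc import ABC, abstractmethod
from typing import Any


class Agent(ABC):
    """Hypothetical plug-in point: any architecture (ReACT-style,
    planner-executor, end-to-end VLM, ...) implements the same
    observe/act contract, so the harness stays agnostic."""

    @abstractmethod
    def act(self, screenshot: bytes, instruction: str) -> dict[str, Any]:
        """Map the current screen and task instruction to one action,
        e.g. {"type": "click", "x": 412, "y": 230} or
        {"type": "type_text", "text": "hello"}."""


def run_episode(agent: Agent, env, instruction: str, max_steps: int = 50) -> bool:
    """Drive any Agent implementation against any environment exposing
    screenshot() / step() / done() (an interface assumed here)."""
    for _ in range(max_steps):
        action = agent.act(env.screenshot(), instruction)
        env.step(action)
        if env.done():
            return True
    return False
```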

Experimental Evaluation and Insights

The benchmark results reveal a stark reality: even the most advanced GUI-navigation agents currently score below 50%, leaving substantial room for improvement. Notably, OpenAI's Computer Use Preview agent shows both formidable potential for GUI-navigation tasks and significant variability in performance. These observations suggest ample opportunity for further development in both proprietary and open-weight models, particularly through architectural and model-specific improvements.

The results underscore the advantage of models specifically trained for GUI navigation. For instance, models like Claude Computer Use and QWEN 2.5 VL 72B demonstrate commendable generalization, and their performance improves markedly when they operate within their native action spaces.
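The native-action-space observation suggests a thin translation layer between a benchmark-level action vocabulary and each model family's own format. The sketch below shows one way such an adapter could be written; the dialect names, action formats, and screen size are illustrative assumptions only.

```python
# Hypothetical adapter: translate a generic benchmark action into the
# format a particular model family was trained to emit and consume.
# All format details below are illustrative assumptions.

GENERIC = {"type": "click", "x": 412, "y": 230}


def to_native(action: dict, dialect: str) -> dict:
    if dialect == "xy_pixel":
        # Models trained on absolute pixel coordinates.
        return {"action": action["type"],
                "coordinate": [action["x"], action["y"]]}
    if dialect == "normalized":
        # Models trained on 0-1000 normalized coordinates
        # (a 1920x1080 screen is assumed here).
        return {
            "action": action["type"],
            "coordinate": [
                round(action["x"] / 1920 * 1000),
                round(action["y"] / 1080 * 1000),
            ],
        }
    raise ValueError(f"unknown dialect: {dialect}")


print(to_native(GENERIC, "normalized"))
```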

Future Directions and Implications

The paper concludes by indicating future avenues for benchmarking and agent development:

  • Extension of Task Complexity: Expanding the Silver and Gold categories so the benchmark remains challenging as SOTA models evolve.
  • Diversity in Applications and Test Cases: Integrating a richer set of applications and use cases, including account-specific tasks, to broaden the benchmark's challenges.
  • Advanced Agent Implementations: Improving agent architectures through reinforcement learning, better memory utilization, and rigorous prompt engineering.

The OSUniverse benchmark represents a pivotal step towards developing and standardizing assessments for GUI-navigation AI agents. Its comprehensive design and diverse evaluation metrics contribute to evolving AI research in multimodal agent systems, emphasizing the nuanced complexities inherent in real-world human-machine interactions. The source code availability encourages community contribution and collaborative enhancement, fostering continuous progress in the discipline.
