WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point (2502.08047v3)

Published 12 Feb 2025 in cs.AI and cs.MA

Abstract: GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real application scenarios, but existing benchmarks fail to evaluate it. To address this gap, we introduce WorldGUI, a comprehensive GUI benchmark containing tasks across ten widely used desktop and web applications (e.g., PowerPoint, VSCode, Acrobat), each instantiated with diverse initial states to simulate authentic human-computer interactions. Complementing this, we propose WorldGUI-Agent, a universal framework that unifies three core modules: Planner-Critic for high-level plan refinement, Step-Check for intermediate verification, and Actor-Critic for action-level optimization, to proactively detect and correct errors. Experimental evaluation shows that WorldGUI-Agent outperforms the strongest existing model (Claude-3.5 Computer Use) by 12.4% in success rate on WorldGUI, and achieves a 31.2% overall success rate on WindowsAgentArena, surpassing the prior state of the art by 11.7%. Our analysis further reveals that dynamic augmentation tasks and desktop environments pose substantial hurdles, underscoring the necessity of adaptive planning and feedback-driven execution for advancing real-world GUI automation. The code and data are available at https://github.com/showlab/WorldGUI.

Summary

  • The paper introduces WorldGUI as an interactive benchmark that challenges GUI automation agents with diverse initial states.
  • It leverages the GUI-Thinker framework, which enhances planning and adaptability through iterative self-assessment components.
  • Empirical results show a 14.9% improvement in success rate over Claude-3.5 Computer Use, underscoring the framework's practical impact on GUI agent evaluation.

WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

This paper presents WorldGUI, a benchmark for evaluating graphical user interface (GUI) automation agents. It targets the persistent challenge of dynamic initial states, which significantly affect the planning abilities of GUI agents. Existing efforts have mainly focused on static scenarios and thus fall short of assessing agents' capabilities in the fluctuating environments of real-world use. WorldGUI addresses this gap with a framework that both imitates real application conditions and raises the bar for GUI agent evaluation.

WorldGUI comprises 315 tasks drawn from ten widely used desktop applications, including PowerPoint, Excel, and VSCode. Each task is instantiated with varied initial states, testing agent adaptability to dynamic environments. The design covers scenarios in which applications do not start from a default state, users interrupt tasks midway, or interfaces present conditions requiring on-the-fly decision-making by the agent. This marks a significant advance over static benchmarks such as OSWorld and WebArena, which do not sufficiently capture the dynamic nature of GUI tasks.
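
To make the task structure concrete, the minimal sketch below shows one plausible way such a dynamically initialized task instance could be represented in Python. The schema, field names, and the PowerPoint example are illustrative assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class WorldGUITask:
    """Hypothetical shape of one benchmark task; the real schema may differ."""
    app: str                     # target application, e.g. "PowerPoint"
    instruction: str             # natural-language goal for the agent
    setup_actions: list[str] = field(default_factory=list)
    # Scripted steps that perturb the initial state before the agent starts,
    # e.g. opening a dialog or leaving a prior task half-finished.

# The same instruction paired with two different initial states: a plan
# written for the default state would fail on the perturbed variant.
default_start = WorldGUITask(
    app="PowerPoint",
    instruction="Set the transition of every slide to 'Fade'.",
)
perturbed_start = WorldGUITask(
    app="PowerPoint",
    instruction="Set the transition of every slide to 'Fade'.",
    setup_actions=["open the Transitions tab", "select only slide 3"],
)
```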

A pivotal contribution is the GUI-Thinker framework (called WorldGUI-Agent in the abstract above). Unlike its predecessors, GUI-Thinker emphasizes critical thinking through three core components: Post-Planning Critique, Pre-Execution Validation, and Post-Action Evaluation, corresponding to the Planner-Critic, Step-Check, and Actor-Critic modules named in the abstract. Together these components let the agent handle tasks dynamically by self-verifying and iteratively refining each plan and action. Experimental results attest to the framework's efficacy: GUI-Thinker outperformed Claude-3.5 by a 14.9% margin in success rate on WorldGUI tasks, demonstrating sophisticated planning and adaptability in complex environments.
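
The sketch below illustrates how the three verification stages could compose into a single agent loop. All names here (AgentModules, step_check, critique_action, and so on) are hypothetical stand-ins for model calls, based on the paper's description rather than the released implementation at https://github.com/showlab/WorldGUI.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AgentModules:
    """Pluggable model calls; each name is an illustrative stand-in."""
    plan: Callable[[str, bytes], List[str]]                 # task, screenshot -> step list
    critique_plan: Callable[[List[str], bytes], List[str]]  # revise plan against observed state
    step_check: Callable[[str, bytes], bool]                # are the step's preconditions met?
    act: Callable[[str, bytes], str]                        # execute a step, return the action taken
    critique_action: Callable[[str, str, bytes], bool]      # did the action have its intended effect?

def run_agent(task: str, m: AgentModules,
              screenshot: Callable[[], bytes], max_retries: int = 3) -> None:
    # Planner-Critic: draft a plan, then refine it against the *observed*
    # initial state instead of assuming the app starts from its default.
    steps = m.critique_plan(m.plan(task, screenshot()), screenshot())

    i = 0
    while i < len(steps):
        # Step-Check: verify the step still applies before executing it.
        if not m.step_check(steps[i], screenshot()):
            steps = m.critique_plan(steps[i:], screenshot())  # replan mid-task
            i = 0
            continue
        # Actor-Critic: execute, verify the effect, retry on failure.
        for _ in range(max_retries):
            action = m.act(steps[i], screenshot())
            if m.critique_action(steps[i], action, screenshot()):
                break
        i += 1
```

The design point this sketch tries to capture is that every screenshot is taken fresh at decision time, so no stage plans or verifies against a stale assumption about the interface.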

The implications of this research are twofold. Practically, WorldGUI sets a new standard for evaluating GUI agents, encouraging future improvements in robustness and versatility. Theoretically, it reframes the conditions under which GUI agents are assessed, broadening our understanding of intelligent agent design and implementation.

Looking forward, this research suggests a trajectory toward more nuanced AI-human interaction in computer environments. Future work might pursue more personalized task adaptation, leveraging user-specific data to further improve automation. Advances in multimodal LLMs, such as those underpinning GUI-Thinker, hold promise for integrating critical thinking, perception, and decision-making across diverse task domains.

Overall, this paper marks an essential step toward the comprehensive assessment and improvement of GUI automation agents. Through WorldGUI and GUI-Thinker, it harnesses critical thinking in AI agents, which can contribute significantly to the continuing evolution of GUI automation.