Papers
Topics
Authors
Recent
Search
2000 character limit reached

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Published 30 Apr 2026 in cs.AI and cs.CL | (2604.27776v1)

Abstract: While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating across multiple applications to accomplish complex profession-specific workflows. To bridge this gap, we present a computer-use benchmark in cross-application workflows, named WindowsWorld, designed to systematically assess GUI Agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate four difficulty-level tasks with intermediate inspection, which are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results of leading large models and agents show that: 1) All computer-use agents perform poorly on multi-application tasks (< 21% success rate), far below the performance of simple single-app tasks; 2) They largely fail at tasks requiring conditional judgment and reasoning across $\geq$ 3 applications, stalling at early sub-goals; 3) Low execution efficiency, where tasks often fail despite far exceeding human step limits. Code, benchmark data, and evaluation resources are available at github.com/HITsz-TMG/WindowsWorld.

Summary

  • The paper introduces WindowsWorld, a benchmark with 181 tasks across 17 apps to evaluate autonomous GUI agents in realistic cross-application workflows.
  • It employs a human-in-the-loop pipeline and checkpoint-based scoring to assess intermediate progress and final task success.
  • Experimental findings reveal steep performance drops in multi-application, long-horizon tasks, highlighting challenges in context switching and error recovery.

WindowsWorld: A Process-Centric Benchmark for Cross-Application Autonomous GUI Agents

Motivation and Benchmark Design

WindowsWorld addresses a central deficiency of existing GUI agent benchmarks: the absence of systematic, process-centric evaluation of autonomous agents on realistic, professional multi-application workflows in desktop environments. Prior benchmarks, such as OSWorld and Windows Agent Arena, primarily focus on single-application tasks or short-horizon workflows, leading to inflated evaluation scores that obscure agents' deficiencies in context switching, cross-application information integration, and long-horizon planning. WindowsWorld explicitly targets these gaps by constructing a benchmark of 181 professionally grounded tasks spanning 17 desktop applications and 16 personas across five occupational categories. The task suite is structured into four difficulty levels, including deliberately infeasible instructions for negative constraint evaluation. Approximately 78% of tasks require multi-application coordination, a stark contrast to the distribution in prior works. Figure 1

Figure 1: Left, distribution of applications and task counts; right, breakdown of 16 personas across 5 major professional domains.

Benchmark construction leverages a human-in-the-loop multi-agent pipeline. Task generation is seeded on persona-conditioned routines and application dependencies, refined by deduplication and semantic assurance modules, and culminates in human verification for instruction quality and ambiguity reduction. Task environments are synthesized via LLM-driven file scaffolding, ensuring all dependenciesโ€”like working documents, datasets, and download linksโ€”are programmatically provided. Figure 2

Figure 2: The overall architecture of the human-in-the-loop multi-agent pipeline for benchmark generation.

Task Taxonomy, Persona Specialization, and Evaluation Protocol

Tasks are classified on two axesโ€”complexity and persona. L1 covers atomic single-app manipulations; L2 tasks require sequential multi-app workflows; L3 introduces dynamic, conditional reasoning across apps; L4 consists of infeasible or adversarial instructions. Each persona (e.g., Accountant, Creative Director, Software Engineer) grounds task content in realistic professional activity, further diversifying the operational context and UI interaction demands.

The evaluation protocol is process-centric. For each task, fine-grained intermediate checkpoints are defined and manually validated as path-essential, rather than trajectory-specific. This checkpoint-based assessment departs from the prevalent โ€œall-or-nothingโ€ final-state scoring, enabling reward for meaningful sub-goal achievement and supporting analysis of failure localization in long-horizon workflows.

Analysis of Benchmark Composition

WindowsWorld tasks exhibit significant diversity in both horizon length and application co-occurrence. Feasible tasks (L1โ€“L3) require on average 5 sub-goals; the majority (71.8%) are multi-step and 77.9% engage at least two applications. Even after controlling for length by step-matching short L2 workflows to L1, cross-application tasks remain significantly more challenging, underscoring that the primary bottleneck lies in state transfer and context maintenance rather than step count alone.

Process-aware evaluationโ€”facilitated by the integration of an LLM-based judge (Qwen3-VL-Plus)โ€”demonstrates high reliability, with Pearson correlation scores exceeding 0.91 relative to human judgment on intermediate checkpoints. This enables robust automated scoring and facilities studies of both sub-goal and final-state competence.

Experimental Findings and Model Behavior

A spectrum of SOTA large multimodal models (e.g., Gemini-3-flash/pro, GPT-5.2, Claude-Sonnet 4.5, Qwen3-VL-Plus) and agent frameworks (Agent-S3, UiPath) are benchmarked. Performance is ablated across vision-only, set-of-mark (SoM), and hybrid (screenshot + accessibility tree) input modalities. Key results are as follows:

  • Steep Drop in Multi-Application Performance: All evaluated models exhibit sharp decay in both intermediate and final-state scores with increasing task complexity. For Gemini-3-flash (Hybrid), SintS_{\text{int}} decreases from 67.5% (L1) to 38.6% (L3); SfinalS_{\text{final}} plummets from 35.9% (L1) to 16.7% (L3).
  • Completion vs. Progress Gap: Even top-performing models like Gemini-3-flash (Hybrid) achieve only 20.4% terminal success rate across all task types, with up to a 30% gap between intermediate checkpoint completion and final task accomplishment.
  • Failure on Negative Constraints and Infeasible Tasks: Agents perform unreliably on impossible/infeasible instructions (L4); e.g., GPT-5.2 (SoM) succeeds only 25% of the time in correctly rejecting unachievable tasks, indicating substantial hallucination risk.
  • Independent Modality Impact: Structured hybrid inputs (visual + accessibility) improve grounding, while SoM overlays provide inconsistent benefit and sometimes degrade performance relative to raw screenshots, especially in coordinate localization.
  • Persona-Dependent Outcomes: Analysis by persona category exposes a pronounced long-tail effect, with productivity and creative personas exhibiting larger progressโ€“completion gaps attributable to global consistency challenges in document- and media-centric workflows.

Failure Modes and Trajectory Visualization

Detailed trajectory visualizations illustrate common error classes: (i) unreliable window focus, leading to spurious clipboard actions; (ii) pyautogui failures due to language/IME conflicts; and (iii) failures in content transfer across disparate app UI paradigms. These are typified by inefficient โ€œdriftingโ€ executionsโ€”trajectories with extensive sub-goal progress but ultimate failure in global goal integration. Figure 3

Figure 3: Example execution trace of a successful L1 multi-app task, showing agent GUI observation, action sequence, and state transitions.

Figure 4

Figure 4: Illustration of stochastic GUI failures due to IME pop-up interference and input misalignment in automation.

Figure 5

Figure 5: Failure in application switching, resulting in synchronization errors caused by improper window focus for clipboard operations.

Implications, Limitations, and Future Directions

WindowsWorldโ€™s process-aware, persona-grounded, multi-application benchmark directly exposes the limitations of contemporary GUI agents and multimodal foundation models. Practically, the dominant barriers are not local semantic parsing or atomic action prediction, but context switching, multi-app state management, and consistent long-horizon planning. The persistent gap in checkpoint vs. global completion metrics indicates lack of agent โ€œself-awarenessโ€ and robust error recovery in the face of stochastic GUI states (e.g., IME popups, environmental anomalies).

Theoretically, the benchmark implies that further scaling of model capacity alone is unlikely to close the observed performance deficits. Rather, advances in explicit state tracking, GUI context abstraction, and robust multi-modal policy architectures are required. Additionally, process-centric evaluation frameworks as introduced here should become standard for benchmarks targeting real-world agent deployment in digital productivity environments.

Anticipated future research catalyzed by WindowsWorld includes:

  • Curriculum-based or RL fine-tuning on process-aware long-horizon tasks;
  • Formal incorporation of Model Context Protocols for standardized tool access;
  • Extension to multilingual OS environments and more expressive action modeling;
  • Exploration of meta-agent architectures for more effective cross-application coordination.

Conclusion

WindowsWorld constitutes a thorough and challenging benchmark for process-aware evaluation of autonomous GUI agents in professional, multi-application desktop environments (2604.27776). By systematically revealing the limitations of current agentic modelsโ€”particularly in cross-application coordination, context retention, and global consistencyโ€”it provides a critical step toward the realization and assessment of more robust, general-purpose AI agents for real-world computer use.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.