WindowsAgentArena Benchmark

Updated 12 May 2026

WindowsAgentArena is an open benchmark that evaluates autonomous agents on Windows desktops using visual feedback and precise GUI actions.
It simulates realistic, multi-application workflows—including Office suites, web browsing, and system configuration—on authentic cloud-hosted Windows VMs.
WAA employs a POMDP formulation with defined action budgets and success metrics to drive innovations in robust, long-horizon desktop automation.

WindowsAgentArena (WAA) is a large-scale, open benchmark designed to rigorously evaluate autonomous computer-use agents operating within the Microsoft Windows operating system. Established as a counterpart to prior Linux-centric benchmarks such as OSWorld, WAA emphasizes multimodal, long-horizon GUI automation across a spectrum of realistic desktop tasks, leveraging raw desktop screenshots as its primary observation modality. WAA is now a principal yardstick for gauging the agentic capabilities of systems that seek to generalize human-like interaction patterns across diverse Windows applications and workflows (Bonatti et al., 2024, Gonzalez-Pumariega et al., 2 Oct 2025, Han et al., 23 Apr 2026).

1. Design Goals, Scope, and Motivation

WindowsAgentArena was created to address deficiencies in earlier computer-use benchmarks, which focused on web navigation, text-based interfaces, or non-Windows environments. Its chief motivation is to simulate the breadth and heterogeneity of real-world Windows desktop usage at scale, including the interplay of multiple applications, nested dialogs, and visually intricate workflows such as document editing, code authoring, web browsing, system configuration, and multimedia manipulation (Bonatti et al., 2024, Gonzalez-Pumariega et al., 2 Oct 2025).

Key properties:

Breadth: 154 tasks spanning Office suites (LibreOffice Writer/Calc), Web browsers (Chrome, Edge), File Explorer, Settings, VS Code, VLC, and utilities like Notepad, Calculator, Clock, Paint.
Multimodality: Tasks require agents to operate exclusively via visual feedback (RGB screenshots) and GUI interaction primitives. No DOM or accessibility tree is exposed, except in specific agent implementations tested.
Realism: Uses full-featured cloud-hosted Windows VMs with genuine OS and application installations, not emulated or HTML-reconstructed UIs.
Reproducibility and Scalability: Evaluations are parallelizable on public cloud (e.g., Azure, AWS), using containerized VMs with per-task snapshotting for consistent environments (Bonatti et al., 2024).

2. Task Structure, State and Action Spaces

WAA formalizes each interaction episode as a partially observable Markov decision process (POMDP) (Agashe et al., 1 Apr 2025, Gonzalez-Pumariega et al., 2 Oct 2025):

State ( $s_t$ ): The comprehensive desktop configuration at time $t$ , including app windows, filesystem state, and open dialogs.
Observation ( $o_t$ ): A pixel-level screenshot of the entire desktop.
History ( $h_t$ ): Sequence $(o_0, a_0, o_1, ..., o_t)$ , accessible to the agent.
Action Space: Composed of atomic GUI actions:
- Mouse primitives: click(x,y), double_click(x,y), drag_to, scroll.
- Keyboard input: type(text), press_key, hotkey.
- Composite/utility actions: call_code_agent() for direct code execution (in some agent variants), call_search_agent.
- Task flow: agent.done(), agent.fail().
Constraints: Step budgets of 15 (original), 30 (V2), 50, or 100 actions per task, with stricter limits for more challenging settings (Agashe et al., 1 Apr 2025, He et al., 20 May 2025, Gonzalez-Pumariega et al., 2 Oct 2025, Han et al., 23 Apr 2026).
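
The formulation above can be sketched as a short episode loop. This is a minimal illustration, not the benchmark's actual harness: the `Click`, `TypeText`, `Done`, and `Fail` classes, the `run_episode` function, and the convention that a terminal action consumes one step of the budget are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

Screenshot = bytes  # stand-in for the RGB screenshot o_t (assumption)

@dataclass(frozen=True)
class Click:          # hypothetical stand-in for click(x, y)
    x: int
    y: int

@dataclass(frozen=True)
class TypeText:       # hypothetical stand-in for type(text)
    text: str

@dataclass(frozen=True)
class Done:           # hypothetical stand-in for agent.done()
    pass

@dataclass(frozen=True)
class Fail:           # hypothetical stand-in for agent.fail()
    pass

Action = Union[Click, TypeText, Done, Fail]
History = List[Union[Screenshot, Action]]  # h_t = (o_0, a_0, o_1, ..., o_t)

def run_episode(
    policy: Callable[[History], Action],
    step_env: Callable[[Action], Screenshot],
    first_obs: Screenshot,
    budget: int = 15,
) -> Tuple[str, int]:
    """Run one episode and return (status, steps_used)."""
    history: History = [first_obs]
    for steps_used in range(1, budget + 1):
        action = policy(history)          # the agent sees only the history
        history.append(action)
        if isinstance(action, (Done, Fail)):
            # Terminal actions end the episode; they count as one step here.
            return type(action).__name__.lower(), steps_used
        history.append(step_env(action))  # next screenshot o_{t+1}
    return "budget_exhausted", budget

# Usage with a scripted policy: after k actions the history holds 2k + 1 items.
plan = [Click(10, 20), TypeText("hello"), Done()]
status, steps = run_episode(lambda h: plan[len(h) // 2], lambda a: b"png", b"png")
print(status, steps)
```

The agent never receives the underlying state $s_t$; it only sees the growing history of screenshots and its own actions, which is what makes the setting partially observable.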

Tasks are specified by natural language instructions. The diversity and complexity of tasks range from simple file renames to multi-application workflows and manipulations involving dynamic visual states (e.g., delayed UI feedback, transient dialogs) (Bonatti et al., 2024, Gonzalez-Pumariega et al., 2 Oct 2025).

3. Evaluation Criteria and Protocol

The primary metric is Success Rate (SR), the fraction of tasks completed successfully: $\mathrm{SR} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\text{task}_i \text{ succeeded}]$, where $N$ is the number of tasks and a task counts as succeeded only if the post-episode VM state passes the executable, hand-written evaluator script for that task. The typical protocol involves:

Each task evaluated from a clean snapshot, ensuring independence and reproducibility (He et al., 20 May 2025, Gonzalez-Pumariega et al., 2 Oct 2025).
Agents must complete tasks within the step budget; exceeding the budget or issuing agent.fail() results in failure.
No agent receives demonstrations or task-specific training for WAA; zero-shot transfer is emphasized (Agashe et al., 2024).
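
The SR definition and the failure rules above can be combined in a few lines. This is a sketch under stated assumptions: `EpisodeResult`, its fields, and the status strings are hypothetical names chosen for the example, and the real evaluator scripts inspect VM state rather than a boolean flag.

```python
from typing import Iterable, NamedTuple

class EpisodeResult(NamedTuple):
    """Hypothetical per-task record (field names are assumptions)."""
    task_id: str
    status: str             # "done", "fail", or "budget_exhausted"
    evaluator_passed: bool  # outcome of the task's evaluator script

def succeeded(result: EpisodeResult) -> bool:
    # A task succeeds only if the agent stopped with done() within the
    # budget AND the evaluator accepts the final state.
    return result.status == "done" and result.evaluator_passed

def success_rate(results: Iterable[EpisodeResult]) -> float:
    """SR = (1/N) * sum of per-task success indicators."""
    results = list(results)
    if not results:
        raise ValueError("success rate is undefined for zero tasks")
    return sum(succeeded(r) for r in results) / len(results)

results = [
    EpisodeResult("rename_file", "done", True),
    EpisodeResult("set_alarm", "done", False),            # evaluator rejects
    EpisodeResult("edit_sheet", "budget_exhausted", False),
    EpisodeResult("change_theme", "fail", False),
]
print(success_rate(results))
```

Because the indicator is binary, partial progress on a task contributes nothing to SR.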

Recent enhancements (WindowsAgentArena-V2) enforce strict task feasibility, snapshot resets per task, and human-in-the-loop verification for nuanced completion criteria, eliminating infeasible or ambiguous objectives (He et al., 20 May 2025).

4. Agent Architectures and State-of-the-Art Performance

WindowsAgentArena has catalyzed several lines of agent research. The following table summarizes success rates of leading approaches, grouped by step budget (15–30, 50, or 100 steps) (Han et al., 23 Apr 2026, Gonzalez-Pumariega et al., 2 Oct 2025, Wang et al., 2 Sep 2025, Agashe et al., 1 Apr 2025, Agashe et al., 2024, Bonatti et al., 2024):

Method	15–30 Steps	50 Steps	100 Steps
NAVI (baseline)	19.5%	–	–
Agent S	18.2%	–	–
Agent S2	29.8%	–	–
UI-TARS-1.5	–	42.1%	–
UI-TARS-2	–	50.6%	–
Agent S3 (no scal.)	–	49.0%	50.2%
bBoN (Agent S3/GPT5)	–	54.1%	56.6%
VLAA-GUI	–	60.4%	61.0%
Human	–	74.5%	–

Over time, SR has improved from ∼19% (NAVI) to over 60% (VLAA-GUI), with recent systems leveraging:

Hierarchical modular planning and compositional architectures (Agent S2/S3) (Agashe et al., 1 Apr 2025, Gonzalez-Pumariega et al., 2 Oct 2025).
Multi-turn RL with value pretraining and enhanced PPO objectives (UI-TARS-2) (Wang et al., 2 Sep 2025).
Rollout scaling and trajectory selection via Behavior Best-of-N (bBoN) using behavior narratives and VLM-based comparative judges (Gonzalez-Pumariega et al., 2 Oct 2025).
Modular “STOP–RECOVER–SEARCH” frameworks, incorporating mandatory visual verifiers, loop-breaking strategies, and on-demand search/coding modules (VLAA-GUI) (Han et al., 23 Apr 2026).

VLAA-GUI (Gemini 3 Flash backbone) currently sets the highest published SR at 60.4% (50-step) and 61.0% (100-step), ahead of bBoN by 6.3pp and 4.4pp respectively and well above all earlier baselines (Han et al., 23 Apr 2026).

5. Key Methodological Innovations and Failure Modes

Best-of-N Rollout with Vision-Language Judging (bBoN): Generates $N$ stochastic trajectories per task and, for each, summarizes the action-effect sequence (“behavior narrative”). A VLM judge compares these narratives in multiple-choice (MCQ) format and selects the trajectory it judges most likely to have succeeded, which improves robustness to stochasticity and compounding errors (Gonzalez-Pumariega et al., 2 Oct 2025).
Hierarchical and Specialist Agents: Architectures such as Agent S2 decompose high-level tasks into dynamically replanned subgoals, using a mixture of grounding strategies to localize GUI elements precisely, which is essential in visually complex, multi-app or multi-dialog workflows (Agashe et al., 1 Apr 2025).
Reinforcement Learning and Value Shaping: UI-TARS-2 employs decoupled-GAE, length-adaptive GAE, and PPO with value pretraining, which empirically stabilizes policy learning across sparse rewards and long-horizon tasks, resolving challenges in stale window focus or over-scrolling (Wang et al., 2 Sep 2025).
Modular Action Control (VLAA-GUI): Integrates mandatory Completeness Verifier (for visual evidence at each agent.done), Loop Breaker (switches interaction modes or resets on loops), and on-demand Search Agent to mitigate premature stopping and repetitive failures. Ablations indicate each module contributes 5–11pp SR improvement, with greatest gains in Web/System categories (Han et al., 23 Apr 2026).
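
The selection step of the best-of-N idea described above can be sketched as follows. This is an illustration only, not the published implementation: `Rollout`, `format_mcq`, `select_best`, the prompt wording, and the `judge` callable (which stands in for a VLM call and returns the index of its chosen option) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Rollout:
    """One candidate trajectory plus its behavior narrative (assumed shape)."""
    actions: List[str]
    narrative: str

def format_mcq(instruction: str, narratives: Sequence[str]) -> str:
    """Lay out the candidates as lettered options (supports up to 26)."""
    if len(narratives) > 26:
        raise ValueError("this sketch only letters options A-Z")
    options = "\n".join(
        f"({chr(ord('A') + i)}) {text}" for i, text in enumerate(narratives)
    )
    return f"Task: {instruction}\nWhich trajectory completed the task?\n{options}"

def select_best(
    instruction: str,
    rollouts: Sequence[Rollout],
    judge: Callable[[str, int], int],
) -> Rollout:
    """Ask the judge to pick one of the N rollouts and return it."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    prompt = format_mcq(instruction, [r.narrative for r in rollouts])
    choice = judge(prompt, len(rollouts))
    if not 0 <= choice < len(rollouts):
        raise ValueError(f"judge returned an invalid option index: {choice}")
    return rollouts[choice]

# Usage with a stub judge that always answers (B); a real judge would be a VLM.
candidates = [
    Rollout(["click", "done"], "Opened the Save dialog but never confirmed."),
    Rollout(["click", "type", "done"], "Typed the new name and saved the file."),
]
best = select_best("Rename report.txt", candidates, judge=lambda prompt, n: 1)
print(best.narrative)
```

Comparing all candidates in one prompt, rather than scoring each in isolation, lets the judge use differences between narratives as evidence.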

Dominant failure modes across agents include: menu mis-grounding, failure to reorient in multi-dialog settings, excessive click loops, and incomplete verification of visually subtle completion cues.
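
As an illustration of how the click-loop failure mode might be detected, the sketch below flags an episode whose most recent actions are all identical. It is a deliberately simple heuristic written for this example; the function name `is_stuck` and the `window` parameter are assumptions, and it is not the Loop Breaker used by VLAA-GUI.

```python
from typing import Sequence

def is_stuck(actions: Sequence[str], window: int = 3) -> bool:
    """Return True if the last `window` actions are all the same action."""
    if window < 2:
        raise ValueError("window must be at least 2 to define a repetition")
    if len(actions) < window:
        return False
    recent = actions[-window:]
    return all(action == recent[0] for action in recent)

trace = ["click(40,12)", "click(300,220)", "click(300,220)", "click(300,220)"]
print(is_stuck(trace))        # the last three actions repeat
print(is_stuck(trace[:3]))    # only two repeats so far
```

A harness could respond to a positive result by switching interaction modes or resetting, but exact-match repetition misses loops that cycle through several distinct actions.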

6. Comparative Analysis with Other Benchmarks

WAA is distinct from other benchmarks in several respects:

OSWorld is primarily Ubuntu-based, though many of its tasks are re-expressed in WAA with Windows-specific adaptations (PowerShell for shell, Edge/Chrome integration, UIA trees). Unlike OSWorld, WAA covers additional Windows-only scenarios (ribbon menus, proprietary Office workflows) (Bonatti et al., 2024, Agashe et al., 1 Apr 2025).
AndroidWorld and Mind2Web address mobile or browser-level automation, not the desktop-wide agentic setting of WAA (Agashe et al., 1 Apr 2025).
WAA uniquely supports direct binary script-based evaluators, enabling reinforcement learning paradigms with executable reward functions, not just dataset-style logs (Bonatti et al., 2024).
Its cloud-native architecture allows execution and evaluation at a scale and speed (on the order of 10–20 minutes for ∼150 tasks) impractical with traditional, serial benchmarks (Bonatti et al., 2024).

7. Significance, Current Limitations, and Future Directions

WAA is the current standard for measuring the robustness and adaptability of generalist Windows computer-use agents. It exposes key challenges (cross-OS generalization, precise GUI grounding, and robust multi-application planning) that drive architectural innovation in the field. The strongest reported system, VLAA-GUI, has narrowed the human-agent gap to about 13.5pp (61.0% at 100 steps versus 74.5% for humans), while bBoN and Agent S3 trail humans by roughly 18pp and 24pp respectively (Gonzalez-Pumariega et al., 2 Oct 2025, Han et al., 23 Apr 2026). A plausible implication is that modular control and trajectory-level selection will be core to further advances.
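
The human-agent gap can be recomputed from the 100-step column of the table in Section 4. The snippet below is plain arithmetic on those reported figures; the variable names are chosen for the example.

```python
# Success rates (%) copied from the Section 4 table.
HUMAN_SR = 74.5
SR_AT_100_STEPS = {
    "Agent S3 (no scal.)": 50.2,
    "bBoN (Agent S3/GPT5)": 56.6,
    "VLAA-GUI": 61.0,
}

# Gap to human performance in percentage points, rounded to one decimal.
gaps_pp = {name: round(HUMAN_SR - sr, 1) for name, sr in SR_AT_100_STEPS.items()}

for name, gap in gaps_pp.items():
    print(f"{name}: {gap} pp below human")
```

Only VLAA-GUI falls within the 13–14pp range; the other two systems remain further from the human reference.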

Current limitations include:

Persistently low SR in certain applications (notably Office/LibreOffice), due to intricate UI and interaction workflows.
Remaining ungrounded failures in visually ambiguous or non-standard interface elements.
The need for richer feedback beyond binary success/failure and improved benchmarking in settings where tools (e.g., direct search, coding agents) are essential.

Continued evolution (e.g., WAA-V2) is directed toward tighter task curation, fairer evaluation (snapshot resets, infeasible task removal), and greater alignment with robust autonomous desktop automation paradigms (He et al., 20 May 2025).

References

(Bonatti et al., 2024) Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
(Agashe et al., 2024) Agent S: An Open Agentic Framework that Uses Computers Like a Human
(Agashe et al., 1 Apr 2025) Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
(He et al., 20 May 2025) Efficient Agent Training for Computer Use
(Wang et al., 2 Sep 2025) UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
(Gonzalez-Pumariega et al., 2 Oct 2025) The Unreasonable Effectiveness of Scaling Agents for Computer Use
(Han et al., 23 Apr 2026) VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation