AndroidWorld Benchmark

Updated 18 December 2025
  • AndroidWorld is a dynamic benchmarking environment that evaluates autonomous mobile GUI agents using real Android applications through diverse and parameterized tasks.
  • It employs state-driven reward checking and deterministic setup/teardown routines to ensure reproducible, robust evaluation of planning, perception, and reinforcement learning methods.
  • The benchmark measures agent performance via metrics like Task Success Rate (TSR) over long-horizon, multi-app workflows, highlighting both strengths and unresolved challenges.

AndroidWorld is a dynamic, high-fidelity benchmarking environment for evaluating autonomous mobile GUI agents operating on real-world Android systems. Developed to address the absence of reproducible, challenging testbeds for the field, AndroidWorld encapsulates the diversity, dynamism, and reward sparsity characteristic of practical mobile interaction. Its design enforces realism, thoroughness, and robustness in the evaluation of planning, perception, generalization, and reinforcement learning methods for multimodal, instruction-conditioned agents. AndroidWorld is widely adopted for empirical studies of both foundation model agents and specialized GUI-control policies, serving as a critical standard for measuring progress in mobile human-computer interaction automation (Rawles et al., 23 May 2024, Mo et al., 20 May 2025).

1. Benchmark Design and Task Suite

AndroidWorld comprises 116 core end-to-end tasks instantiated over 20 real-world Android applications, including but not limited to Contacts, Camera, Markor, Simple Calendar, Chrome, Pro Expense, Files, and Audio Recorder (Mo et al., 20 May 2025, Ye et al., 21 Aug 2025, Yan et al., 17 Dec 2025). Each “task template” encodes a generic mobile user goal, such as "Create a new contact," "Send a message with an attachment," or "Edit and save calendar events." At runtime, these templates are dynamically parameterized—with randomization over names, numbers, dates, or content—leading to virtually infinite concrete task instantiations (Rawles et al., 23 May 2024, Lai et al., 27 Apr 2025).
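
To make the parameterization concrete, the sketch below instantiates a "Create a new contact" template with randomized parameters. The class and method names are illustrative assumptions for exposition, not AndroidWorld's actual task API.

```python
import random
import string
from dataclasses import dataclass

# Hypothetical sketch of template parameterization; names do not
# mirror AndroidWorld's real task classes.

@dataclass
class ContactParams:
    name: str
    phone: str

class CreateContactTemplate:
    """Generic goal: 'Create a new contact for {name} with number {phone}.'"""

    @staticmethod
    def generate_params(rng: random.Random) -> ContactParams:
        # Randomize names and numbers so each episode is a fresh instance.
        name = rng.choice(string.ascii_uppercase) + \
               "".join(rng.choices(string.ascii_lowercase, k=6))
        phone = "+1" + "".join(rng.choices(string.digits, k=10))
        return ContactParams(name=name, phone=phone)

    @staticmethod
    def goal(params: ContactParams) -> str:
        return f"Create a new contact for {params.name} with number {params.phone}."

rng = random.Random(42)  # a fixed seed makes the instantiation reproducible
params = CreateContactTemplate.generate_params(rng)
print(CreateContactTemplate.goal(params))
```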

Task complexity spans a wide distribution:

  • Easy: Single-step actions (e.g., toggling a system setting).
  • Medium: Multi-screen navigation and form filling (5–15 atomic operations).
  • Hard: Long-horizon, cross-app workflows, frequently exceeding 20 steps, with visually dependent or composite goals and tight step budgets.

The environment strictly enforces feasibility and reproducibility. Each task implements dedicated initialization (preparing system state, clearing databases), deterministic checking (validating task success or partial credit directly from system state, e.g., querying SQLite or verifying file existence), and teardown routines to guarantee fair, repeatable testing (Rawles et al., 23 May 2024, Lai et al., 27 Apr 2025).
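
This contract can be pictured as a three-phase task lifecycle. The sketch below is a hedged illustration: the method names and the adb-based database check are assumptions for exposition, not AndroidWorld's real interface.

```python
import subprocess

class CreateContactTask:
    """Hypothetical lifecycle: initialize -> agent acts -> check -> teardown."""

    def __init__(self, contact_name: str):
        self.contact_name = contact_name

    def initialize(self) -> None:
        # Reset system state so every run starts identically, e.g. by
        # wiping the contacts provider before the episode begins.
        subprocess.run(
            ["adb", "shell", "pm", "clear", "com.android.providers.contacts"],
            check=True,
        )

    def is_successful(self) -> bool:
        # Deterministic check against system state (not the rendered UI):
        # query the contacts content provider for the expected row.
        result = subprocess.run(
            ["adb", "shell", "content", "query",
             "--uri", "content://com.android.contacts/contacts",
             "--where", f"display_name='{self.contact_name}'"],
            capture_output=True, text=True,
        )
        return self.contact_name in result.stdout

    def teardown(self) -> None:
        # Restore a clean state for the next task instance.
        self.initialize()
```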

Agents interact through atomic GUI operations—click, long-press, swipe, type, back, home, app-launch, and “finish” or “done” signals—over a standard Android emulator (Pixel 6, Android 13+) (Yan et al., 17 Dec 2025, Mo et al., 20 May 2025). Observations provided to agents are either pixel-level screenshots (2400×1080) or, in some tracks, include accessibility tree dumps with annotated element information. No privileged or simulated “oracle” actions are permitted.
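
One way to picture this action space is as a small structured schema that the agent's policy emits at each step; the field names below are assumptions rather than the benchmark's canonical encoding.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative encoding of the atomic action space described above.

@dataclass
class GUIAction:
    action_type: str                  # "click" | "long_press" | "swipe" |
                                      # "type" | "back" | "home" |
                                      # "open_app" | "done"
    x: Optional[int] = None           # pixel coordinates for click/long_press
    y: Optional[int] = None
    direction: Optional[str] = None   # "up"/"down"/"left"/"right" for swipe
    text: Optional[str] = None        # payload for type
    app_name: Optional[str] = None    # target for open_app

# Example: tap a "Save" button at (540, 2210) on the 2400x1080 screen,
# then emit the terminal "done" signal.
trajectory = [
    GUIAction(action_type="click", x=540, y=2210),
    GUIAction(action_type="done"),
]
```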

2. Evaluation Metrics, Protocols, and Analysis

The principal success criterion is Task Success Rate (TSR), the fraction of tasks for which the agent reaches the goal state within a prescribed step/turn budget (commonly 10–30 actions per task) (Mo et al., 20 May 2025, Rawles et al., 23 May 2024). The success condition is defined as the underlying Android system matching all postcondition checks specified for the given task (e.g., existence of a new calendar event with correct parameters), not merely surface-level UI changes.
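
Formally, letting $s_{T_i}$ denote the terminal system state of episode $i$, $G_i$ the set of states satisfying all of its postconditions, $T_i$ the number of actions taken, and $B_i$ its step budget (notation introduced here for exposition), over $N$ evaluated instances:

$$\mathrm{TSR} \;=\; \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\big[\, s_{T_i} \in G_i \ \text{ and } \ T_i \le B_i \,\big]$$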

Emulator or system instabilities (e.g., VM crashes, slowdowns, CAPTCHA triggers) can introduce “non-model” failures; recent evaluations report Pass@1 (success on first attempt) and Pass@3 (success in any of three independent runs per task) to correct for stochastic issues (Yan et al., 17 Dec 2025). A small subset of tasks supports partial completion, with per-task scores of 1.0 (full), 0.5 (partial), or 0.0 (fail) (Andreux et al., 22 Oct 2025).
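
Aggregating such runs is mechanical; below is a minimal sketch (the run-matrix layout is an assumption) of how Pass@1 and Pass@3 fall out of per-run outcomes.

```python
from statistics import mean

def pass_at_k(runs_per_task: list[list[bool]], k: int) -> float:
    """Fraction of tasks solved in at least one of the first k runs.

    runs_per_task[i][j] is True iff run j of task i satisfied all
    postcondition checks within the step budget.
    """
    return mean(any(runs[:k]) for runs in runs_per_task)

# Three independent runs for four tasks (illustrative data only).
runs = [
    [True,  True,  True ],   # solved every run
    [False, True,  False],   # flaky: counts for Pass@3 but not Pass@1
    [False, False, False],   # never solved
    [True,  False, True ],
]
print(f"Pass@1 = {pass_at_k(runs, 1):.2f}")   # 0.50
print(f"Pass@3 = {pass_at_k(runs, 3):.2f}")   # 0.75
```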

Some studies report auxiliary metrics, such as average interaction length, stepwise accuracy, offline-progress proxies (e.g., Semi-Online Performance), or ablation-based gain decompositions (Lu et al., 15 Sep 2025, Li et al., 21 Jul 2025). Human performance is typically ~80%, providing an empirical reference point (Lai et al., 27 Apr 2025).

3. Algorithmic Approaches and Baseline Agents

Research on AndroidWorld has driven advances in agent architectures across several paradigms:

  • Vision-Language Foundation Models (VLMs): Zero-shot or directly supervised policies using large models (GPT-4o, Qwen2.5-VL, Claude, Gemini, etc.) to map images and instructions to GUI actions (Zhang et al., 21 Mar 2025, Wu et al., 16 Oct 2025). Early VLM-only agents achieved 30–46% TSR; reasoned VLMs (e.g., Claude 3.7 Sonnet with chain-of-thought) raise this to ~65% (Zhang et al., 21 Mar 2025).
  • Planning-augmented Agents: Hierarchical planners, including EFSM-based modules (SPlanner), multi-level skill hierarchies (Mirage-1), and classical BFS solvers, reduce the cognitive load on VLMs and mitigate task failures caused by navigation drift (Mo et al., 20 May 2025, Xie et al., 12 Jun 2025). SPlanner achieves 63.8% TSR, a +28.8pp absolute improvement over its VLM baseline (Mo et al., 20 May 2025).
  • Verifier-driven and Modular Agents: Paradigms such as V-Droid apply LLMs as action verifiers using a discretized action space, preference-based selection, and human-agent joint annotation. This closes the generation-verification gap and enables sub-second action selection, with 59.5% TSR and 0.7s/step latency (Dai et al., 20 Mar 2025).
  • Hierarchical and Reflective Control: MobileUse and D-Artemis introduce reflective modules (action/trajectory/global reflection, alignment, post-execution diagnostics) and make extensive use of app-specific knowledge retrieval and consistency checks. D-Artemis attains 75.8% on GUI-Owl-32B, and MobileUse achieves 62.9% on Qwen2.5-VL-72B-Instruct (Li et al., 21 Jul 2025, Mi et al., 26 Sep 2025).
  • Reinforcement Learning (RL): Several RL frameworks address the challenges of sparse, terminal rewards and heavy-tailed task difficulty. Methods such as Group Relative Policy Optimization (GRPO), ADAGRPO (with SPA, AdaPR, FCF), and trajectory-level reward adjustment drive open-source models (Step-GUI-8B, MobileRL-9B) to 80.2% (Pass@3) (Yan et al., 17 Dec 2025, Xu et al., 10 Sep 2025). Semi-Online RL and off-policy approaches (SoLS-STR) are also competitive, emphasizing sample efficiency and credit assignment (Papoudakis et al., 1 Sep 2025, Lu et al., 15 Sep 2025). A minimal GRPO advantage sketch follows this list.
  • Self-evolving Pipelines and Data Efficiency: Recent pipelines (Step-GUI, GUI-Owl) focus on self-generated trajectory refinement, step-reward calibration, and automatic data augmentation, enabling >90% annotation accuracy at low manual cost and strong parameter efficiency (Yan et al., 17 Dec 2025, Ye et al., 21 Aug 2025).
  • Vision-only and Cross-platform Generalist Agents: Surfer 2, a unified vision-only agent, reports 87.1% Pass@1, leveraging hierarchical memory, decoupled planning/execution, adaptive self-verification, and action-parameter grounding for robust cross-environment transfer (Andreux et al., 22 Oct 2025).
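
As referenced in the RL item above, GRPO sidesteps the sparse-reward problem by standardizing each rollout's terminal reward against sibling rollouts of the same task, removing the need for a learned critic. Below is a minimal sketch of the group-relative advantage (generic GRPO, not any specific paper's variant).

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Generic GRPO-style advantages for one task instance.

    rewards holds the terminal 0/1 rewards of G rollouts sampled for the
    same task; each advantage is the reward standardized against the
    group, so successes are pushed up and failures pushed down.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Eight rollouts of one task: three succeeded, five failed.
rewards = torch.tensor([1., 0., 0., 1., 0., 0., 1., 0.])
print(group_relative_advantages(rewards))
# Successes get positive advantage, failures negative; if all rollouts
# tie (all succeed or all fail), advantages collapse to ~0.
```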

The following table summarizes recent leading agent results on the 116-task AndroidWorld suite:

| Method / Model | Success Rate (%) | Key Architecture |
| --- | --- | --- |
| Step-GUI-8B / MobileRL-9B | 80.2 (Pass@3) | RL + self-evolving reward pipeline / difficulty-adaptive GRPO |
| D-Artemis (GUI-Owl-32B) | 75.8 | Deliberative cognitive loop, TAC/ACA/SRA modules |
| Mobile-Agent-v3 (GUI-Owl-32B) | 73.3 | Fully asynchronous self-evolving RL + reflection |
| Claude Sonnet Thinking (M3A) | 64.7 | Chain-of-thought VLM (reasoning-enhanced Claude 3.7) |
| SPlanner + Qwen2.5-VL-72B | 63.8 | EFSM-based planning + VLM executor |
| MobileUse | 62.9 | Hierarchical reflection + proactive exploration |
| V-Droid | 59.5 | Verifier-driven decision, pairwise training |
| Hi-Agent (72B + 7B) | 56.5 | Hierarchical RL, foresight advantage, joint optimization |

4. Principal Technical Challenges

AndroidWorld’s difficulty arises from the following intertwined properties:

  • Dynamic GUIs and OOD Generalization: State and layout drift under real app updates, randomized parameters, and diverse device settings force agents to operate under genuine out-of-distribution (OOD) regimes (Mo et al., 20 May 2025, Rawles et al., 23 May 2024).
  • Long-horizon, Multi-app Workflows: Many tasks span >20 step horizons and require precise, error-tolerant navigation through hierarchically nested UI states. Failure recovery and persistent memory are essential (Mo et al., 20 May 2025, Andreux et al., 22 Oct 2025).
  • Sparse Rewards and Strict Budgets: Feedback is strictly binary and arrives only upon task completion. With step limits typically of 10–30 actions, error tolerance is low: a single misstep can preclude success (Mo et al., 20 May 2025, Yan et al., 17 Dec 2025).
  • Visual and Semantic Reasoning: Agents must accurately extract semantic information from varied screen renderings, including large text blocks, complex widgets, and task-specific targets with nontrivial grounding requirements (Ye et al., 21 Aug 2025, Mo et al., 20 May 2025).
  • Security and Robustness: Agents are highly vulnerable to adversarial environment injection attacks (AEIA), notably adversarial notifications and timing attacks, which reach attack success rates of up to 93% and destabilize task completion (Chen et al., 18 Feb 2025).
  • Scalability and Data Scarcity: Manual annotation (e.g., EFSM modeling, trajectory curation) can be costly; much effort has been invested in automated data collection, synthetic augmentation, and reward-program induction (Yan et al., 17 Dec 2025, Sapora et al., 2 Oct 2025).

5. Analysis of Strengths, Weaknesses, and Empirical Insights

The combination of parameterized task templates, underlying state-based reward checking, and dynamic UI randomness ensures that AndroidWorld provides a high-fidelity, non-trivial testbed for GUI agent development. Performance gaps with human operators persist (humans succeed on roughly 80% of tasks, while most leading agents score 60–80%), especially on hard, multi-app, and visually ambiguous tasks (Rawles et al., 23 May 2024, Mo et al., 20 May 2025, Yan et al., 17 Dec 2025).

Notable strengths include:

  • Rigorous OOD transfer: Ensures fair benchmarking of generalization and robustness beyond narrow trajectory overfitting.
  • Reward correctness: System-state driven evaluation eliminates mismatches due to UI rendering or incidental layout changes.
  • Flexible, transparent protocol: Support for both pass@k and per-task breakdown fosters robust, repeatable empirical comparison.

Key limitations and unresolved challenges:

  • Sparse reward and stepwise feedback: The lack of intermediate reward drives research on step-reward shaping, curriculum design, and planning-augmented prompting (Xu et al., 10 Sep 2025, Gu et al., 14 Aug 2025); a generic shaping sketch follows this list.
  • Annotation, EFSM modeling, and data curation overhead: App modeling remains semi-manual (e.g., SPlanner’s EFSMs require 1–2 hours per app) (Mo et al., 20 May 2025).
  • Security vulnerabilities: Environmentally injected attacks can cause catastrophic agent failures, with only marginal improvements via naive defense prompts; no mainstream deployment yet includes strong trust verification (Chen et al., 18 Feb 2025).
  • Partial observability and vision errors: In pixel-only agent tracks, agents sometimes fail to disambiguate visually similar UI elements or suffer from OCR fragility, especially for hard-coded color/state cues (Yan et al., 17 Dec 2025, Rawles et al., 23 May 2024).
  • Evaluation scaling: Emulator instability and “non-model” failures motivate the use of multiple trials and pass@k, and highlight the need for hardware-in-the-loop, real-device evaluation (Yan et al., 17 Dec 2025).
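
For the first limitation above, a common mitigation is potential-based step-reward shaping, which densifies the binary terminal signal without changing the optimal policy. The sketch below is generic (the progress estimates would come from, e.g., a learned step-reward model) and is not any cited paper's exact scheme.

```python
def shaped_rewards(progress: list[float], terminal_reward: float,
                   gamma: float = 0.99) -> list[float]:
    """Potential-based shaping over one episode.

    progress[t] is a heuristic task-progress potential in [0, 1] after
    step t; the 0/1 terminal success reward is preserved, so shaping
    leaves the optimal policy unchanged (Ng et al.'s classic result).
    """
    rewards = []
    for t in range(len(progress)):
        prev = progress[t - 1] if t > 0 else 0.0
        shaping = gamma * progress[t] - prev   # F(s, s') = gamma*phi(s') - phi(s)
        base = terminal_reward if t == len(progress) - 1 else 0.0
        rewards.append(base + shaping)
    return rewards

# A 4-step episode that steadily progresses and ends in success.
print(shaped_rewards([0.2, 0.5, 0.8, 1.0], terminal_reward=1.0))
```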

6. Impact, Adoption, and Research Directions

AndroidWorld has catalyzed a wave of methodological advances in the GUI agent field, spanning the planning-augmented, verifier-driven, reflective, and RL-based designs surveyed above.

Current open challenges include scaling reward models to partial credit and real-world verification, closing the human–agent performance gap, robustness under attack, data-efficient long-horizon reasoning, standardization for privacy-protecting on-device execution, and adaptation to rapidly changing app-ecosystem versions. Efforts such as Step-GUI’s Model Context Protocol (GUI-MCP), cross-benchmark integration (including AndroidDaily), and vision-only generalists (Surfer 2, Hi-Agent) point toward a unified, real-world deployment standard for digital agents (Yan et al., 17 Dec 2025, Andreux et al., 22 Oct 2025, Wu et al., 16 Oct 2025).
