AndroidWorld Benchmarking Framework

Updated 9 May 2026

AndroidWorld Framework is a dynamic, programmatically controlled Android environment designed for benchmarking autonomous agents using real-world mobile tasks and reproducible evaluations.
It employs extensible task construction, deterministic seeding, and clear reward signals to rigorously assess diverse agent architectures in long-horizon experiments.
Serving as the de facto testbed, the framework supports hierarchical, deliberative, and reinforcement learning agents, enabling comparative performance analysis across varied task difficulties.

AndroidWorld Framework defines a dynamic, programmatically controlled Android mobile environment for benchmarking autonomous agents performing human-level control of real-world applications. Its design features extensible task construction, stateful evaluation logic, ground-truth reward signals, and direct emulation of the Android OS, enabling rigorous assessment of agent capabilities in long-horizon, parameterized mobile workflows. AndroidWorld has become the de facto testbed for state-of-the-art mobile device agents, including hierarchical planners, reinforcement learning architectures, and cognitive deliberation loops.

1. System Architecture and Task Construction

AndroidWorld operates atop a full-featured Android emulator (e.g., Pixel 6/Android 13) driven through the AndroidEnv library and ADB instrumentation. The framework provides direct access to the emulator via gRPC/ADB, enabling two primary agent-facing API calls: get_state(), which captures a screenshot and the accessibility tree, and step(action), which executes atomic UI operations via ADB primitives such as tap, swipe, and input_text (Rawles et al., 2024).

Task logic is encapsulated in the TaskEval class, with each benchmark task implemented as a subclass defining four critical methods:

generate_random_params() randomly samples task parameters to instantiate unique task variants.
initialize(env) manipulates system/app state for reproducibility (e.g., populating databases, pushing files).
is_successful(env) inspects the system state (via direct queries to SQLite, the filesystem, or settings) and returns a scalar reward $R \in [0,1]$.
teardown(env) undoes side effects, ensuring clean state for every episode.
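
As a minimal sketch of how these four methods fit together, the following Python example implements a toy SMS task. The TaskEval base class shown here, FakeEnv, SendSms, and the rng argument are illustrative assumptions rather than the actual AndroidWorld classes; a real checker would query the device (e.g., its SQLite databases) instead of an in-memory list.

```python
import random
from abc import ABC, abstractmethod


class FakeEnv:
    """Hypothetical stand-in for the emulator: a list plays the role of the SMS database."""

    def __init__(self):
        self.sent_sms = []  # (number, message) tuples recorded by the "device"


class TaskEval(ABC):
    """Illustrative base class mirroring the four methods described above."""

    def __init__(self, params):
        self.params = params

    @classmethod
    @abstractmethod
    def generate_random_params(cls, rng):
        """Sample parameters for one task variant (rng is an assumed argument)."""

    @abstractmethod
    def initialize(self, env):
        """Put the device into a known starting state."""

    @abstractmethod
    def is_successful(self, env):
        """Return a reward in [0, 1] computed from device state."""

    @abstractmethod
    def teardown(self, env):
        """Undo side effects so the next episode starts clean."""


class SendSms(TaskEval):
    template = "Send SMS to {number} with text {message}"

    @classmethod
    def generate_random_params(cls, rng):
        number = "555" + "".join(str(rng.randint(0, 9)) for _ in range(4))
        message = rng.choice(["On my way", "Call me later", "Meeting moved"])
        return {"number": number, "message": message}

    def initialize(self, env):
        env.sent_sms.clear()

    def is_successful(self, env):
        # Check the resulting state, not the sequence of UI actions taken.
        target = (self.params["number"], self.params["message"])
        return 1.0 if target in env.sent_sms else 0.0

    def teardown(self, env):
        env.sent_sms.clear()
```

In this sketch, a harness would call generate_random_params, then initialize, let the agent act, call is_successful once, and finally call teardown. Because the check reads state rather than the action trace, any UI path that actually sends the message earns the reward.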

Task templates use structured natural language instructions with slot filling, supporting virtually unlimited variations. A global seed ensures reproducibility across millions of possible configurations. Tasks include information retrieval (e.g., "List events on {date}") and side-effect benchmarks (e.g., "Send SMS to {number} with text {message}") (Rawles et al., 2024).
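
The interaction between slot filling and seeding can be illustrated with a short Python sketch. The instantiate function and the date-sampling range are hypothetical choices made for this example; only the template string comes from the text above.

```python
import datetime
import random

TEMPLATE = "List events on {date}"


def instantiate(template, seed):
    """Fill the template's slots from a seeded generator (illustrative only)."""
    rng = random.Random(seed)  # private generator: same seed, same task variant
    # Assumed sampling range: any day of 2024 (a leap year, hence 366 days).
    day = datetime.date(2024, 1, 1) + datetime.timedelta(days=rng.randrange(366))
    params = {"date": day.isoformat()}
    return template.format(**params), params


goal, params = instantiate(TEMPLATE, seed=42)
repeat_goal, _ = instantiate(TEMPLATE, seed=42)
print(goal == repeat_goal)  # the same seed reproduces the same instruction
```

Using a dedicated random.Random instance instead of the module-level functions keeps the sampled parameters independent of any other randomness in the process.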

2. Observation, Action Space, and Reward Model

The environment exposes an explicit Markov Decision Process (MDP) or Contextual MDP abstraction (Gu et al., 8 Mar 2026):

Observation Space: At each timestep $t$, the agent receives a high-resolution screenshot $o_t \in \mathbb{R}^{H \times W \times 3}$, optionally augmented with a parsed UI element list derived from the accessibility tree.
Action Space: JSON-encoded atomic GUI commands (click, swipe, input_text, open_app, etc.) as well as functional calls (long_press, system_button, scroll), parameterized by coordinates or content (Mi et al., 26 Sep 2025).

Episode transitions follow the emulator's update logic, which may be deterministic or stochastic: $o_{t+1} = \text{Env.step}(a_t)$.

Reward is sparse and episodic, produced by the ground-truth checker at episode termination: $R(\tau) = 1$ if the final system state satisfies the natural language instruction, and $0$ otherwise. Composite scores may be averaged over subgoals where appropriate, yielding fractional values in $[0,1]$. This architecture eliminates reliance on brittle UI-matching or LLM-based post-evaluation (Rawles et al., 2024, Mi et al., 26 Sep 2025).
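
The observe-act loop and the terminal reward can be sketched in Python as follows. ToyEnv, scripted_agent, run_episode, and the "action_type" key are assumptions made for illustration; the real environment returns screenshots and accessibility data and supports a much richer action set.

```python
import json


class ToyEnv:
    """Hypothetical stand-in for the emulator; it only tracks one text field."""

    def __init__(self):
        self.text_field = ""

    def get_state(self):
        # A real state holds a screenshot and UI tree; a dict shows the control flow.
        return {"screenshot": None, "ui_elements": [{"text": self.text_field}]}

    def step(self, action_json):
        action = json.loads(action_json)  # actions arrive JSON-encoded
        if action["action_type"] == "input_text":
            self.text_field += action["text"]
        return self.get_state()


def scripted_agent(state, goal_text):
    """Type the goal one character per step; return None when finished."""
    typed = state["ui_elements"][0]["text"]
    if typed == goal_text:
        return None
    next_char = goal_text[len(typed)]
    return json.dumps({"action_type": "input_text", "text": next_char})


def run_episode(env, goal_text, max_steps):
    state = env.get_state()
    for _ in range(max_steps):
        action = scripted_agent(state, goal_text)
        if action is None:
            break
        state = env.step(action)
    # Sparse, episodic reward: computed once, from the final state only.
    return 1.0 if env.text_field == goal_text else 0.0


print(run_episode(ToyEnv(), "hi", max_steps=10))    # enough steps to finish
print(run_episode(ToyEnv(), "hello", max_steps=3))  # step budget exhausted
```

No intermediate reward is emitted inside the loop, so an episode that runs out of steps is indistinguishable from one that took wrong actions. This is what makes long-horizon tasks hard to learn from.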

3. Benchmarking Protocols, Difficulty, and Extensibility

AndroidWorld comprises 116 core tasks spanning 20 real-world apps, with extensibility to any APK accessible via ADB (e.g., from F-Droid). Tasks are dynamically instantiated for both training and evaluation, optionally split into train/test according to instance, template, or application for generalization analysis (Gu et al., 8 Mar 2026):

Unseen Instance: Evaluates robustness to parameter instantiation.
Unseen Template: Probes transfer to withheld task templates.
Unseen App: Enforces compositional generalization to new applications.
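
A minimal Python sketch of the three split regimes follows. The record fields ("app", "template", "seed") and the split_tasks helper are hypothetical and are not taken from the released benchmark code.

```python
def split_tasks(instances, mode, held_out):
    """Partition task instances into train/test by instance, template, or app."""
    key = {"instance": "seed", "template": "template", "app": "app"}[mode]
    train = [t for t in instances if t[key] not in held_out]
    test = [t for t in instances if t[key] in held_out]
    return train, test


# Illustrative records, not real benchmark tasks.
instances = [
    {"app": "sms", "template": "send_sms", "seed": 0},
    {"app": "sms", "template": "send_sms", "seed": 1},
    {"app": "calendar", "template": "list_events", "seed": 0},
    {"app": "calendar", "template": "add_event", "seed": 1},
]

train, test = split_tasks(instances, mode="app", held_out={"calendar"})
print(len(train), len(test))
```

Holding out by app is the strictest of the three regimes in this sketch, since every template and instance belonging to the held-out app is also excluded from training.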

The default success metric is the mean success rate $S = \frac{\#\text{successful tasks}}{\#\text{attempted tasks}}$. Step budgets and task difficulty stratification (easy, medium, hard) sharpen evaluation against long-horizon, ambiguous, or edge-case episodes. The framework can be programmatically extended by subclassing and registering additional TaskEval logic or app-specific task parameters (Rawles et al., 2024, Gu et al., 8 Mar 2026).
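
The success metric and its per-difficulty breakdown can be computed as in the Python sketch below. The success_rates function and the sample outcomes are hypothetical and serve only to show the arithmetic.

```python
from collections import defaultdict


def success_rates(results):
    """results: iterable of (difficulty, succeeded) pairs."""
    counts = defaultdict(lambda: [0, 0])  # difficulty -> [successes, attempts]
    for difficulty, succeeded in results:
        counts[difficulty][0] += int(succeeded)
        counts[difficulty][1] += 1
    per_level = {d: s / n for d, (s, n) in counts.items()}
    total_successes = sum(s for s, _ in counts.values())
    total_attempts = sum(n for _, n in counts.values())
    overall = total_successes / total_attempts if total_attempts else 0.0
    return overall, per_level


# Made-up outcomes for illustration only.
results = [
    ("easy", True), ("easy", True),
    ("medium", True), ("medium", False),
    ("hard", False),
]
overall, per_level = success_rates(results)
print(overall, per_level)
```

The overall rate is weighted by the number of attempts per level, so it generally differs from the unweighted mean of the per-level rates when the levels contain different numbers of tasks.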

4. Integration with Agent Architectures

AndroidWorld is the standard evaluation target for multiple agent paradigms:

Hierarchical Cognitive Agents: K²-Agent leverages a planner–executor architecture, partitioning "knowing what" (declarative subgoal planning via SRLR self-evolution from demonstrations) and "knowing how" (procedural execution via C-GRPO) (Wu et al., 28 Feb 2026). The agent operates entirely from raw screenshots, eschewing privileged accessibility input. Knowledge bases are initialized and self-updated through Summarize–Reflect–Locate–Revise cycles.
Deliberative Multi-Stage Agents: D-Artemis instantiates a cognitive loop of tip retrieval, pre-execution alignment (Thought-Action Consistency and Action Correction), and post-execution reflection to inform subsequent decisions. No RL or dataset pre-training is required: strong generalization is achieved by orchestrating modular cognitive routines as the agent iteratively interacts with the emulator (Mi et al., 26 Sep 2025).
Reinforcement Learning Agents: AndroidWorld-Generalization employs Group Relative Policy Optimization (GRPO), training VLMs asynchronously with containerized rollout servers and curriculum sampling. The open-source RL stack supports fine-grained evaluation across seen and zero-shot splits (Gu et al., 8 Mar 2026).
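
For orientation, the group-relative advantage at the core of GRPO can be sketched in Python. This follows the commonly described formulation (each rollout's reward is normalized by the mean and standard deviation of its group); the exact normalization, clipping, and KL terms used by the cited systems are not specified here, and the function name and eps value are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts for the same task."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts of one task instance with binary episodic rewards (illustrative).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# A group in which every rollout succeeds carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

With sparse binary rewards, groups in which all rollouts succeed or all fail yield zero advantages. This is one reason curriculum sampling toward tasks of intermediate difficulty can help.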

Backbone integration is supported for diverse models, including Qwen2-VL-7B, UI-TARS, and GPT-family architectures (Wu et al., 28 Feb 2026, Gu et al., 8 Mar 2026).

5. Empirical Results and Comparative Performance

AndroidWorld benchmarks enable direct, quantitative comparison among architectures:

State-of-the-art success rates: 76.1% for K²-Agent (screenshot only, Qwen2.5-VL-72B+7B) and 75.8% for D-Artemis (Qwen2.5-VL-72B), both surpassing prior open-source and closed-source systems such as M3A (30.6%), UI-TARS-2 (73.3%), and Mobile-Agent-v3 (73.3%) (Wu et al., 28 Feb 2026, Mi et al., 26 Sep 2025). Among the systems tabulated here, only FinalRun (GPT-5, 76.7%), which additionally consumes accessibility input, scores higher.
Difficulty breakdown: For K²-Agent, category success rates are 92% (easy), 78% (medium), and 58% (hard). D-Artemis performance is similarly stratified but is driven by reflection and consistency components (Wu et al., 28 Feb 2026, Mi et al., 26 Sep 2025).
Generalization: K²-Agent demonstrates dual generalization: on ScreenSpot-v2 (single-step, cross-platform), its executor matches large closed models; for Android-in-the-Wild (long-horizon), zero-shot transfer yields strong gains (86.5% on AitW-General; 68.3% on AitW-WebShopping) (Wu et al., 28 Feb 2026). RL-trained 7B agents with curriculum surpass supervised baselines by 26.1 pp on unseen instances but with more limited gains on templates or apps, highlighting persistent generalization barriers (Gu et al., 8 Mar 2026).

Agent	Input	Success Rate (%)
M3A (GPT-4T)	Screenshot + A11y	30.6
FinalRun (GPT-5)	Screenshot + A11y	76.7
UI-TARS-2 (Seed-1.6B)	Screenshot	73.3
Mobile-Agent-v3	Screenshot	73.3
K²-Agent (72B+7B)	Screenshot	76.1 ± 1.0
D-Artemis (Qwen2.5-VL-72B)	Screenshot	75.8

For reproducibility, the AndroidWorld and AndroidWorld-Generalization suites (including Dockerfile, task definitions, rollouts, and model checkpoints) are available as open-source releases (Rawles et al., 2024, Gu et al., 8 Mar 2026).

6. Impact, Limitations, and Extensibility

AndroidWorld has become the principal mobile-agent benchmarking suite, enabling detailed analysis of learning, planning, and generalization capabilities on real Android applications. Its design addresses limitations in earlier environments (MiniWoB++, OSWorld) by supporting task parameterization, robust ground-truth checking, and reproducibility through strict environment resets and deterministic seeding (Rawles et al., 2024). Agents can be evaluated on robustness to task variation, error typology (perceptual, grounding, reasoning), and time efficiency.

A notable challenge remains the substantial drop in zero-shot transfer to novel templates and unseen apps, even for large models and curriculum-based RL (e.g., only +8.3 pp over baseline on unseen apps) (Gu et al., 8 Mar 2026). Robustness analyses further demonstrate non-determinism and performance variance under different seeds, supporting the need for multiple trial averages and careful protocol design.
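
Reporting a mean and spread over several seeds, as in the "76.1 ± 1.0" entry above, can be done as in the Python sketch below. The summarize helper and the per-seed rates are made up for illustration and are not measured results.

```python
import statistics


def summarize(rates):
    """Return (mean, sample standard deviation) of per-seed success rates."""
    mean = statistics.fmean(rates)
    std = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return mean, std


# Hypothetical success rates from three runs with different seeds.
per_seed = [0.70, 0.75, 0.80]
mean, std = summarize(per_seed)
print(f"{100 * mean:.1f} ± {100 * std:.1f}")
```

The sample standard deviation (dividing by n - 1) is used here because a handful of seeds is a small sample of all possible runs; whether a given paper reports this quantity or a standard error should be checked against its protocol.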

The framework supports seamless extension to new apps via APKs and further reward logic, as well as integration of both supervised and RL agents. Its event-driven, containerized infrastructure provides a scalable foundation for large-scale, parallel rollout and evaluation.

7. Future Research Directions

Key open directions include:

Enhanced generalization through improved curriculum strategies, modular architectures, or advanced knowledge transfer mechanisms across apps and UI domains (Wu et al., 28 Feb 2026, Gu et al., 8 Mar 2026).
More interpretable or verifiable agent reasoning and reflection, inspired by deliberative architectures such as D-Artemis (Mi et al., 26 Sep 2025).
Integration with human-in-the-loop evaluation for tasks beyond existing reward logic, and a pathway toward universal, cross-platform agents that operate on both desktop and mobile (Rawles et al., 2024).
Exploration of persistent memory, meta-learning, or context adaptation techniques to address persistent gaps in hard and out-of-distribution task splits (Gu et al., 8 Mar 2026).

The AndroidWorld framework thus defines a robust and extensible foundation for research in mobile autonomous agents, providing critical infrastructure for reproducible, granular assessment of cognitive, learning-based, and deliberative approaches in a dynamic, high-dimensional real-world setting.