AndroidWorld & AndroidLab Benchmarks
- AndroidWorld and AndroidLab benchmarks are dynamic evaluation platforms designed to assess mobile agent performance using real Android apps and randomized task parameters.
- They support multiple observation and interaction modalities and enforce rigorous reproducibility protocols, yielding fine-grained, reproducible metrics such as success rates and operation ratios.
- The platforms facilitate cross-domain generalization and enable diverse applications from agent performance analysis to hardware benchmarking and security evaluations.
AndroidWorld and AndroidLab constitute two principal evaluation suites designed to measure the performance, robustness, and adaptability of autonomous agents operating within Android OS environments. They are distinguished by dynamic, programmatic tasks rooted in real apps, multiple interaction modalities, and rigorous protocols for reproducible, fine-grained quantitative assessment, and they serve as reference platforms for mobile agent research.
1. Benchmark Environment and Design Principles
AndroidWorld provides a dynamic benchmarking platform featuring 116 tasks distributed over 20 authentic Android applications—spanning categories such as productivity, communication, multimedia, and file management—executed on Android emulators running standardized OS builds (Pixel 6, Android 13, F-Droid app sources) (Rawles et al., 23 May 2024). Distinct from synthetic or static web-based testbeds (e.g., MiniWoB++), tasks are instantiated with randomized, natural language instructions, yielding a virtually unbounded suite of scenarios through parameterization.
Each AndroidWorld task is encapsulated as a Python class with controlled initialization, state modification, and systematic teardown. Reward signals derive directly from OS-level state, employing ADB queries, SQLite database inspection, and file system checks that distinguish true system changes from superficial UI modifications. Data collection is further standardized through consistent screen resolutions and integrated accessibility trees.
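A minimal sketch of such a task class is shown below; the class name, package, database path, and schema are illustrative assumptions rather than AndroidWorld's actual API.

```python
# Illustrative sketch of an AndroidWorld-style task class; names, package,
# and the specific checks are assumptions, not the benchmark's actual API.
import random
import sqlite3
import subprocess


class AddCalendarEventTask:
    """Parameterized task: create a calendar event with a randomized title."""

    def __init__(self, seed: int):
        rng = random.Random(seed)
        # Randomized parameter drawn from a controlled seed.
        self.title = f"Meeting-{rng.randint(0, 9999)}"
        self.goal = f"Create a calendar event titled '{self.title}'."

    def initialize_task(self) -> None:
        # Reset app state via ADB so every episode starts from the same state.
        subprocess.run(["adb", "shell", "pm", "clear", "com.example.calendar"], check=True)

    def is_successful(self) -> bool:
        # Reward derives from OS-level state (the app's SQLite database pulled
        # via `adb pull`), not from whatever happens to be rendered on screen.
        db = sqlite3.connect("/tmp/pulled_calendar.db")
        (count,) = db.execute(
            "SELECT COUNT(*) FROM events WHERE title = ?", (self.title,)
        ).fetchone()
        return count > 0

    def teardown(self) -> None:
        subprocess.run(["adb", "shell", "pm", "clear", "com.example.calendar"], check=True)
```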
Similarly, AndroidLab offers a reproducible, systematic framework based on Android Virtual Devices, pre-defined device states, and 138 tasks covering nine diverse apps (e.g., Bluecoins, Calendar, Maps.me, Zoom) (Xu et al., 31 Oct 2024). AndroidLab partitions tasks into "operation tasks" and "query tasks", establishing reproducibility through locked device images, offline testing, and deterministic initialization.
Both benchmarks emphasize:
- Real applications instrumented for evaluation, not synthetic UI fragments.
- Unlimited, parametric task variation for robustness.
- Rigid reproducibility protocols: initialization, success checking, teardown, versioned environments.
2. Task Construction, Modalities, and Action Space
AndroidWorld tasks are dynamically generated with random parameter instantiation, rendering the task distribution non-deterministic and well-suited for robustness analysis. For each task class, instruction placeholders are filled via controlled random seeds, generating substantial variability. Each class provides initialize_task, is_successful, and teardown methods.
Observation modalities consist of:
- Raw screenshots
- Processed accessibility tree (node text, class_name, bounding_box)
- Set-of-Mark (SoM) annotations for multimodal models
The action space is constructed as a fixed set comprising both UI-dependent actions (tap, swipe, type, long-press) and UI-independent actions (home, back, finish). AndroidLab additionally supports two interaction modes:
- XML Mode: agents receive compressed XML representations (suited for LLMs)
- SoM Mode: agents access screenshots annotated with indexed UI elements (suited for LMMs)
Both platforms ensure agents operate under a unified action space, enabling head-to-head comparison of LLMs and LMMs.
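A minimal sketch of such a unified action space is given below; the action names, fields, and validation rule are illustrative assumptions, not the benchmarks' actual schemas.

```python
# Sketch of a unified, fixed action space shared by LLM and LMM agents;
# the action names and fields are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    kind: str                           # "tap", "swipe", "type", "long_press", "home", "back", "finish"
    element_index: Optional[int] = None  # SoM index or XML node id for UI-dependent actions
    text: Optional[str] = None           # payload for "type"
    direction: Optional[str] = None      # "up"/"down"/"left"/"right" for "swipe"


UI_DEPENDENT = {"tap", "swipe", "type", "long_press"}
UI_INDEPENDENT = {"home", "back", "finish"}


def validate(action: Action) -> bool:
    """UI-dependent actions must target an on-screen element or carry a payload."""
    if action.kind in UI_DEPENDENT:
        return action.element_index is not None or action.direction is not None
    return action.kind in UI_INDEPENDENT
```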
3. Evaluation Protocols and Performance Metrics
Success metrics are rigorously defined and decomposed for granular analysis (a minimal scoring sketch follows the list):
- Task Success Rate (SR): Fraction of completed tasks, $\mathrm{SR} = N_{\text{success}} / N_{\text{total}}$
- Sub-Goal Success Rate: Fine-grained evaluation, breaking tasks into subgoals, matching device XML tree states post-action
- Reversed Redundancy Ratio (RRR): Ratio of the human demonstration length to the agent's operation path length, $\mathrm{RRR} = L_{\text{human}} / L_{\text{agent}}$
- Reasonable Operation Ratio (ROR): Percentage of actions resulting in effective, non-redundant screen changes
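The following sketch illustrates how these per-episode metrics could be computed from a logged trajectory, assuming a hypothetical Step record with a screen_changed flag; it is not a benchmark-defined implementation.

```python
# Sketch of per-episode metric computation; the trajectory schema is an
# illustrative assumption, not defined by AndroidWorld or AndroidLab.
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    screen_changed: bool   # did the action produce an effective state change?


def success_rate(successes: List[bool]) -> float:
    # SR: fraction of tasks completed successfully.
    return sum(successes) / len(successes)


def reversed_redundancy_ratio(human_steps: int, agent_steps: int) -> float:
    # RRR: human demonstration length relative to the agent's path length.
    return human_steps / max(agent_steps, 1)


def reasonable_operation_ratio(steps: List[Step]) -> float:
    # ROR: fraction of actions yielding an effective, non-redundant change.
    return sum(s.screen_changed for s in steps) / max(len(steps), 1)
```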
Robustness is a paramount concern: AndroidWorld employs multiple random seeds and parameter instances per task, with success rate measured over large sample sets. Agents are tested for intra-task performance sensitivity and multi-step error propagation (Rawles et al., 23 May 2024).
Recent benchmarks feature advanced metrics, such as Semi-Online Performance (SOP) (Lu et al., 15 Sep 2025):
$$\mathrm{SOP} = \frac{1}{N}\sum_{i=1}^{N} \frac{s_i}{n_i},$$
where $s_i$ is the number of successful steps in task $i$ and $n_i$ is the total number of steps. SOP correlates strongly with true online execution performance.
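Under the reconstructed definition above, SOP can be computed per task and averaged, as in this brief sketch (the step counts are assumed inputs):

```python
# Sketch of Semi-Online Performance under the definition above:
# average per-task fraction of successful steps.
from typing import List, Tuple


def semi_online_performance(tasks: List[Tuple[int, int]]) -> float:
    """tasks: list of (successful_steps, total_steps) per task."""
    return sum(s / max(n, 1) for s, n in tasks) / len(tasks)


# Example: three tasks with varying step-level success.
print(semi_online_performance([(4, 5), (2, 6), (7, 7)]))  # ≈ 0.71
```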
4. Agent Architectures, Training Paradigms, and SOTA Performance
Agents span LLMs, LMMs, reasoning-enabled VLMs, and multi-agent frameworks:
- Verifier-driven agents (V-Droid): Employ LLMs as verifiers for candidate actions within a discretized space. Pairwise process preference (P³) training emphasizes contrastive discrimination, yielding low latency (0.7 s/step) and strong success rates (59.5% on AndroidWorld, 38.3% on AndroidLab) (Dai et al., 20 Mar 2025). Action selection is formalized as $a^{*} = \arg\max_{a \in \mathcal{A}} v(a)$, where $v(a)$ is the verification score.
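A minimal sketch of verifier-driven selection under this formalization follows; the verifier_score callable is a placeholder assumption rather than V-Droid's actual interface.

```python
# Sketch of verifier-driven action selection: score each candidate action with
# an LLM-based verifier and take the argmax. `verifier_score` is a placeholder.
from typing import Callable, List


def select_action(candidates: List[str],
                  verifier_score: Callable[[str], float]) -> str:
    """Return argmax_a v(a) over the discretized candidate set."""
    return max(candidates, key=verifier_score)


# Usage: candidates would be enumerated from the current accessibility tree.
# best = select_action(["tap(3)", "type(5, 'hello')", "back()"], my_verifier)
```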
- Reasoning-enabled VLMs: Chain-of-thought models (e.g., Claude 3.7 Sonnet + reasoning) attain new SOTA on AndroidWorld (64.7%), but the effect is model-specific and can incur higher resource requirements. Surprisingly, reasoning is not universally beneficial across static and interactive benchmarks, and the failure sets of reasoning and non-reasoning models are disjoint (Zhang et al., 21 Mar 2025).
- Hierarchical Reflection Agents (MobileUse): Incorporate three-layer reflection (action, trajectory, global) and proactive exploration. Reflection-on-demand selectively triggers reflection based on mean token log-probabilities. Performance reaches 62.9% (AndroidWorld) and 44.2% (AndroidLab) (Li et al., 21 Jul 2025).
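A brief sketch of such a reflection-on-demand trigger, assuming an illustrative confidence threshold:

```python
# Sketch of reflection-on-demand: trigger a reflection pass only when the
# model's mean token log-probability for the proposed action is low.
# The threshold value is an illustrative assumption.
import math
from typing import List

LOGPROB_THRESHOLD = math.log(0.5)   # assumed confidence cut-off


def needs_reflection(token_logprobs: List[float]) -> bool:
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return mean_lp < LOGPROB_THRESHOLD
```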
- Semi-online RL agents (UI-S1): Simulate online RL via off-policy rollouts and patching mechanisms, optimizing a composite reward built from discounted future returns and hierarchical advantages; this enhances multi-turn performance (a 12% improvement on AndroidWorld) (Lu et al., 15 Sep 2025).
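The discounted-return component of such a composite reward follows the standard recursion $R_t = r_t + \gamma R_{t+1}$; the sketch below assumes an illustrative $\gamma$ and reward sequence, not UI-S1's exact formulation.

```python
# Sketch of discounted future returns over an off-policy rollout,
# R_t = r_t + gamma * R_{t+1}; gamma and the rewards are illustrative.
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.95) -> List[float]:
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```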
- Multi-agent systems (ColorAgent): Integrate reinforcement learning, self-evolving training, knowledge retrieval, hierarchical reflection, and collaborative orchestration to achieve 77.2% (AndroidWorld) and 50.7% (AndroidLab) (Li et al., 22 Oct 2025).
All leading approaches leverage complex interaction histories, error recovery, and either verification or explicit reasoning strategies.
5. Data Scarcity, Cross-domain Generalization, and Training Recipes
Data scarcity for GUI trajectories remains a limiting factor. AndroidGen employs in-context retrieval (ExpSearch), step-wise planning (ReflectPlan), automatic action verification (AutoCheck), and fine-grained scoring (StepCritic) to exploit unlabeled or self-generated trajectories (Lai et al., 27 Apr 2025). The augmentation algorithm decomposes incomplete tasks into new training examples, enabling continual improvement.
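A hedged sketch of this style of augmentation is given below: every prefix of an incomplete trajectory that ends at a verified sub-goal is retained as a new training example. The function and type names are assumptions, not AndroidGen's interface.

```python
# Hedged sketch of trajectory augmentation: keep each prefix of an incomplete
# trajectory that ends at a verified sub-goal as a new training example.
# Names are illustrative assumptions, not AndroidGen's actual interface.
from typing import Callable, List, Tuple

Step = Tuple[str, str]   # (observation, action) pair


def augment(trajectory: List[Step],
            subgoal_reached: Callable[[int], bool]) -> List[List[Step]]:
    """Retain every prefix ending at a verified sub-goal (e.g., via AutoCheck-style checks)."""
    return [trajectory[:i] for i in range(1, len(trajectory) + 1) if subgoal_reached(i)]
```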
Cross-modal and cross-domain generalization are effective: pre-training on non-GUI, reasoning-intensive data (e.g., mathematical QA, chart understanding) substantially boosts GUI agent performance. For instance, text-only mathematical data lifted AndroidWorld SR by 5.4%, and multimodal math by 6.3% (Zhang et al., 14 Apr 2025). Mixing sparse GUI trajectory samples during mid-training mitigates catastrophic forgetting and smooths adaptation.
GUI perception data, while intuitively aligned with the target domain, has a comparatively minor impact relative to reasoning-intensive data. Optimized mixture datasets yield up to a 12.2% absolute improvement on AndroidWorld.
6. Limitations, Controversies, and Future Evaluation
Despite rapid advances, several limitations persist:
- Benchmarks (AndroidWorld, AndroidLab) presently cover a subset of real-world mobile environments, exhibiting a bias toward atomic or semi-scripted tasks, and may insufficiently capture rare events or composite workflows (Li et al., 22 Oct 2025).
- Overfitting remains a concern for larger models, with demonstrated tradeoffs between training reward and real-world generalization.
- Chain-of-thought reasoning, while powerful in specific models and interactive testbeds, engenders significant resource (token) overhead and inconsistent performance, underscoring the need for dynamic invocation policies and tailored fine-tuning (Zhang et al., 21 Mar 2025).
- SOP, while highly correlated with true online success, serves as a proxy but does not substitute for live online deployment.
Current research trajectories advocate for:
- Expanded multidimensional evaluation paradigms encompassing user intent, adaptability, and safety.
- Multi-agent collaboration frameworks capable of orchestrating complex multi-step and concurrent tasks.
- Enhanced security measures, including granular permission controls, sandboxed execution, and proactive anomaly detection.
7. Applications Beyond Standard Evaluation: Hardware, Security Analysis, and Microbenchmark Decomposition
Both AndroidWorld and AndroidLab facilitate diverse research outputs beyond GUI agent development:
- Hardware benchmarking: AI Benchmark and AIoTBench suites quantify device inference ability over multiple frameworks, employing metrics such as Valid Images Per Second (VIPS) and Valid FLOPs Per Second (VOPS) (Ignatov et al., 2018, Ignatov et al., 2019, Luo et al., 2020).
- Security analysis: API usage and vulnerability coverage methodologies from BenchPress are generalizable to AndroidWorld/AndroidLab for tool evaluation and suite extension (Mitra et al., 2019).
- Microbenchmark decomposition: Regression-based analysis can profile benchmark scores on these platforms, offering insight into system bottlenecks, architectural biases, and resource contributions through nonnegative least squares models that express each compound benchmark score as a nonnegative linear combination of microbenchmark scores for distinct operations (Matagawa et al., 2016); see the sketch below.
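A minimal sketch of such a decomposition using SciPy's nonnegative least squares solver; the matrix of microbenchmark scores and the observed compound scores are synthetic illustrations.

```python
# Sketch of microbenchmark decomposition with nonnegative least squares:
# model compound benchmark scores as a nonnegative combination of
# microbenchmark scores. All data below is synthetic illustration.
import numpy as np
from scipy.optimize import nnls

# Rows: compound benchmarks; columns: microbenchmarks (e.g. int, fp, memory, I/O).
M = np.array([
    [1.0, 0.2, 0.5, 0.1],
    [0.3, 1.1, 0.2, 0.4],
    [0.6, 0.5, 0.9, 0.2],
    [0.2, 0.3, 0.4, 1.0],
])
scores = np.array([2.1, 1.8, 2.4, 1.6])   # observed compound scores

weights, residual = nnls(M, scores)        # weights >= 0 by construction
print(weights)                             # contribution of each microbenchmark
```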
Collectively, these benchmarks provide a reproducible, extensible foundation for measuring the capabilities of mobile platforms, agents, and systems.
AndroidWorld and AndroidLab have thus emerged as keystones for rigorous, reproducible evaluation of mobile agents and systems, spanning algorithmic, operational, hardware, and security analysis dimensions. Their methodological innovations, dynamic task construction, multimodal evaluation, and evolving performance metrics continue to inform and challenge state-of-the-art research in autonomous Android interaction.