
DroidTask Benchmark: Evaluating Mobile-UI Automation

Updated 31 January 2026
  • DroidTask Benchmark is a dynamic evaluation suite featuring 116 reproducibly instantiated tasks spanning 20 diverse Android apps.
  • It employs parameterized natural-language specifications and systematic state management (initialize, get_state, is_successful, teardown) for robust mobile automation testing.
  • Evaluation protocols based on state-derived rewards reveal significant performance gaps between current agents and human operators.

The DroidTask Benchmark, implemented within the AndroidWorld framework, is a large-scale, dynamic evaluation suite for autonomous agents tasked with controlling real-world Android applications through their user interfaces. Encompassing 116 programmatic tasks drawn from 20 diverse Android apps, it represents the most comprehensive mobile-UI control benchmark to date. Its design is distinguished by parameterized, natural-language task specifications, systematic state initialization and teardown, reproducibility guarantees, and reward assignment grounded directly in system state. This enables rigorous, scalable, and realistic assessments of agent capabilities in the domain of mobile automation (Rawles et al., 2024).

1. Benchmark Scope and Task Construction

DroidTask comprises 116 tasks that span a representative selection of Android application domains, including productivity (Simple Calendar Pro, Joplin, Markor), system utilities (Settings, Contacts, Files), and media applications (VLC, Retro Music). Each task is specified by a natural-language template with placeholders for arguments (e.g., “In Simple Calendar Pro, create a calendar event on {year}-{month}-{day}…”). At run-time, each episode’s task is instantiated by sampling arguments from controlled random distributions, resulting in a combinatorially rich range of goal configurations.
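Parameterized instantiation of the kind described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual API: the template text follows the example in the text, but the helper names, the sampled fields, and the distributions are assumptions.

```python
import random
import string

# Hypothetical task template in the style of the benchmark's example;
# field names and sampling ranges below are illustrative assumptions.
TEMPLATE = (
    "In Simple Calendar Pro, create a calendar event on "
    "{year}-{month:02d}-{day:02d} titled '{title}'."
)

def sample_task_params(rng: random.Random) -> dict:
    """Draw task arguments from controlled random distributions."""
    return {
        "year": 2023,                        # emulator date is fixed to 2023
        "month": rng.randint(1, 12),
        "day": rng.randint(1, 28),           # avoid invalid month/day pairs
        "title": "".join(rng.choices(string.ascii_lowercase, k=8)),
    }

def instantiate_task(seed: int) -> str:
    """Each episode derives a fresh but reproducible goal string from its seed."""
    rng = random.Random(seed)
    return TEMPLATE.format(**sample_task_params(rng))
```

Because each episode is keyed to a seed, the same seed reproduces the same goal exactly, while varying the seed yields the combinatorially rich range of configurations the benchmark relies on.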

Unlike benchmarks employing a fixed, static test set, DroidTask dynamically generates new task instances and emulator states for each episode. This allows agents to encounter effectively unbounded task variations across millions of runs, enforcing generalization, robust perception, and instruction following. Task complexity ranges from elementary (e.g., changing a system setting) to composite sequences requiring multi-step reasoning and environment manipulation (e.g., finding, editing, and saving items under distractor conditions).

2. Environment Design and Reproducibility Mechanisms

AndroidWorld’s environment architecture underpins reproducibility and reliability. Each task implements four hermetic methods:

  • initialize(): Resets the emulator (Pixel 6, Android 13, date fixed to October 15, 2023) to a clean snapshot and programmatically modifies the system state (e.g., via SQLite, file system, or adb) to encode noisy distractors and ground-truth targets.
  • get_state(): At every agent step, captures full-resolution (2,400 × 1,080) screenshots and a live accessibility tree, pausing for transient UI elements to stabilize, providing multimodal perceptual input.
  • is_successful(): After task completion (TERMINATE action or step budget exhausted), queries system state to determine success, inspecting SQLite tables, file presence, or settings as required. Rewards are usually binary (1 for success, 0 for failure); for composite goals, fractional rewards are used, e.g., r = (file_exists + message_exists) / 2.
  • teardown(): Cleans up side effects to ensure that all episodes begin from the identical initial state.
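The four-method interface above can be sketched as an abstract base class. This is a minimal illustration under stated assumptions: the class names, the environment methods (reset_snapshot, file_exists, sms_sent, and so on), and the file path are hypothetical, not AndroidWorld's actual classes.

```python
from abc import ABC, abstractmethod

class Task(ABC):
    """Sketch of the hermetic task interface described in the text."""

    @abstractmethod
    def initialize(self, env) -> None:
        """Reset the emulator snapshot and seed state (SQLite, files, adb)."""

    @abstractmethod
    def get_state(self, env) -> dict:
        """Return screenshot + accessibility tree after the UI settles."""

    @abstractmethod
    def is_successful(self, env) -> float:
        """Inspect durable system state; return a reward in [0, 1]."""

    def teardown(self, env) -> None:
        """Undo side effects so every episode starts from the same state."""

class SaveFileAndSendMessage(Task):
    """Hypothetical composite goal with a fractional reward, as in the text."""

    def initialize(self, env) -> None:
        env.reset_snapshot()

    def get_state(self, env) -> dict:
        return {"screenshot": env.screenshot(), "a11y_tree": env.a11y_tree()}

    def is_successful(self, env) -> float:
        # Reward derived purely from system state, not pixels or UI text.
        file_exists = float(env.file_exists("/sdcard/notes/todo.txt"))
        message_exists = float(env.sms_sent("555-0100"))
        return (file_exists + message_exists) / 2.0
```

Because is_successful() reads durable state rather than the screen, the reward is invariant to superficial UI changes, which is exactly the decoupling the next paragraph describes.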

This design decouples reward signals from brittle UI pixel-matching or text heuristics, conferring invariance to superficial changes and robustness to diverse task parameterizations.

3. Evaluation Protocols and Metrics

The dominant metric is Success Rate, defined as

Success Rate = (number of tasks completed) / 116

An agent’s performance is aggregated across tasks, with 95% Wilson confidence intervals reported. This evaluation strategy avoids overfitting to particular goal instantiations or screen layouts and, by supporting thousands of random seeds, directly quantifies both mean performance and variance under realistic conditions.
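The 95% Wilson confidence interval mentioned above has a closed form and is straightforward to compute. A minimal sketch, assuming binary per-task outcomes over n = 116 tasks (the example counts in the usage note are illustrative):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial success rate."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)
```

For example, wilson_interval(35, 116) brackets an observed rate of roughly 30%; unlike the normal approximation, the Wilson interval stays within [0, 1] even at extreme rates such as 0/116 or 116/116.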

Pertinent experimental findings reveal that even the strongest agent—M3A using GPT-4 Turbo with accessibility tree input—succeeds on only 30.6% of DroidTask tasks, whereas a human operator achieves 80%. On the synthetic MobileMiniWoB++ set, agent performance reaches upwards of 67%, but this does not translate to more complex and variable real-world applications.

Table 1. Performance on AndroidWorld and MobileMiniWoB++

Agent (Input, Model)                  AndroidWorld SR   MobileMiniWoB++ SR
Human (screen, —)                     80.0%             100.0%
M3A (a11y tree, GPT-4 Turbo)          30.6%             59.7%
M3A (a11y tree, Gemini 1.5 Pro)       19.4%             57.4%
M3A (SoM + a11y, GPT-4 Turbo)         25.4%             67.7%
M3A (SoM + a11y, Gemini 1.5 Pro)      22.8%             40.3%
SeeAct (SoM + a11y, GPT-4 Turbo)      15.5%             66.1%

4. Baseline Agents and Robustness

The primary reference agent, M3A (a multimodal “reason-and-act” system), ingests either the accessibility tree alone or a Set-of-Mark-annotated screenshot combined with the accessibility tree (“SoM + a11y”), leveraging LLMs (GPT-4 Turbo, Gemini 1.5 Pro) for inference. For comparative purposes, the SeeAct web agent was ported to the Android environment, but it exhibited lower effectiveness on mobile than on web benchmarks.

Systematic evaluation over numerous random seeds exposes high variance—certain parameterizations of semantically identical tasks can cause dramatic swings in agent success rates. Even holding task and state constant, stochasticity in LLM sampling produces run-to-run variability. Analysis of fixed versus variable seed conditions demonstrates both the difficulty and the necessity of averaging over large populations of task instantiations to obtain representative measures of agent competency.

5. Diagnostic Insights and Identified Challenges

Empirical results pinpoint several failure modes: agents exhibit difficulty with robust perception (e.g., detecting non-standard UI widgets like checkboxes and sliders), precise input grounding (such as pre-pending text to an existing field), and hierarchical reasoning sequences (for example, looped item deletion versus batch operations). Furthermore, agents validated on synthetic or browser environments do not transfer effectively to mobile settings, implicating the need for explicit adaptation to Android-specific interaction paradigms (gesture controls, native widgets, system APIs, state fluctuations).

Task robustness analysis (e.g., across “AddExpense,” “EditNote,” “DeleteFile”) reveals that measures based on static seeds dramatically understate true environmental variance. Statistically significant differences (p<0.05) between fixed and randomized seeds for success rates further reinforce the inadequacy of limited, deterministic test sets for fair measurement.
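A fixed- versus randomized-seed comparison of this kind reduces to a two-proportion significance test. The sketch below implements a pooled two-proportion z-test from scratch; the success counts in the usage note are illustrative placeholders, not the paper's data.

```python
import math

def two_proportion_z(s1: int, n1: int, s2: int, n2: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test with a pooled variance estimate.

    Returns (z statistic, two-sided p-value). Suitable for comparing a
    fixed-seed success rate s1/n1 against a randomized-seed rate s2/n2.
    """
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value via the standard normal CDF, expressed with erf.
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, pval
```

With hypothetical counts such as 70/100 successes under a fixed seed versus 50/100 under randomized seeds, the test yields p < 0.05, i.e., a statistically significant gap of the kind the analysis above reports.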

6. Significance and Research Directions

DroidTask’s scale, reproducibility, and dynamic construction mechanisms position it as a demanding and high-fidelity arena for mobile agent development and evaluation. By deriving reward signals exclusively from durable system state, the benchmark avoids the confounds of pixel or string-matching, establishing a reproducible testbed for UI automation grounded in genuine task completion.

The environment’s dynamic goal sampling highlights the need for research into online reinforcement learning, continual learning, and curriculum generation within mobile contexts—capabilities not yet demonstrated by current agents. The significant domain gap between desktop/synthetic and mobile environments, in conjunction with the breadth of DroidTask’s parameterization, suggests that future universal agents will require robust cross-platform reasoning, enhanced perception, and interaction dexterity (Rawles et al., 2024). A plausible implication is that platform-specific architectural or pretraining strategies will become increasingly necessary to close the performance gap between algorithmic and human operators.
