
AutoDroid: LLM Android Automation

Updated 2 March 2026
  • AutoDroid is an LLM-powered Android automation framework that executes natural language commands on any app without manual intervention.
  • It employs a two-phase process—offline UI exploration with memory synthesis and online task execution—to dynamically map app states to actions.
  • Evaluations demonstrate notable gains in action accuracy and task success by integrating memory injection and query optimization strategies.

AutoDroid is an LLM-powered Android automation framework that enables arbitrary natural language tasks to be executed on any Android application without manual developer intervention. The system leverages the commonsense reasoning and language understanding capacity of LLMs, augmented with domain-specific application knowledge captured through automated UI exploration. By dynamically bridging app-specific functionality with LLM inference and optimizing prompting with memory injection and query condensation, AutoDroid achieves high task completion rates and low per-action error in complex, unseen application contexts (Wen et al., 2023).

1. System Overview and Workflow

AutoDroid operates in two phases, a preparatory offline exploration phase and an online task execution phase, which are bridged by a persistent, vectorized application memory.

Offline Exploration and Memory Synthesis:

  • The system launches a UI exploration agent to collect a UI Transition Graph (UTG), $G = (V, E)$, where $V$ denotes observed UI states and $E$ the actions driving state transitions.
  • For each widget $e$ in a state $u \in V$, an LLM is prompted to summarize the widget's "functionality" as a simulated subtask (e.g., ‘Add contact’, ‘Open calendar’).
  • The memory module stores tuples $\langle f(e), \text{path}(u), u, e \rangle$, with $f(e)$ the function summary.
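
The memory synthesis step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the edge format, the `MemoryEntry` class, and the `summarize` callable (a stand-in for the LLM call) are all hypothetical, and `path(u)` is assumed here to be the shortest widget sequence from the start state to `u`.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical UTG edge format: (source_state, widget, target_state).
Edge = Tuple[str, str, str]


@dataclass(frozen=True)
class MemoryEntry:
    summary: str            # f(e): function summary of the widget
    path: Tuple[str, ...]   # path(u): widgets to tap to reach state u (assumed meaning)
    state: str              # u
    widget: str             # e


def shortest_paths(edges: List[Edge], start: str) -> Dict[str, Tuple[str, ...]]:
    """Breadth-first search over the UTG; maps each reachable state to a widget sequence."""
    adjacency: Dict[str, List[Tuple[str, str]]] = {}
    for src, widget, dst in edges:
        adjacency.setdefault(src, []).append((widget, dst))
    paths: Dict[str, Tuple[str, ...]] = {start: ()}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for widget, nxt in adjacency.get(state, []):
            if nxt not in paths:
                paths[nxt] = paths[state] + (widget,)
                queue.append(nxt)
    return paths


def build_memory(edges: List[Edge], start: str,
                 summarize: Callable[[str, str], str]) -> List[MemoryEntry]:
    """Create one memory tuple per explored widget whose state is reachable."""
    paths = shortest_paths(edges, start)
    return [
        MemoryEntry(summarize(src, widget), paths[src], src, widget)
        for src, widget, _dst in edges
        if src in paths
    ]


# Toy UTG and a stub summarizer standing in for the LLM prompt.
edges = [
    ("home", "contacts_tab", "contacts"),
    ("contacts", "add_button", "new_contact"),
    ("home", "settings_icon", "settings"),
]
memory = build_memory(edges, "home", lambda state, widget: f"use {widget} on {state}")
```

In this toy graph, the entry for `add_button` records the path `("contacts_tab",)`, which is the information a later shortcut or hint would need to reach that widget.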

Online Task Execution:

  • The user issues a task $T$ as a natural language command.
  • The prompt generator serializes the current UI state $U$ into a compact HTML-style representation $D(U)$ and identifies the $k$ most relevant simulated tasks in memory (via cosine similarity in a sentence embedding space).
  • The prompt injects memory-derived hints as "onclick" annotations for candidate widgets.
  • The selected LLM is prompted (schema-enforced output) and returns a structured action tuple, which is executed through the Android Accessibility API. The UI is then re-observed, and the steps repeat until the task completes (Wen et al., 2023).
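
The observe, prompt, act loop described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the JSON reply format, the action vocabulary in `ALLOWED_ACTIONS`, and the `observe`, `act`, and `ask_llm` callables (which stand in for the accessibility layer and the model) are hypothetical, not AutoDroid's actual interfaces.

```python
import json
from typing import Callable, Dict, List, Set

ALLOWED_ACTIONS = {"tap", "input", "scroll", "finish"}  # assumed action vocabulary


def parse_action(raw: str, valid_ids: Set[int]) -> Dict:
    """Validate the model's reply against a small, assumed schema."""
    action = json.loads(raw)
    if action.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action.get('action')!r}")
    if action["action"] != "finish" and action.get("id") not in valid_ids:
        raise ValueError(f"unknown widget id: {action.get('id')!r}")
    return action


def run_task(task: str,
             observe: Callable[[], Dict[int, str]],
             act: Callable[[Dict], None],
             ask_llm: Callable[[str], str],
             max_steps: int = 10) -> List[Dict]:
    """Repeat observe -> prompt -> validate -> act until the model says 'finish'."""
    history: List[Dict] = []
    for _ in range(max_steps):
        ui = observe()  # widget id -> visible label
        prompt = f"Task: {task}\nUI: {ui}\nHistory: {history}"
        action = parse_action(ask_llm(prompt), set(ui))
        if action["action"] == "finish":
            break
        act(action)
        history.append(action)
    return history


# Fake two-screen app and a scripted "LLM" so the loop can run offline.
screens = {"home": {0: "Contacts"}, "contacts": {1: "Add contact"}}
current = ["home"]


def observe() -> Dict[int, str]:
    return screens[current[0]]


def act(action: Dict) -> None:
    if action["id"] == 0:
        current[0] = "contacts"


replies = iter([
    '{"action": "tap", "id": 0}',
    '{"action": "tap", "id": 1}',
    '{"action": "finish"}',
])
history = run_task("Add a contact", observe, act, lambda prompt: next(replies))
```

Rejecting replies that name an unknown action or widget id before execution is one simple way to keep structured output mapped to real UI elements.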

2. Functionality-Aware UI Representation

The core UI representation, $D(U)$, maps the Android GUI tree $U$ to a sequence of HTML-like tags, each capturing:

  • The tag type (e.g., "button", "input"),
  • A unique identifier,
  • Visible textual label,
  • A set of properties including onclick annotations derived from memory matches.

For an interactive widget $e$, $D(U)$ includes one such tag. When a top-$k$ memory hint is available for $e$, the prompt appends it to that tag as an onclick annotation.

Schema enforcement ensures LLM outputs remain structured, mapping directly to UI actions and supporting automatic downstream execution (Wen et al., 2023).
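
As a rough illustration of the representation described in this section, the sketch below serializes a list of widgets into HTML-like tags and attaches an onclick annotation where a memory hint exists. The `Widget` class and the exact attribute layout are assumptions; the source does not preserve the original tag template.

```python
from dataclasses import dataclass
from html import escape
from typing import Dict, List


@dataclass
class Widget:
    tag: str   # e.g. "button" or "input"
    text: str  # visible label


def serialize(widgets: List[Widget], hints: Dict[int, str]) -> str:
    """Render one HTML-like tag per widget; ids are list positions (an assumption)."""
    lines = []
    for idx, widget in enumerate(widgets):
        attrs = f"id={idx}"
        hint = hints.get(idx)
        if hint:
            # Memory-derived hint injected as an onclick annotation.
            attrs += f" onclick='{escape(hint)}'"
        lines.append(f"<{widget.tag} {attrs}>{escape(widget.text)}</{widget.tag}>")
    return "\n".join(lines)


ui = [Widget("button", "Contacts"), Widget("input", "Search")]
prompt_ui = serialize(ui, {0: "open the contact list"})
print(prompt_ui)
```

Escaping labels and hints keeps user-visible text such as `<` or quotes from breaking the tag structure that the model is asked to read.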

3. Exploration-Based Memory Injection

Memory injection addresses the deficiency of LLMs in capturing app-specific transition semantics:

  • Offline sampling explores the full UTG of each app, using an LLM to produce functionality summaries for each widget, which are embedded with a sentence encoder and stored in a vector index.
  • Online, given user task $T$, the agent retrieves the top-$k$ memory hints, ranked by cosine similarity between the sentence embedding of $T$ and the embeddings of the stored function summaries $f(e)$.
  • Relevant hints are injected directly into $D(U)$, giving the LLM context about likely state transitions and thereby reducing hallucinated or irrelevant actions (Wen et al., 2023).
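
The retrieval step can be sketched with plain cosine similarity. To stay self-contained, the example below swaps the sentence encoder for a toy bag-of-words `embed` function; that substitution, and the function names, are assumptions made only for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(task: str, summaries: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k function summaries most similar to the task, best first."""
    query = embed(task)
    scored = [(cosine(query, embed(summary)), summary) for summary in summaries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


summaries = ["add a new contact", "open the calendar", "delete a contact"]
hints = top_k("add a contact named Alice", summaries, k=2)
```

With this toy encoder, the two contact-related summaries rank above the calendar one, which shares no tokens with the task.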

4. Multi-Granularity Query Optimization

AutoDroid deploys several techniques to reduce inference cost and latency:

  • Invisible Filtering prunes non-interactive, hidden, or container nodes;
  • Semantic Merging collapses redundant node actions achieving the identical successor state in the UTG;
  • GUI Merging concatenates visible panels under scrollable containers, lowering multi-step tokenization cost;
  • Shortcut Navigation uses highly relevant memory snippets (those whose similarity to the task exceeds a threshold) to bypass LLM invocation and directly execute pre-cached action sequences.

This composite approach halves average prompt size, cuts LLM call count by 13.7%, and reduces latency significantly (Wen et al., 2023).
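
Three of the techniques above (invisible filtering, semantic merging, and shortcut navigation) can be illustrated with the minimal sketch below. The `Node` fields, the `successor` map derived from the UTG, and the threshold value used in the example are all hypothetical choices, not values taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: int
    label: str
    visible: bool
    interactive: bool


def filter_invisible(nodes: List[Node]) -> List[Node]:
    """Drop hidden nodes and non-interactive containers."""
    return [n for n in nodes if n.visible and n.interactive]


def merge_by_successor(nodes: List[Node], successor: Dict[int, str]) -> List[Node]:
    """Keep one node per known successor state; nodes with unknown successors are kept."""
    seen = set()
    kept = []
    for node in nodes:
        target = successor.get(node.node_id)
        if target is not None:
            if target in seen:
                continue  # another node already leads to the same state
            seen.add(target)
        kept.append(node)
    return kept


def shortcut(score: float, cached_actions: List[str],
             threshold: float) -> Optional[List[str]]:
    """Return the cached action sequence when the memory match is strong enough."""
    return list(cached_actions) if score >= threshold else None


nodes = [
    Node(0, "Row title", True, True),
    Node(1, "Row icon", True, True),
    Node(2, "Hidden menu", False, True),
    Node(3, "Layout", True, False),
    Node(4, "Settings", True, True),
]
successor = {0: "detail", 1: "detail", 4: "settings"}  # from the UTG (toy data)
compact = merge_by_successor(filter_invisible(nodes), successor)
plan = shortcut(0.95, ["tap:0", "tap:4"], threshold=0.9)  # 0.9 is an arbitrary example
```

Here five nodes shrink to two prompt entries, and the strong memory match returns a cached plan so no model call is needed for that step.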

5. Model Integration and Deployment Options

AutoDroid supports both cloud-based and on-device deployments:

  • Cloud integration leverages GPT-4 or GPT-3.5, typically with EE6 for prompt decoding;
  • On-device execution uses fine-tuned Vicuna-7B and an MLC-LLM backend. Action accuracy for on-device Vicuna improves from 22% (vanilla) to 57.7% after pruning and zero-shot chain-of-thought (CoT) fine-tuning.
  • The model-agnostic architecture allows seamless switching between local and remote LLMs depending on privacy/security constraints and system latency targets (Wen et al., 2023).
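
One way to picture the model-agnostic design is a single prompt-in, reply-out interface with a small selection policy in front of it. The policy below (prefer the local model for sensitive screens or when the cloud round trip exceeds the latency budget) is purely hypothetical, as are all names; the source does not describe how the choice is made.

```python
from typing import Callable, Dict, Tuple

Backend = Callable[[str], str]  # prompt in, model reply out


def pick_backend(sensitive: bool, cloud_latency_ms: float,
                 latency_budget_ms: float) -> str:
    """Hypothetical policy: stay on-device for private data or slow networks."""
    if sensitive:
        return "local"
    if cloud_latency_ms > latency_budget_ms:
        return "local"
    return "cloud"


def complete(prompt: str, backends: Dict[str, Backend], sensitive: bool,
             cloud_latency_ms: float, latency_budget_ms: float) -> Tuple[str, str]:
    """Route the prompt to the chosen backend and report which one was used."""
    name = pick_backend(sensitive, cloud_latency_ms, latency_budget_ms)
    return name, backends[name](prompt)


# Stub backends standing in for an on-device model and a cloud API.
backends = {
    "local": lambda prompt: "local-reply",
    "cloud": lambda prompt: "cloud-reply",
}
used, reply = complete("Task: open settings", backends,
                       sensitive=True, cloud_latency_ms=200, latency_budget_ms=1000)
```

Because both backends share one callable signature, the rest of the pipeline does not need to know which model produced the reply.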

6. Evaluation Methodology and Empirical Performance

Evaluation uses the DroidTask benchmark, comprising 158 tasks across 13 open-source applications:

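  • Each task is annotated with a ground-truth sequence of UI states and corresponding actions.
  • Metrics:
    • Action accuracy: the fraction of steps at which the predicted action matches the ground-truth action.
    • Task completion: the fraction of tasks in which every action in the sequence is correct.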
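
A small worked example of the two metrics follows, assuming that action accuracy is the fraction of ground-truth steps whose predicted action matches and that task completion is the fraction of tasks with every step correct. The string encoding of actions is hypothetical.

```python
from typing import List, Sequence


def action_accuracy(predicted: List[Sequence[str]], truth: List[Sequence[str]]) -> float:
    """Fraction of ground-truth steps whose predicted action matches."""
    correct = 0
    total = 0
    for pred, gold in zip(predicted, truth):
        total += len(gold)
        correct += sum(1 for p, g in zip(pred, gold) if p == g)
    return correct / total


def task_completion(predicted: List[Sequence[str]], truth: List[Sequence[str]]) -> float:
    """Fraction of tasks whose whole predicted sequence equals the ground truth."""
    done = sum(1 for pred, gold in zip(predicted, truth) if list(pred) == list(gold))
    return done / len(truth)


truth = [["tap:0", "tap:1"], ["tap:2", "input:3", "tap:4"]]
predicted = [["tap:0", "tap:1"], ["tap:2", "input:3", "tap:9"]]

acc = action_accuracy(predicted, truth)    # 4 of 5 steps match
done = task_completion(predicted, truth)   # 1 of 2 tasks fully correct
```

The example shows why the two numbers diverge: a single wrong step lowers action accuracy only slightly but fails the whole task.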

Key Results:

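| Configuration              | Action Accuracy | Task Success Rate |
|----------------------------|-----------------|-------------------|
| GPT-4 + AutoDroid          | 90.9%           | 71.3%             |
| GPT-4 Baseline (no memory) | 54.5%           | 31.6%             |
| Vicuna-7B + AutoDroid      | 57.7%           | 41.1%             |
| GPT-3.5 + AutoDroid        | 65.1%           | 47.9%             |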

Ablation studies demonstrate that memory injection increases task completion by 17% with GPT-4 and 25% with Vicuna. Zero-shot CoT fine-tuning yields a 5× improvement for on-device models (Wen et al., 2023).

7. Limitations and Prospects

AutoDroid's primary limitations include:

  • Coverage gaps: Random UI exploration may miss deeply nested or stateful interactions, leading to incomplete memory for rare control flows.
  • LLM hallucinations: Despite structured prompting and annotation, LLMs can propose non-existent widget actions when exposed to ambiguous screens.
  • Latency/Token Cost in Long Tasks: Step-wise inference for complex tasks introduces noticeable delay and expense.
  • Privacy Risks: Prompt construction potentially exposes sensitive information if UI metadata includes user data; fine-grained redaction remains an open challenge.

Proposed directions include coverage-driven exploration (e.g., RL-based), GUI validation/symbolic verification layers to suppress hallucinated outputs, hybrid local/cloud LLM scheduling for dynamic latency/cost-privacy trade-offs, and advanced planning to reuse partial sub-plans (Wen et al., 2023).


AutoDroid combines automated dynamic analysis with LLM-based reasoning and uses a structured vector memory to support zero-shot execution of arbitrary app tasks (Wen et al., 2023).

References (1)