AutoDroid: LLM Android Automation
- AutoDroid is an LLM-powered Android automation framework that executes natural language commands on any app without manual intervention.
- It employs a two-phase process—offline UI exploration with memory synthesis and online task execution—to dynamically map app states to actions.
- Evaluations demonstrate notable gains in action accuracy and task success by integrating memory injection and query optimization strategies.
AutoDroid is an LLM-powered Android automation framework that enables arbitrary natural language tasks to be executed on any Android application without manual developer intervention. The system leverages the commonsense reasoning and language understanding capacity of LLMs, augmented with domain-specific application knowledge captured through automated UI exploration. By dynamically bridging app-specific functionality with LLM inference and optimizing prompting with memory injection and query condensation, AutoDroid achieves high task completion rates and low per-action error in complex, unseen application contexts (Wen et al., 2023).
1. System Overview and Workflow
AutoDroid operates in two phases, a preparatory offline exploration phase and an online task execution phase, which are bridged by a persistent, vectorized application memory.
Offline Exploration and Memory Synthesis:
- The system launches a UI exploration agent to collect a UI Transition Graph (UTG), whose nodes are the observed UI states and whose edges are the actions driving state transitions.
- For each widget in a state, an LLM is prompted to summarize the widget's "functionality" as a simulated subtask (e.g., ‘Add contact’, ‘Open calendar’).
- The memory module stores (state, widget, function summary) tuples.
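The offline phase can be pictured with a minimal sketch, assuming the UTG is available as a list of edges. The names `Transition`, `MemoryRecord`, `build_memory`, and `stub_summarize` are hypothetical and are not taken from AutoDroid; the LLM summarization call is replaced by a stub so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Transition:
    """One UTG edge: acting on `widget` in `source` leads to `target`."""
    source: str  # identifier of the UI state where the action is taken
    widget: str  # identifier of the widget acted upon
    target: str  # identifier of the resulting UI state


@dataclass(frozen=True)
class MemoryRecord:
    """One memory entry: a (state, widget, function summary) tuple."""
    state: str
    widget: str
    summary: str


def build_memory(
    transitions: Iterable[Transition],
    summarize: Callable[[Transition], str],
) -> list[MemoryRecord]:
    """Create one record per distinct (state, widget) pair in the UTG."""
    seen: set[tuple[str, str]] = set()
    records: list[MemoryRecord] = []
    for t in transitions:
        key = (t.source, t.widget)
        if key in seen:
            continue  # the same widget was already summarized
        seen.add(key)
        records.append(MemoryRecord(t.source, t.widget, summarize(t)))
    return records


def stub_summarize(t: Transition) -> str:
    # Assumption: stands in for the LLM prompt that names the widget's function.
    return f"Open {t.target}"


utg = [
    Transition("home", "btn_contacts", "contacts"),
    Transition("contacts", "btn_add", "new_contact"),
    Transition("home", "btn_contacts", "contacts"),  # duplicate edge
]
memory = build_memory(utg, stub_summarize)
```

In a real pipeline each summary would also be embedded and written to a vector index; that step is left out here.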
Online Task Execution:
- The user issues a task as a natural language command.
- The prompt generator serializes the current UI state into a compact HTML-style representation and identifies the k most relevant simulated tasks in memory (via cosine similarity in a sentence embedding space).
- The prompt injects memory-derived hints as "onclick" annotations for candidate widgets.
- The selected LLM is prompted with a schema-enforced output format and returns an action tuple, which is executed through the Android Accessibility API. The UI is then re-observed, and the steps repeat until the task completes (Wen et al., 2023).
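One way to read "schema-enforced output" is that the model's reply is parsed and validated before anything is executed. The sketch below assumes a JSON reply with the fields `id`, `action`, and `input`, and an action vocabulary of `tap`, `input`, and `scroll`; these field names and the helper `parse_action` are illustrative assumptions, not the format AutoDroid actually uses.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumption: a small, fixed action vocabulary chosen for illustration.
ALLOWED_ACTIONS = {"tap", "input", "scroll"}


@dataclass(frozen=True)
class Action:
    widget_id: int
    action: str
    text: Optional[str] = None


def parse_action(raw: str, visible_ids: set[int]) -> Action:
    """Validate a JSON reply such as {"id": 3, "action": "tap"}."""
    data = json.loads(raw)
    widget_id = data.get("id")
    action = data.get("action")
    if widget_id not in visible_ids:
        # Rejects replies that point at a widget not on the current screen.
        raise ValueError(f"unknown widget id: {widget_id}")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    text = data.get("input")
    if action == "input" and not text:
        raise ValueError("input action requires text")
    return Action(widget_id, action, text)


step = parse_action(
    '{"id": 2, "action": "input", "input": "Alice"}',
    visible_ids={1, 2, 3},
)
```

Rejecting a reply before execution lets the caller re-prompt instead of sending an invalid action to the device.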
2. Functionality-Aware UI Representation
The core UI representation maps the Android GUI tree to an ordered sequence of HTML-like tags, each capturing:
- The tag type (e.g., "button", "input"),
- A unique identifier,
- Visible textual label,
- A set of properties including onclick annotations derived from memory matches.
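For an interactive widget, the representation is a single HTML-like element carrying the fields above. When a top-k memory hint is available for that widget, the prompt appends it to the element as an onclick annotation.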
Schema enforcement ensures LLM outputs remain structured, mapping directly to UI actions and supporting automatic downstream execution (Wen et al., 2023).
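To make the representation concrete, here is a minimal sketch of a serializer that emits one HTML-like element per widget and attaches a memory hint as an `onclick` attribute when one is available. The `Widget` class, the helper names, and the exact attribute layout are assumptions made for illustration; the precise tag format used by AutoDroid may differ.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass(frozen=True)
class Widget:
    tag: str                    # e.g. "button" or "input"
    widget_id: int              # unique identifier within the screen
    label: str                  # visible text
    hint: Optional[str] = None  # memory-derived functionality summary, if any


def to_html(widget: Widget) -> str:
    """Render one widget as an HTML-like element."""
    attrs = f'id="{widget.widget_id}"'
    if widget.hint:
        # The hint is escaped so quotes in a summary cannot break the tag.
        attrs += f' onclick="{escape(widget.hint, quote=True)}"'
    return f"<{widget.tag} {attrs}>{escape(widget.label)}</{widget.tag}>"


def serialize_screen(widgets: list[Widget]) -> str:
    """Render a whole screen, one element per line, in UI order."""
    return "\n".join(to_html(w) for w in widgets)


screen = serialize_screen([
    Widget("button", 1, "New", hint="Add contact"),
    Widget("input", 2, "Search"),
])
```

With these inputs, `screen` holds two lines: a button whose `onclick` reads "Add contact" and an input with no annotation.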
3. Exploration-Based Memory Injection
Memory injection addresses the deficiency of LLMs in capturing app-specific transition semantics:
- Offline sampling explores the full UTG of each app, using an LLM to produce functionality summaries for each widget, which are embedded with a sentence encoder and stored in a vector index.
- Online, given the user task, the agent retrieves the top-k memory hints ranked by cosine similarity between the sentence embedding of the task and the embeddings of the stored summaries.
- Relevant hints are injected directly into the prompt, informing LLM inference with context about likely state transitions, thus greatly reducing hallucinated or irrelevant actions (Wen et al., 2023).
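The retrieval step can be sketched as a plain top-k search by cosine similarity. The example below uses hand-written three-dimensional vectors in place of real sentence embeddings, and the function names `cosine` and `top_k` are illustrative; a production system would use a sentence encoder and a vector index instead of a linear scan.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    index: list[tuple[str, Sequence[float]]],
    k: int,
) -> list[tuple[str, float]]:
    """Return the k (summary, score) pairs most similar to the query."""
    scored = [(summary, cosine(query, emb)) for summary, emb in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings: assumptions standing in for sentence-encoder output.
index = [
    ("Add contact", [1.0, 0.0, 0.0]),
    ("Open calendar", [0.0, 1.0, 0.0]),
    ("Edit contact", [0.8, 0.0, 0.6]),
]
task_embedding = [1.0, 0.0, 0.0]
hints = top_k(task_embedding, index, k=2)
```

Here the two contact-related summaries rank above "Open calendar", whose toy vector is orthogonal to the task vector.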
4. Multi-Granularity Query Optimization
AutoDroid deploys several techniques to reduce inference cost and latency:
- Invisible Filtering prunes non-interactive, hidden, or container nodes;
- Semantic Merging collapses redundant node actions achieving the identical successor state in the UTG;
- GUI Merging concatenates visible panels under scrollable containers, lowering multi-step tokenization cost;
- Shortcut Navigation leverages highly relevant memory snippets to bypass LLM invocation and directly execute pre-cached action sequences.
This composite approach halves average prompt size, cuts LLM call count by 13.7%, and reduces latency significantly (Wen et al., 2023).
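Two of these techniques, invisible filtering and semantic merging, can be sketched as simple list transformations. The `Node` fields and both function names are hypothetical, and the sketch assumes container nodes are already flagged as non-interactive and that the successor state of a node is known from the UTG where available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    node_id: int
    label: str
    visible: bool
    interactive: bool
    successor: Optional[str] = None  # UTG state reached by acting on the node


def filter_invisible(nodes: list[Node]) -> list[Node]:
    """Drop hidden nodes and nodes the user cannot act on."""
    return [n for n in nodes if n.visible and n.interactive]


def merge_by_successor(nodes: list[Node]) -> list[Node]:
    """Keep the first node per known successor state.

    Nodes whose successor is unknown are always kept, since nothing
    shows that they are redundant.
    """
    seen: set[str] = set()
    kept: list[Node] = []
    for n in nodes:
        if n.successor is not None:
            if n.successor in seen:
                continue
            seen.add(n.successor)
        kept.append(n)
    return kept


nodes = [
    Node(1, "Layout", visible=True, interactive=False),
    Node(2, "Settings", visible=True, interactive=True, successor="settings"),
    Node(3, "Gear icon", visible=True, interactive=True, successor="settings"),
    Node(4, "Debug", visible=False, interactive=True, successor="debug"),
    Node(5, "Search", visible=True, interactive=True),
]
pruned = merge_by_successor(filter_invisible(nodes))
```

Of the five nodes, only "Settings" and "Search" remain, so the serialized prompt shrinks accordingly.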
5. Model Integration and Deployment Options
AutoDroid supports both cloud-based and on-device deployments:
- Cloud integration leverages GPT-4 or GPT-3.5;
- On-device execution uses fine-tuned Vicuna-7B and an MLC-LLM backend. Action accuracy for on-device Vicuna improves from 22% (vanilla) to 57.7% after pruning and zero-shot chain-of-thought (CoT) fine-tuning.
- The model-agnostic architecture allows seamless switching between local and remote LLMs depending on privacy/security constraints and system latency targets (Wen et al., 2023).
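A model-agnostic design of this kind could be approximated by hiding each model behind a common callable and choosing a backend per request. The sketch below is an assumption about how such switching might look, not AutoDroid's actual scheduler: `Backend`, `pick_backend`, and the rule "sensitive prompts stay on-device" are all hypothetical, and both model calls are stubs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Backend:
    name: str
    on_device: bool
    complete: Callable[[str], str]  # prompt -> raw model reply


def pick_backend(backends: list[Backend], sensitive: bool) -> Backend:
    """Choose a backend for one prompt.

    Sensitive prompts are only sent to an on-device backend; otherwise
    the first backend in the caller's preference order is used.
    """
    if not backends:
        raise LookupError("no backend configured")
    if sensitive:
        for backend in backends:
            if backend.on_device:
                return backend
        raise LookupError("no on-device backend for a sensitive prompt")
    return backends[0]


# Stub backends: assumptions standing in for real model clients.
cloud = Backend("cloud-stub", False, lambda prompt: '{"id": 1, "action": "tap"}')
local = Backend("local-stub", True, lambda prompt: '{"id": 1, "action": "tap"}')

chosen = pick_backend([cloud, local], sensitive=True)
reply = chosen.complete("<button id=\"1\">OK</button>")
```

Raising an error when no on-device backend exists, instead of silently falling back to the cloud, keeps the privacy rule explicit.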
6. Evaluation Methodology and Empirical Performance
Evaluation uses the DroidTask benchmark, comprising 158 tasks across 13 open-source applications:
- Each task is annotated with a ground-truth sequence of UI states and corresponding actions.
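- Metrics:
  - Action accuracy: the fraction of steps at which the predicted action matches the ground-truth action for the given UI state.
  - Task completion: the fraction of tasks for which every action in the sequence is predicted correctly.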
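The two metrics can be illustrated with a short sketch, assuming action accuracy is the share of steps whose predicted action equals the ground-truth action and task completion is the share of tasks in which every step is correct. The function names and the string encoding of actions are illustrative assumptions.

```python
def action_accuracy(
    predicted: list[list[str]],
    ground_truth: list[list[str]],
) -> float:
    """Share of ground-truth steps whose predicted action matches."""
    total = sum(len(gt) for gt in ground_truth)
    correct = sum(
        p == g
        for pred, gt in zip(predicted, ground_truth)
        for p, g in zip(pred, gt)
    )
    return correct / total if total else 0.0


def completion_rate(
    predicted: list[list[str]],
    ground_truth: list[list[str]],
) -> float:
    """Share of tasks whose whole predicted sequence matches."""
    if not ground_truth:
        return 0.0
    done = sum(pred == gt for pred, gt in zip(predicted, ground_truth))
    return done / len(ground_truth)


# Two toy tasks: the first is fully correct, the second fails at its last step.
truth = [["tap:1", "input:2", "tap:3"], ["tap:4", "tap:5"]]
preds = [["tap:1", "input:2", "tap:3"], ["tap:4", "tap:9"]]

accuracy = action_accuracy(preds, truth)    # 4 of 5 steps match
completion = completion_rate(preds, truth)  # 1 of 2 tasks fully match
```

The example shows why the two numbers diverge: a single wrong step costs little action accuracy but fails the whole task.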
Key Results:
| Configuration | Action Accuracy | Task Success Rate |
|---|---|---|
| GPT-4 + AutoDroid | 90.9% | 71.3% |
| GPT-4 Baseline (no memory) | 54.5% | 31.6% |
| Vicuna-7B + AutoDroid | 57.7% | 41.1% |
| GPT-3.5 + AutoDroid | 65.1% | 47.9% |
Ablation studies demonstrate that memory injection increases task completion by 17% with GPT-4 and 25% with Vicuna. Zero-shot CoT fine-tuning yields a 5× improvement for on-device models (Wen et al., 2023).
7. Limitations and Prospects
AutoDroid's primary limitations include:
- Coverage gaps: Random UI exploration may miss deeply nested or stateful interactions, leading to incomplete memory for rare control flows.
- LLM hallucinations: Despite structured prompting and annotation, LLMs can propose non-existent widget actions when exposed to ambiguous screens.
- Latency/Token Cost in Long Tasks: Step-wise inference for complex tasks introduces noticeable delay and expense.
- Privacy Risks: Prompt construction potentially exposes sensitive information if UI metadata includes user data; fine-grained redaction remains an open challenge.
Proposed directions include coverage-driven exploration (e.g., RL-based), GUI validation/symbolic verification layers to suppress hallucinated outputs, hybrid local/cloud LLM scheduling for dynamic latency/cost-privacy trade-offs, and advanced planning to reuse partial sub-plans (Wen et al., 2023).
AutoDroid combines automated dynamic analysis with LLM-based reasoning and a structured vector memory of app-specific knowledge to support zero-shot execution of diverse, arbitrary app tasks (Wen et al., 2023).