AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents (2410.24024v2)

Published 31 Oct 2024 in cs.AI

Abstract: Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have been recently a frequently-mentioned interaction method. However, existing studies for training and evaluating Android agents lack systematic research on both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, action space, and a reproducible benchmark. It supports both LLMs and multimodal models (LMMs) in the same action space. AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. By using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.

AndroidLab provides a systematic framework for the development, training, and evaluation of autonomous agents designed to interact with Android environments (Xu et al., 31 Oct 2024). It addresses the lack of standardized tooling and benchmarks for comparing both open-source and closed-source models, particularly LLMs and Large Multimodal Models (LMMs), operating within the Android ecosystem. The framework comprises an operational environment, a defined action space, a reproducible benchmark suite, and a specialized dataset for instruction tuning.

AndroidLab Environment

The core of AndroidLab is its operational environment, designed to facilitate agent interaction with Android Virtual Devices (AVDs). This environment provides agents with multimodal observational data and accepts actions within a standardized space.

Modalities and Observation Space

The environment exposes multiple modalities so that the agent can perceive the device state comprehensively. These include:

  1. Screen Pixels: Raw visual information from the device screen.
  2. View Hierarchy: Structural information of the UI elements currently displayed, typically represented as an XML-like structure. This provides crucial context about interactable elements, their properties (e.g., resource ID, text content, bounding box), and their relationships.
  3. Task Description: Natural language instructions specifying the goal the agent needs to achieve.

This multimodal input stream allows agents, particularly LMMs, to leverage both visual and structural information for decision-making.
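
As an illustration, the sketch below shows one way such an observation could be assembled from a running emulator over ADB. The `Observation` container and helper function are assumptions for illustration, not AndroidLab's actual API; only the underlying `adb`/`uiautomator` commands are standard Android tooling.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Observation:
    """Illustrative container for one step of agent input (assumed structure)."""
    screenshot_png: bytes   # raw screen pixels
    view_hierarchy: str     # XML dump of the current UI tree
    instruction: str        # natural-language task description

def capture_observation(device: str, instruction: str) -> Observation:
    # Screen pixels: `screencap -p` writes a PNG to stdout.
    png = subprocess.run(
        ["adb", "-s", device, "exec-out", "screencap", "-p"],
        check=True, capture_output=True).stdout
    # View hierarchy: `uiautomator dump` writes an XML file on the device.
    subprocess.run(
        ["adb", "-s", device, "shell", "uiautomator", "dump", "/sdcard/window_dump.xml"],
        check=True, capture_output=True)
    xml = subprocess.run(
        ["adb", "-s", device, "shell", "cat", "/sdcard/window_dump.xml"],
        check=True, capture_output=True, text=True).stdout
    return Observation(screenshot_png=png, view_hierarchy=xml, instruction=instruction)
```

In such a setup, an LLM-only agent could consume the XML view hierarchy as its textual observation, while an LMM would additionally receive the screenshot.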

Action Space

A key contribution is the unified action space designed to be compatible with both LLMs and LMMs. This space abstracts common user interactions on Android devices. The specific actions supported include operations like tapping on UI elements identified by their properties (e.g., text content, resource ID, or coordinates derived from the view hierarchy or visual grounding), inputting text into fields, swiping, and system-level actions (e.g., pressing the back or home button). Representing actions in a structured format (e.g., JSON or function calls) allows different model architectures to generate executable commands within the same framework. The design aims for reproducibility and simplifies the process of adapting diverse models to the Android interaction task.
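
A minimal sketch of such a structured action interface is given below. The JSON schema (`name`/`arguments` keys) and the dispatcher are illustrative assumptions; the `adb shell input` commands used to realize them are standard Android tooling, not AndroidLab's exact implementation.

```python
import json
import subprocess

def execute_action(device: str, action: dict) -> None:
    """Dispatch one structured action to the emulator via `adb shell input`.
    The action schema (name/arguments keys) is an illustrative assumption."""
    name, args = action["name"], action.get("arguments", {})
    if name == "tap":
        cmd = ["input", "tap", str(args["x"]), str(args["y"])]
    elif name == "type":
        # `input text` cannot take literal spaces; it expects %s instead.
        cmd = ["input", "text", args["text"].replace(" ", "%s")]
    elif name == "swipe":
        cmd = ["input", "swipe", str(args["x1"]), str(args["y1"]),
               str(args["x2"]), str(args["y2"])]
    elif name == "back":
        cmd = ["input", "keyevent", "KEYCODE_BACK"]
    elif name == "home":
        cmd = ["input", "keyevent", "KEYCODE_HOME"]
    else:
        raise ValueError(f"Unsupported action: {name}")
    subprocess.run(["adb", "-s", device, "shell", *cmd], check=True)

# A model could emit actions as JSON text, which the harness parses and executes:
action = json.loads('{"name": "tap", "arguments": {"x": 540, "y": 1200}}')
execute_action("emulator-5554", action)
```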

Android Virtual Devices (AVDs)

AndroidLab utilizes pre-configured AVDs as the execution backend. This ensures a controlled and reproducible environment. The framework includes setup scripts and configurations for these AVDs, which host the applications used in the benchmark tasks. Using AVDs allows for parallel execution and isolation, facilitating large-scale experimentation and benchmarking.
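
For example, several headless emulator instances can be launched from the same pre-configured image on distinct ports, each exposing an isolated `adb` serial; the AVD name below is a placeholder, not AndroidLab's actual image name.

```python
import subprocess

def launch_avd(avd_name: str, port: int) -> subprocess.Popen:
    """Start a headless emulator instance; each port yields a distinct
    adb serial of the form emulator-<port>."""
    return subprocess.Popen([
        "emulator", "-avd", avd_name,
        "-port", str(port),
        "-no-window", "-no-audio",
    ])

# Hypothetical example: run three isolated copies of the same benchmark image in parallel.
workers = [launch_avd("androidlab_benchmark", port) for port in (5554, 5556, 5558)]
```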

AndroidLab Benchmark

The framework includes a reproducible benchmark consisting of 138 tasks distributed across nine common Android applications. These applications are pre-installed on the provided AVD configurations.

Task Design

The tasks are designed to cover a range of typical user interactions and complexities, from simple navigation and information retrieval to more complex multi-step operations involving data entry and manipulation. Examples include tasks like "Send an email with subject X and body Y," "Set a reminder for time T," or "Find directions from location A to location B." Each task is defined by a natural language instruction.
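
A hypothetical benchmark task entry might look like the following; the field names and the success-check mechanism are assumptions used only for illustration.

```python
# Hypothetical shape of one benchmark task entry; field names are assumptions.
task = {
    "task_id": "calendar_001",
    "app": "Calendar",
    "instruction": "Set a reminder titled 'Team sync' for 3 pm tomorrow.",
    "max_steps": 25,                  # step budget before the episode is counted as failed
    "success_check": "event_exists",  # key into a table of per-task evaluation functions
}
```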

Evaluation Metrics

The primary evaluation metric is the Success Rate (SR), which measures the percentage of tasks completed successfully by the agent. Task completion is typically determined by checking if the final state of the application or device matches the state expected upon successful execution of the given instruction. The benchmark infrastructure provides mechanisms for automated evaluation based on predefined success criteria for each task.
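
The sketch below illustrates one simple way such an automated final-state check could be implemented by inspecting the UI tree after the episode ends; AndroidLab's actual checkers are task-specific and may instead query app databases or files.

```python
import subprocess

def task_succeeded(device: str, expected_fragment: str) -> bool:
    """Illustrative final-state check: dump the UI tree after the episode and
    verify that text expected on success (e.g., the created event's title)
    is present."""
    subprocess.run(
        ["adb", "-s", device, "shell", "uiautomator", "dump", "/sdcard/final.xml"],
        check=True, capture_output=True)
    xml = subprocess.run(
        ["adb", "-s", device, "shell", "cat", "/sdcard/final.xml"],
        check=True, capture_output=True, text=True).stdout
    return expected_fragment in xml

# Success Rate over a task suite is then the fraction of passing episodes:
# success_rate = 100.0 * sum(results) / len(results)
```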

Agent Training and Dataset

Recognizing the performance gap of existing models on Android interaction tasks, AndroidLab facilitates agent training through a custom dataset.

Android Instruction Dataset

An "Android Instruction" dataset was curated using the AndroidLab environment. This dataset comprises trajectories of interactions, where each step includes the multimodal observation (screen, view hierarchy), the natural language instruction, and the corresponding ground-truth action taken to progress towards the task goal. This dataset is specifically designed for instruction-tuning LLMs and LMMs to improve their ability to map instructions and device states to appropriate actions within the Android environment.

Training Methodology

The paper demonstrates the effectiveness of this dataset by fine-tuning several open-source models. Six models (both LLMs and LMMs) were trained using the Android Instruction dataset. The training objective is typically to maximize the likelihood of predicting the correct action given the instruction and the current state observation. Standard supervised fine-tuning techniques are employed.
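
A minimal sketch of such a supervised fine-tuning step is shown below, using a small generic causal language model as a stand-in: the serialized observation and instruction form the prompt, the ground-truth action string is the target, and prompt tokens are masked out of the loss. This illustrates the standard objective only; it is not the paper's training code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a small stand-in model for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Instruction: ...\nObservation (view hierarchy): ...\nAction:"
target = ' {"name": "tap", "arguments": {"x": 540, "y": 1200}}'

prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
full_ids = tokenizer(prompt + target, return_tensors="pt").input_ids
labels = full_ids.clone()
labels[:, :prompt_len] = -100  # only the action tokens contribute to the loss

loss = model(input_ids=full_ids, labels=labels).loss  # next-token cross-entropy
loss.backward()  # one gradient step of standard supervised fine-tuning
```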

Experimental Results and Findings

Systematic benchmarking was performed using the AndroidLab framework, evaluating both pre-trained and fine-tuned models.

Baseline Performance

Initial evaluations revealed low success rates for pre-trained, off-the-shelf models on the benchmark tasks. Average success rates were reported as 4.59% for the evaluated LLMs and 1.93% for the LMMs. This highlights the challenge of applying general-purpose models directly to complex, goal-oriented interaction tasks within the Android GUI paradigm without specific adaptation.

Post-Training Performance Improvement

Significant performance improvements were observed after fine-tuning the open-source models on the Android Instruction dataset.

  • The average success rate for the fine-tuned LLMs increased from 4.59% to 21.50%.
  • The average success rate for the fine-tuned LMMs increased from 1.93% to 13.28%.
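
In relative terms, fine-tuning yields roughly a 4.7× improvement for LLMs (21.50 / 4.59 ≈ 4.7) and a 6.9× improvement for LMMs (13.28 / 1.93 ≈ 6.9).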

These results demonstrate the efficacy of the curated dataset and the fine-tuning process within the AndroidLab framework for enhancing agent capabilities in Android interaction. Although the LMMs started from a lower baseline, their larger relative improvement suggests untapped potential; better exploiting visual cues alongside structural information may require further optimization or architectural adaptation. The framework also supports evaluating closed-source models (e.g., via APIs), enabling broader comparisons, although the paper focuses primarily on the gains achieved by fine-tuning open-source models.

Conclusion

AndroidLab makes a valuable contribution by providing an open-source, systematic framework and benchmark for Android autonomous agents (Xu et al., 31 Oct 2024). Its standardized environment, unified action space, diverse task suite, and curated training dataset facilitate reproducible research and development in this area. The significant improvement in model performance after fine-tuning underscores the importance of specialized training data and methodologies for enabling effective agent interaction with mobile GUIs. The framework serves as a robust platform for future research on Android agents, supporting the evaluation and comparison of a wide range of AI models.

Authors (10)
  1. Yifan Xu
  2. Xiao Liu
  3. Xueqiao Sun
  4. Siyi Cheng
  5. Hao Yu
  6. Hanyu Lai
  7. Shudan Zhang
  8. Dan Zhang
  9. Jie Tang
  10. Yuxiao Dong