
AndroidGen: Building an Android Language Agent under Data Scarcity (2504.19298v1)

Published 27 Apr 2025 in cs.CL

Abstract: LLMs have opened up a world of possibilities for various NLP tasks, sparking optimism for the future. Despite their potential, LLMs have yet to be widely used as agents on real mobile devices. The main challenge is the need for high-quality data sources. Time constraints and labor intensity often hinder human annotation. On the other hand, existing LLMs exhibit inadequate completion rates and need a robust data filtration strategy. Given these challenges, we develop a framework called AndroidGen to enhance the capabilities of LLM-based agents under data scarcity. In addition, we leverage AndroidGen to collect trajectories given human tasks and train open-source LLMs on these trajectories to develop an open-source mobile agent without manually labeled trajectories. We extensively evaluate AndroidGen with AndroidWorld, AitW, and various popular applications, demonstrating its improvements and revealing potential areas for future improvement. Code, model, and data are available at https://github.com/THUDM/AndroidGen.

Authors (7)
  1. Hanyu Lai (11 papers)
  2. Junjie Gao (14 papers)
  3. Xiao Liu (402 papers)
  4. Yifan Xu (92 papers)
  5. Shudan Zhang (7 papers)
  6. Yuxiao Dong (119 papers)
  7. Jie Tang (302 papers)

Summary

AndroidGen: Building an Android Language Agent under Data Scarcity

The paper "AndroidGen: Building an Android Language Agent under Data Scarcity" by authors from Tsinghua University and Zhipu AI focuses on the development of AndroidGen, a framework designed to enhance the capabilities of LLM-based agents in mobile environments where high-quality training data is scarce. The research addresses the challenges faced by LLMs in real-world applications on Android devices, highlighting the barriers to generalization and operational accuracy due to data scarcity.

Framework Components

AndroidGen consists of four primary modules that collectively aim to improve data collection, execution, and evaluation processes:

  1. ExpSearch: This module employs in-context learning, allowing LLMs to enhance their performance by analyzing trajectories of successfully completed tasks. This approach facilitates the generalization of agent capabilities from simpler to more complex tasks, leveraging previously successful trajectories as examples.
  2. ReflectPlan: By enabling self-reflection, ReflectPlan assists the agent in updating its strategies based on the current environment and execution history, thus reinforcing long-term reasoning abilities.
  3. AutoCheck: This module serves as a proactive verification tool, checking each operation for potential errors before execution, thereby mitigating task failure risks due to incorrect operations.
  4. StepCritic: Providing a detailed evaluation of trajectories, StepCritic divides tasks into sub-goals and assesses them step-by-step. This granular evaluation aids in constructing high-quality datasets crucial for robust model training.
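Read together, the four modules form a single agent loop: retrieve exemplar trajectories, re-plan against the execution history, verify each operation before acting, and score the finished trajectory against sub-goals. The minimal sketch below illustrates that loop; all class names, interfaces, and the toy word-overlap similarity are hypothetical stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)  # (observation, action) pairs
    success: bool = False


class ExpSearch:
    """Retrieve successful trajectories for similar tasks as in-context examples."""

    def __init__(self):
        self.memory: list[Trajectory] = []

    def add(self, traj: Trajectory) -> None:
        # Only successful trajectories are kept as exemplars.
        if traj.success:
            self.memory.append(traj)

    def retrieve(self, task: str, k: int = 2) -> list[Trajectory]:
        # Toy similarity: count of shared words between task descriptions.
        scored = sorted(
            self.memory,
            key=lambda t: -len(set(t.task.split()) & set(task.split())),
        )
        return scored[:k]


def reflect_plan(plan: list[str], history: list) -> list[str]:
    """ReflectPlan-style update: drop sub-goals already covered by the history."""
    return plan[len(history):]


def auto_check(action: str, valid_actions: set[str]) -> bool:
    """AutoCheck-style guard: verify an operation is well-formed before executing."""
    return action.split("(")[0] in valid_actions


def step_critic(traj: Trajectory, sub_goals: list[str]) -> float:
    """StepCritic-style score: fraction of sub-goals the trajectory's steps achieve."""
    achieved = sum(
        1 for goal in sub_goals
        if any(goal in obs for obs, _ in traj.steps)
    )
    return achieved / max(len(sub_goals), 1)
```

In this sketch the per-goal `step_critic` score is what later makes trajectory filtering possible: a scalar in [0, 1] rather than a binary pass/fail gives a threshold to tune when building the training set.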

Data Collection and Training

The authors leverage the AndroidGen framework to build a pipeline that generates extensive training data without manual annotation. StepCritic filters and augments the collected trajectories, which are then used to fine-tune open-source LLMs such as GLM-4-9B and Llama-3-70B, yielding an open-source Android agent trained entirely on synthesized Android navigation trajectories.
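The pipeline's filtering step can be sketched as follows. This is a hypothetical illustration of the idea, not the paper's code: `run_agent`, the `step_critic` callable, and the 0.8 threshold are all assumed interfaces, with StepCritic's score serving as the sole acceptance criterion in place of human labels.

```python
def collect_training_data(tasks, run_agent, step_critic, threshold=0.8):
    """Filter agent rollouts into a fine-tuning set without human labels.

    run_agent(task) -> (trajectory, sub_goals); step_critic returns a
    score in [0, 1] measuring how many sub-goals the trajectory achieved.
    """
    dataset = []
    for task in tasks:
        trajectory, sub_goals = run_agent(task)
        score = step_critic(trajectory, sub_goals)
        if score >= threshold:  # StepCritic acts as the data filter
            dataset.append(
                {"task": task, "trajectory": trajectory, "score": score}
            )
    return dataset
```

The design choice worth noting is that the same critic used at evaluation time doubles as the data-quality gate, so no separate labeling model is needed.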

Empirical Evaluation

The paper comprehensively evaluates AndroidGen against several benchmarks, including AndroidWorld, AitW, and popular applications, demonstrating the framework's improvements over existing systems. The results indicate that AndroidGen achieves notable advancements in reasoning capabilities and generalization abilities, outperforming baseline models in various test environments. For example, on AndroidWorld benchmarks, AndroidGen shows an increase in success rates across tasks of varying difficulty levels when compared to current solutions like M3A and SeeAct.

Implications and Future Directions

The research suggests that integrating LLMs as agents on Android devices can be significantly enhanced by employing strategic learning frameworks like AndroidGen. The implications extend beyond technical improvements to potential cost reductions in data collection and annotation. Moreover, the paper indicates future research directions focusing on refining algorithmic structures for better operational efficiency and exploring adaptive planning for complex environments.

Overall, the paper contributes to the growing body of literature that seeks to bridge the gap between LLM capabilities and practical applications on mobile platforms. It paves the way for developing more accessible, efficient, and competent mobile agents capable of autonomously handling diverse user tasks.
