Overview of "Android in the Wild: A Large-Scale Dataset for Android Device Control"
The paper "Android in the Wild: A Large-Scale Dataset for Android Device Control" introduces the Android in the Wild (AitW) dataset, one of the most extensive datasets developed for research in device-control systems based on interpreting human natural language for digital device manipulation.
Dataset Composition and Characteristics
The dataset comprises 715,000 episodes spanning 30,000 unique instructions, collected across four Android versions (v10–13) and diverse device types. It includes interactions with both apps and web content, distinguishing it from prior datasets in both scale and variety. It also advances previous efforts by providing a pixel-based representation of the screen, so actions must be inferred from visual elements rather than from structured UI metadata such as the view hierarchy.
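To make the data layout concrete, the following is a minimal Python sketch of what one episode might look like. The field names and types are illustrative assumptions for exposition, not the released schema, which should be taken from the official repository.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative (hypothetical) schema for one step of an AitW episode.
# Field names here are assumptions; the released feature names differ.

@dataclass
class Step:
    screenshot_png: bytes            # raw pixels; no view-hierarchy metadata
    action_type: str                 # e.g. "dual_point", "type", "press_back"
    touch_yx: Tuple[float, float]    # normalized touch-down point (gestures)
    lift_yx: Tuple[float, float]     # normalized lift-off point (gestures)
    typed_text: str                  # populated only for typing actions

@dataclass
class Episode:
    instruction: str                 # natural-language goal, e.g. "turn on wifi"
    android_version: int             # 10 through 13
    steps: List[Step]
```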
Research Implications
Device Control System Challenges: The AitW dataset is designed to challenge device-control systems with complex, multi-step tasks that require semantic interpretation of both language and visual context. It places significant emphasis on gesture-based interactions with the UI, expanding the action space beyond element-level UI actions to precise-location gestures such as taps and swipes (see the sketch below).
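As a concrete illustration of this gesture action space, the sketch below classifies a dual-point gesture (a normalized touch-down point and lift-off point) as a tap or a directional swipe. The 0.04 tap threshold and the (y, x) coordinate order are assumptions for exposition, not values taken from the paper.

```python
import math

# A minimal sketch: a dual-point gesture is a tap when the touch and lift
# points nearly coincide, otherwise a swipe in the dominant direction.
# The 0.04 threshold (normalized screen coordinates) is an assumption.

def classify_gesture(touch_yx, lift_yx, tap_threshold=0.04):
    if math.dist(touch_yx, lift_yx) <= tap_threshold:
        return "tap"
    dy = lift_yx[0] - touch_yx[0]   # y grows downward on screen
    dx = lift_yx[1] - touch_yx[1]
    if abs(dy) >= abs(dx):
        return "swipe_down" if dy > 0 else "swipe_up"
    return "swipe_right" if dx > 0 else "swipe_left"

print(classify_gesture((0.50, 0.50), (0.51, 0.50)))  # tap
print(classify_gesture((0.80, 0.50), (0.20, 0.50)))  # swipe_up
```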
Generalizability: The dataset is organized to support rigorous analysis of how systems perform on new task descriptions, applications, and Android versions. By holding out instructions containing unseen subjects and verbs, as well as unseen Android versions and domains, the splits let researchers probe the generalization capabilities of AI models (an illustrative split appears below).
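As an illustration, an "unseen verb" split might be constructed along the following lines. The first-token verb heuristic and the split logic are assumptions for exposition, not the authors' exact procedure; the Episode type is the sketch from above.

```python
# Hedged sketch of building an unseen-verb generalization split.

def first_verb(instruction: str) -> str:
    # Naive heuristic: treat the first token as the verb
    # ("open", "search", "install", ...).
    return instruction.lower().split()[0]

def unseen_verb_split(episodes, holdout_verbs):
    train, test = [], []
    for ep in episodes:
        (test if first_verb(ep.instruction) in holdout_verbs else train).append(ep)
    return train, test

# Usage (hypothetical verbs):
# train, test = unseen_verb_split(all_episodes, holdout_verbs={"uninstall", "forward"})
```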
Methodology and Validation
From a methodological perspective, the authors employed a two-stage pipeline for data collection, increasing episode diversity through task randomization and relying on human raters for careful annotation. Raters executed tasks under varying conditions to emulate realistic device-interaction scenarios. The open-source AndroidEnv platform complements the dataset for testing and development.
The paper also benchmarks two agents: a behavioral cloning (BC) agent trained on the demonstrations and an LLM-based agent. Notably, the BC agent achieved superior results, particularly when evaluated on several out-of-distribution (OOD) splits. Evaluation relied on partial and complete action-matching metrics, augmented by human validation on specific subsets to confirm that the offline assessments track real task success.
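The sketch below shows one plausible shape for these metrics: partial action matching scores the fraction of predicted steps that match the reference demonstration, while complete matching requires every step to match. The `actions_match` rule and its 0.14 distance tolerance are simplifying assumptions standing in for the paper's full matching criteria.

```python
import math

# Hedged sketch of partial and complete action matching against a
# reference (human) demonstration. The real rule also accounts for
# matching detected UI elements; this version checks only action type
# and, for gestures, spatial proximity of the touch point.

def actions_match(pred, ref, tap_tolerance=0.14) -> bool:
    if pred.action_type != ref.action_type:
        return False
    if pred.action_type == "dual_point":
        return math.dist(pred.touch_yx, ref.touch_yx) <= tap_tolerance
    return True

def score_episode(pred_actions, ref_actions):
    matches = [actions_match(p, r) for p, r in zip(pred_actions, ref_actions)]
    partial = sum(matches) / len(ref_actions)  # fraction of correct steps
    complete = float(all(matches) and len(pred_actions) == len(ref_actions))
    return partial, complete
```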
Experimental Findings
The experiments demonstrate robust performance under standard conditions, with the behavioral cloning agent generalizing notably well to unseen language instructions and tasks. The LLM-based agents, while effective, were constrained by their limited ability to produce arbitrary <x,y> gesture actions, underscoring the potential of multimodal models for future research.
Conclusions and Future Directions
The dataset and findings lay the groundwork for developing adaptive, versatile device-control systems. AitW pushes the boundaries of AI's ability to interact with dynamic, visual-spatial environments in a manner closer to human interaction patterns. The paper identifies several directions for future work, including multimodal modeling techniques and refined evaluation mechanisms that allow more flexibility in how tasks are executed. It also acknowledges ethical considerations around privacy and potential misuse, supporting responsible research and deployment of models built on this dataset. In summary, the AitW dataset marks a meaningful step toward AI systems that interact with devices through natural language in a human-like way.