- The paper introduces DigiRL, a novel RL framework that combines offline and online phases to autonomously train agents for precise device control in dynamic GUI environments.
- It leverages a value-enhanced advantage-weighted regression algorithm with automatic curriculum learning and doubly-robust advantage estimation to improve policy credit assignment.
- Empirical results on the AitW dataset show success rates of 71.9% on General tasks and 67.2% on Web Shopping, outperforming supervised fine-tuning baselines and prior VLM-based agents.
Analyzing "DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning"
This paper introduces DigiRL, a novel autonomous reinforcement learning (RL) framework that fine-tunes vision-language models (VLMs) into agents that control devices through graphical user interfaces (GUIs) in the wild. The core motivation stems from the inadequacies of existing VLMs on in-the-wild device control: they lack domain-specific decision-making data, and the environment is dynamic and stochastic in ways that traditional supervised and demonstration-based learning paradigms cannot sufficiently address.
Algorithm and Approach
The DigiRL framework builds on a two-phase RL approach. It first employs offline RL to initialize the model on existing, potentially stale data, and then transitions to an offline-to-online RL phase that leverages freshly collected rollouts. Learning takes place in a highly scalable, parallelizable Android emulator setup, equipped with a VLM-based evaluator that autonomously assesses task completion and provides reward signals.
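For concreteness, here is a minimal Python sketch of how such a two-phase loop might be organized. All names (`Policy`, `collect_rollout`, `vlm_evaluator`, `load_offline_dataset`) and constants are illustrative placeholders, not the authors' implementation; the real system fine-tunes a VLM policy and queries an actual VLM evaluator.

```python
import random

# Illustrative sketch (not the DigiRL codebase): offline initialization
# followed by online rollouts from parallel emulators, with a VLM-style
# evaluator supplying a sparse end-of-episode reward.

NUM_ENVS = 4            # parallel emulator workers (placeholder scale)
NUM_ONLINE_STEPS = 3    # online training iterations (placeholder)
HORIZON = 5             # max GUI actions per episode (placeholder)

class Policy:
    def act(self, instruction, observation):
        return "tap(0.5, 0.5)"           # stand-in for a VLM action head

    def update(self, batch):
        pass                             # AWR-style update (sketched later)

def vlm_evaluator(final_observation, instruction):
    # Autonomous evaluator: 1.0 if the final screen indicates success.
    return float(random.random() > 0.5)  # stand-in for a real VLM judgment

def collect_rollout(env_id, policy, instruction):
    trajectory = []
    for t in range(HORIZON):
        observation = f"screenshot[env={env_id}, t={t}]"  # placeholder state
        action = policy.act(instruction, observation)
        trajectory.append((instruction, observation, action))
    reward = vlm_evaluator(observation, instruction)      # sparse reward
    return trajectory, reward

def load_offline_dataset():
    # Pre-collected (possibly stale) trajectories with stored rewards.
    return [([("old task", "old screen", "old action")], 1.0)]

# Phase 1: offline RL on the stale dataset.
policy = Policy()
policy.update(load_offline_dataset())

# Phase 2: offline-to-online RL on freshly collected parallel rollouts.
for step in range(NUM_ONLINE_STEPS):
    tasks = [f"task {step}-{i}" for i in range(NUM_ENVS)]
    batch = [collect_rollout(i, policy, tasks[i]) for i in range(NUM_ENVS)]
    policy.update(batch)
```

The structural point is that both phases feed the same policy update; only the data source changes, from a fixed dataset to live emulator rollouts.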
DigiRL differentiates itself through a value-enhanced variant of advantage-weighted regression (AWR), which incorporates:
- Automatic Curriculum Learning: It uses an instruction-level value function to dynamically prioritize tasks based on their difficulty and informativeness.
- Doubly-Robust Advantage Estimation: This mitigates the high variance of sparse Monte-Carlo returns by combining them with learned value estimates, sharpening credit assignment for policy updates.
The key modification from traditional AWR is the filtering of experiences using instruction-level values to build an automatic curriculum, alongside the training of a step-level value function to guide learning.
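The sketch below gives one plausible rendering of these two ingredients under a sparse end-of-episode reward. The paper's exact estimators, thresholds, and hyperparameters differ; `v_instr`, `v_step`, `BETA`, `LAM`, and the filtering rule are placeholders for learned components and tuned values.

```python
import math

BETA = 0.1    # AWR temperature (placeholder)
LAM = 0.5     # mix of Monte-Carlo return and TD-style estimate (placeholder)
GAMMA = 1.0   # discount factor

def v_instr(instruction):
    # Instruction-level value: estimated success probability of the task
    # under the current policy (a learned model in practice).
    return 0.4

def v_step(instruction, observation):
    # Step-level value baseline V(s) (also learned in practice).
    return 0.3

def curriculum_filter(rollouts):
    # Automatic curriculum: keep successful rollouts on tasks the policy
    # does not already solve reliably, i.e., the most informative ones.
    kept = []
    for trajectory, reward in rollouts:
        instruction = trajectory[0][0]
        if reward > 0 and v_instr(instruction) < 0.9:
            kept.append((trajectory, reward))
    return kept

def awr_weights(trajectory, reward):
    # Doubly-robust-style step advantage: blend the sparse Monte-Carlo
    # outcome with value estimates, then exponentiate for AWR.
    weights = []
    steps = len(trajectory)
    for t, (instruction, obs, _action) in enumerate(trajectory):
        mc_return = (GAMMA ** (steps - 1 - t)) * reward
        step_reward = reward if t == steps - 1 else 0.0
        next_v = v_step(instruction, trajectory[t + 1][1]) if t + 1 < steps else 0.0
        td_estimate = step_reward + GAMMA * next_v - v_step(instruction, obs)
        advantage = LAM * mc_return + (1.0 - LAM) * td_estimate
        weights.append(math.exp(advantage / BETA))
    return weights  # each weight scales that step's log-likelihood loss

# Toy usage: filter rollouts, then compute per-step AWR weights.
demo = [([("check weather", "home screen", "open browser"),
          ("check weather", "browser open", "type query")], 1.0)]
for trajectory, reward in curriculum_filter(demo):
    print(awr_weights(trajectory, reward))
```

In a real implementation, each weight would multiply the negative log-likelihood of the VLM policy's action at that step, so that high-advantage steps from informative tasks dominate the gradient.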
Performance Metrics
DigiRL's performance is rigorously assessed using the Android-in-the-Wild (AitW) dataset, specifically focusing on General and Web Shopping task subsets. Through comprehensive empirical evaluations:
- The framework demonstrates a significant improvement, achieving success rates of 71.9% in the General subset and 67.2% in the Web Shopping subset.
- These metrics mark a 49.5% absolute improvement over the best prior model trained with supervised fine-tuning on static human demonstration data, and substantially outperform proprietary VLM-based agents augmented with AppAgent prompting.
Implications and Future Directions
The introduction of DigiRL paves the way for advanced RL-based methodologies in device control. The implications are profound:
- Practical Implementations: It offers a robust solution for deploying adaptable and effective digital assistants capable of complex GUI navigation and task execution amidst real-world stochasticity.
- Theoretical Insights: The reinforcement learning community benefits from the novel use of instruction-level values and doubly-robust advantage estimation, which can be extended or refined for other RL problems involving high variance and dynamic environments.
Future research can build upon DigiRL by integrating more sophisticated RL algorithms, exploring finer-grained curriculum learning, and expanding the task scope to device and application ecosystems beyond Android. Further effort could also improve the efficiency and scalability of the parallel emulator setup and optimize the autonomous evaluator for broader task generalization.
In summary, DigiRL represents a substantial stride in the domain of autonomous RL for device control. Its methodical approach in integrating offline initialization with continuous online learning to adapt to real-world complexities sets a new benchmark, underscoring the potential of RL in creating highly capable and autonomous digital agents.