- The paper introduces DigiRL, a novel RL framework that combines offline and online phases to autonomously train agents for precise device control in dynamic GUI environments.
- It leverages a value-enhanced advantage-weighted regression algorithm with automatic curriculum learning and doubly-robust advantage estimation to improve policy credit assignment.
- Empirical results on the AitW dataset show success rates of 71.9% on General tasks and 67.2% on Web Shopping, outperforming supervised fine-tuning baselines and prior VLM-based agents.
Analyzing "DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning"
This paper introduces DigiRL, a novel autonomous reinforcement learning (RL) framework that fine-tunes vision-language models (VLMs) into agents that control devices through graphical user interfaces (GUIs) in the wild. The core motivation stems from the inadequacies of existing VLMs on in-the-wild device control: they lack domain-specific decision-making data, and the environment is dynamic and stochastic in ways that traditional supervised and demonstration-based learning paradigms cannot sufficiently address.
Algorithm and Approach
The DigiRL framework builds on a two-phase RL approach. It first employs offline RL to initialize the model on existing, potentially stale data, and then transitions to an offline-to-online RL phase that leverages freshly collected rollouts. Learning takes place in a highly scalable, parallelizable Android emulator setup, equipped with a VLM-based evaluator that autonomously assesses task completion and provides reward signals.
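For concreteness, here is a minimal Python sketch of how such a two-phase loop might be organized. All names (`Policy`, `collect_rollout`, `vlm_evaluator`, `load_offline_dataset`) and constants are illustrative placeholders, not the authors' implementation; the real system fine-tunes a VLM policy and queries an actual VLM evaluator.

```python
import random

# Illustrative sketch (not the DigiRL codebase): offline initialization
# followed by online rollouts from parallel emulators, with a VLM-style
# evaluator supplying a sparse end-of-episode reward.

NUM_ENVS = 4            # parallel emulator workers (placeholder scale)
NUM_ONLINE_STEPS = 3    # online training iterations (placeholder)
HORIZON = 5             # max GUI actions per episode (placeholder)

class Policy:
    def act(self, instruction, observation):
        return "tap(0.5, 0.5)"           # stand-in for a VLM action head

    def update(self, batch):
        pass                             # AWR-style update (sketched later)

def vlm_evaluator(final_observation, instruction):
    # Autonomous evaluator: 1.0 if the final screen indicates success.
    return float(random.random() > 0.5)  # stand-in for a real VLM judgment

def collect_rollout(env_id, policy, instruction):
    trajectory = []
    for t in range(HORIZON):
        observation = f"screenshot[env={env_id}, t={t}]"  # placeholder state
        action = policy.act(instruction, observation)
        trajectory.append((instruction, observation, action))
    reward = vlm_evaluator(observation, instruction)      # sparse reward
    return trajectory, reward

def load_offline_dataset():
    # Pre-collected (possibly stale) trajectories with stored rewards.
    return [([("old task", "old screen", "old action")], 1.0)]

# Phase 1: offline RL on the stale dataset.
policy = Policy()
policy.update(load_offline_dataset())

# Phase 2: offline-to-online RL on freshly collected parallel rollouts.
for step in range(NUM_ONLINE_STEPS):
    tasks = [f"task {step}-{i}" for i in range(NUM_ENVS)]
    batch = [collect_rollout(i, policy, tasks[i]) for i in range(NUM_ENVS)]
    policy.update(batch)
```

The structural point is that both phases feed the same policy update; only the data source changes, from a fixed dataset to live emulator rollouts.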
DigiRL differentiates itself through a value-enhanced variant of advantage-weighted regression (AWR), which incorporates:
- Automatic Curriculum Learning: It uses an instruction-level value function to dynamically prioritize tasks based on their difficulty and informativeness.
- Doubly-Robust Advantage Estimation: This mitigates the high variance of sparse Monte-Carlo returns by combining them with learned value estimates, sharpening credit assignment for policy updates.
The key modification from traditional AWR is the filtering of experiences using instruction-level values to build an automatic curriculum, alongside the training of a step-level value function to guide learning.
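The sketch below gives one plausible rendering of these two ingredients under a sparse end-of-episode reward. The paper's exact estimators, thresholds, and hyperparameters differ; `v_instr`, `v_step`, `BETA`, `LAM`, and the filtering rule are placeholders for learned components and tuned values.

```python
import math

BETA = 0.1    # AWR temperature (placeholder)
LAM = 0.5     # mix of Monte-Carlo return and TD-style estimate (placeholder)
GAMMA = 1.0   # discount factor

def v_instr(instruction):
    # Instruction-level value: estimated success probability of the task
    # under the current policy (a learned model in practice).
    return 0.4

def v_step(instruction, observation):
    # Step-level value baseline V(s) (also learned in practice).
    return 0.3

def curriculum_filter(rollouts):
    # Automatic curriculum: keep successful rollouts on tasks the policy
    # does not already solve reliably, i.e., the most informative ones.
    kept = []
    for trajectory, reward in rollouts:
        instruction = trajectory[0][0]
        if reward > 0 and v_instr(instruction) < 0.9:
            kept.append((trajectory, reward))
    return kept

def awr_weights(trajectory, reward):
    # Doubly-robust-style step advantage: blend the sparse Monte-Carlo
    # outcome with value estimates, then exponentiate for AWR.
    weights = []
    steps = len(trajectory)
    for t, (instruction, obs, _action) in enumerate(trajectory):
        mc_return = (GAMMA ** (steps - 1 - t)) * reward
        step_reward = reward if t == steps - 1 else 0.0
        next_v = v_step(instruction, trajectory[t + 1][1]) if t + 1 < steps else 0.0
        td_estimate = step_reward + GAMMA * next_v - v_step(instruction, obs)
        advantage = LAM * mc_return + (1.0 - LAM) * td_estimate
        weights.append(math.exp(advantage / BETA))
    return weights  # each weight scales that step's log-likelihood loss

# Toy usage: filter rollouts, then compute per-step AWR weights.
demo = [([("check weather", "home screen", "open browser"),
          ("check weather", "browser open", "type query")], 1.0)]
for trajectory, reward in curriculum_filter(demo):
    print(awr_weights(trajectory, reward))
```

In a real implementation, each weight would multiply the negative log-likelihood of the VLM policy's action at that step, so that high-advantage steps from informative tasks dominate the gradient.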
Performance Metrics
DigiRL's performance is rigorously assessed using the Android-in-the-Wild (AitW) dataset, specifically focusing on General and Web Shopping task subsets. Through comprehensive empirical evaluations:
- The framework demonstrates a significant improvement, achieving success rates of 71.9% in the General subset and 67.2% in the Web Shopping subset.
- These metrics mark a 49.5% absolute improvement over the best prior model trained with supervised fine-tuning on static human demonstration data, and substantially outperform proprietary VLM-based agents augmented with AppAgent prompting.
Implications and Future Directions
The introduction of DigiRL paves the way for advanced RL-based methodologies in device control. The implications are profound:
- Practical Implementations: It offers a robust solution for deploying adaptable and effective digital assistants capable of complex GUI navigation and task execution amidst real-world stochasticity.
- Theoretical Insights: The reinforcement learning community benefits from the novel use of instruction-level values and doubly-robust advantage estimation, which can be extended or refined for other RL problems involving high variance and dynamic environments.
Future research can build upon DigiRL by integrating more sophisticated RL algorithms, exploring finer-grained curriculum learning, and expanding the task scope to device and application ecosystems beyond Android. Further effort could also improve the efficiency and scalability of the parallel emulator setup and optimize the autonomous evaluator for broader task generalization.
In summary, DigiRL represents a substantial stride in the domain of autonomous RL for device control. Its methodical approach in integrating offline initialization with continuous online learning to adapt to real-world complexities sets a new benchmark, underscoring the potential of RL in creating highly capable and autonomous digital agents.