MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment (2507.05720v1)

Published 8 Jul 2025 in cs.LG and cs.CL

Abstract: Recently, there has been a surge of vision-based GUI agents designed to automate everyday mobile and web tasks. These agents interpret raw GUI screenshots and autonomously decide where to click, scroll, or type, bypassing handcrafted rules and app-specific APIs. However, most existing methods train GUI agents in offline environments using pre-collected trajectories. This approach limits scalability, causes overfitting to specific UI templates, and leads to brittle policies when faced with unseen environments. We present MobileGUI-RL, a scalable framework that trains GUI agents in online environments. MobileGUI-RL contains two key components. It (i) synthesizes a curriculum of learnable tasks through self-exploration and filtering, and (ii) adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards that balance task success and execution efficiency. Experiments on three online mobile-agent benchmarks show consistent gains, validating the effectiveness of our approach.

Summary

  • The paper introduces a scalable online RL framework that trains mobile GUI agents through real-time interactions, overcoming limitations of static datasets.
  • It employs synthetic task generation, curriculum filtering, and trajectory-aware policy optimization with a composite reward to enhance performance.
  • Empirical results reveal significant success rate gains across benchmarks, demonstrating robust performance and scalability in mobile automation.

MobileGUI-RL: Reinforcement Learning for Robust Mobile GUI Agents in Online Environments

MobileGUI-RL presents a comprehensive framework for training mobile GUI agents via reinforcement learning (RL) in online, interactive environments. The work addresses the limitations of prior approaches that rely on offline, static datasets and supervised fine-tuning, which often result in overfitting to specific UI templates and poor generalization to novel or dynamic interfaces. By leveraging online RL, MobileGUI-RL enables agents to adaptively learn from real-time interactions, thereby improving robustness and scalability in practical mobile automation scenarios.

Framework Overview

The core contributions of MobileGUI-RL are threefold:

  1. Scalable Online Training Environment: The framework orchestrates batched, asynchronous execution of multiple Android emulators, allowing for high-throughput collection of diverse interaction trajectories. This design decouples environment simulation (CPU-bound) from model training (GPU-bound), facilitating efficient resource utilization and large-scale rollouts (see the rollout sketch after this list).
  2. Synthetic Task Generation and Curriculum Filtering: To overcome the scarcity and skew of real-world task distributions, MobileGUI-RL introduces a self-exploration mechanism that generates candidate tasks by performing guided random walks through the GUI. These trajectories are reverse-engineered into natural-language instructions using LLMs (e.g., GPT-4o). A lightweight, text-based world model then filters the candidates, keeping only tasks that are feasible and unambiguous for the training curriculum. This process not only increases task diversity but also implicitly orders tasks by complexity, supporting curriculum learning (see the pipeline sketch after this list).
  3. Trajectory-Aware RL with MobGRPO: The framework adapts Group Relative Policy Optimization (GRPO) to the GUI domain, introducing trajectory-level advantage estimation and a composite reward function. Rewards are computed at the trajectory level, incorporating both binary task success and an exponential decay factor to incentivize efficiency. Penalties for premature termination further discourage degenerate behaviors. This design addresses the challenges of sparse rewards and long-horizon credit assignment inherent in GUI navigation tasks (a sketch of the reward and advantage computation follows this list).
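To make the decoupling in point (1) concrete, below is a minimal asyncio sketch of batched emulator rollouts. The `env` and `policy` objects and their methods are illustrative stand-ins, not the paper's actual interfaces: the point is that emulator steps await I/O while other rollouts proceed, so CPU-bound simulation overlaps with, rather than serializes behind, GPU-bound inference.

```python
import asyncio

async def rollout(env, policy, task: str, max_steps: int = 15) -> list:
    """Collect one trajectory from a single emulator (illustrative API)."""
    obs = await env.reset(task)             # initial screenshot for the task
    trajectory = []
    for _ in range(max_steps):
        action = policy.act(obs, task)      # GPU-bound model inference
        obs, done = await env.step(action)  # CPU-bound emulator step
        trajectory.append((obs, action))
        if done:
            break
    return trajectory

async def collect_batch(envs, policy, tasks) -> list:
    # All emulators advance concurrently; training only waits for the slowest.
    return await asyncio.gather(
        *(rollout(env, policy, task) for env, task in zip(envs, tasks))
    )
```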
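Point (2) reduces to a three-stage pipeline: explore, describe, filter. Here is a minimal sketch with the three stages passed in as callables, since the concrete exploration policy, GPT-4o prompt, and world-model filter are not reproduced in this summary; their signatures are assumptions.

```python
from typing import Callable

def synthesize_tasks(
    explore: Callable[[], str],          # guided random walk -> action trace
    summarize: Callable[[str], str],     # LLM: trace -> natural-language task
    is_feasible: Callable[[str], bool],  # text-based world-model filter
    n_walks: int = 100,
) -> list[str]:
    """Self-exploration, instruction synthesis, then feasibility filtering."""
    candidates = [summarize(explore()) for _ in range(n_walks)]
    # Keep only feasible, unambiguous tasks for the training curriculum.
    return [task for task in candidates if is_feasible(task)]
```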
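For point (3), a sketch of the reward and advantage computation, assuming the composite reward takes the form success · gamma^steps minus a fixed penalty for premature termination, consistent with the description above; the constants below are placeholders, not the paper's values.

```python
import numpy as np

def trajectory_reward(success: bool, n_steps: int, premature_stop: bool,
                      gamma: float = 0.95, penalty: float = 0.5) -> float:
    """Composite reward: binary success, discounted by episode length to
    reward efficiency, minus a penalty for terminating too early.
    gamma and penalty are illustrative, not the paper's values."""
    reward = float(success) * gamma ** n_steps
    if premature_stop:
        reward -= penalty
    return reward

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """GRPO-style normalization over a group of rollouts for the same task;
    each trajectory's advantage is shared by all of its steps."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

Broadcasting one normalized advantage across every step of its trajectory is what makes the estimate trajectory-aware: credit is assigned at the episode level, sidestepping the sparse per-step rewards of GUI navigation.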

Implementation Details

  • Agent Architecture: The agent is instantiated as a vision-language model (Qwen2.5-VL-7B/32B), receiving as input the current screenshot, task instruction, and interaction history. Actions are generated via a structured tool-use interface, supporting taps, swipes, text input, system navigation, and control commands (see the action-schema sketch after this list).
  • Reward Evaluation: A large VLM oracle (e.g., Qwen2.5-VL-72B) serves as a unified evaluator, providing binary success/failure signals based on the final trajectory and instruction. This abstracts away low-level UI details and enables scalable, task-agnostic supervision (an evaluator sketch follows this list).
  • Training Regimen: The environment supports dynamic scaling of emulator instances, with each agent rollout capped at a maximum episode length. Hyperparameters are tuned for stability and efficiency, including trajectory-based normalization of advantages and careful balancing of reward components.
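As a rough picture of the structured tool-use interface described under Agent Architecture, an action might be modeled as a small schema like the one below; the field names and action kinds are illustrative, not the paper's exact specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GUIAction:
    """One structured action emitted by the agent (illustrative schema)."""
    kind: str                                # "tap" | "swipe" | "type" | "system" | "terminate"
    point: Optional[Tuple[int, int]] = None  # screen coordinates for taps
    direction: Optional[str] = None          # swipe direction
    text: Optional[str] = None               # text to type
    button: Optional[str] = None             # system button, e.g. "back", "home"

# A tool call such as {"kind": "tap", "point": [540, 1200]} would parse into
# GUIAction(kind="tap", point=(540, 1200)).
```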
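The oracle evaluation step can likewise be sketched as a single judgment over the finished trajectory; the prompt wording and the `vlm_judge` wrapper below are assumptions, not the paper's actual prompt or API.

```python
from typing import Callable

JUDGE_PROMPT = (
    "Task: {task}\n"
    "Below are the screenshots and actions from an agent episode.\n"
    "Answer YES if the task was completed successfully, otherwise NO."
)

def evaluate_trajectory(task: str, trajectory: list,
                        vlm_judge: Callable[[str, list], str]) -> bool:
    """Binary success signal from a large VLM oracle (e.g., Qwen2.5-VL-72B);
    vlm_judge is an assumed wrapper around the actual model call."""
    verdict = vlm_judge(JUDGE_PROMPT.format(task=task), trajectory)
    return verdict.strip().upper().startswith("YES")
```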

Empirical Results

MobileGUI-RL demonstrates strong empirical performance across three challenging mobile GUI agent benchmarks: AndroidWorld (AW), Android-in-the-Wild General Tasks (AITW-Gen), and Android-in-the-Wild WebShop (AITW-Web). Notable findings include:

  • Substantial Gains Over Baselines: MobileGUI-7B improves the success rate (SR) on AITW-Gen from 49.0% (base) to 65.3%, and MobileGUI-32B achieves 44.8% SR on AW, outperforming both its base model (31.5%) and larger models such as Qwen2.5-VL-72B (35.0%) and GPT-4o (34.5%).
  • Ablation Studies: Removing any one of task filtering, curriculum learning, or the decaying reward component leads to significant performance degradation, underscoring the necessity of these design choices. Curriculum learning, in particular, yields a 10.8-point improvement for the 32B model on AW.
  • Sample Efficiency and Generalization: The curriculum-based approach enables the agent to master basic skills before tackling more complex tasks, resulting in higher final performance and better generalization to unseen scenarios.

Implications and Future Directions

Practical Implications:

  • The framework's ability to train robust, generalist GUI agents without reliance on handcrafted rules or app-specific APIs has direct applications in mobile automation, accessibility, and device testing.
  • The scalable, asynchronous environment design is well-suited for industrial-scale deployment, supporting continual adaptation as mobile ecosystems evolve.

Theoretical Implications:

  • The trajectory-level RL formulation and composite reward design provide a principled solution to the credit assignment and sparse reward problems in long-horizon, multimodal tasks.
  • The use of LLM-based world models for task filtering and curriculum construction offers a generalizable approach for other domains where environment simulation is costly or infeasible.

Future Research Directions:

  • Task Complexity and Hierarchical Decomposition: Extending the curriculum to include more realistic, long-horizon tasks, potentially via human-in-the-loop curation or hierarchical task generation.
  • Granular Reward Signals: Incorporating step-wise or subgoal-based rewards to further improve credit assignment and learning efficiency.
  • Visual World Models and Lookahead Planning: Developing predictive models of GUI state transitions to enable multi-step planning and error correction.
  • On-Device Continual Adaptation: Enabling agents to personalize and adapt to individual user preferences through continual learning on-device.

Conclusion

MobileGUI-RL establishes a new paradigm for training mobile GUI agents via online reinforcement learning, combining scalable environment simulation, automated curriculum construction, and trajectory-aware policy optimization. The framework achieves state-of-the-art results on multiple benchmarks and provides a robust foundation for future research in adaptive, generalist GUI agents. The modularity and scalability of the approach make it highly relevant for both academic research and real-world deployment in mobile automation and human-computer interaction.