Scalable and stable multi-turn reinforcement learning for GUI agents

Establish reinforcement learning techniques that remain stable and effective for GUI-centered agents in long-horizon interactive environments. Such techniques must address sparse or delayed rewards, optimization instability, and credit assignment across extended action sequences, so that training can scale consistently beyond short-horizon demonstrations.

Background

The report emphasizes that multi-turn RL in interactive environments is difficult due to sparse or delayed rewards, optimization instability, and long-horizon credit assignment, all of which hinder scaling and stable improvement. The proposed framework introduces asynchronous rollouts, streaming updates, and PPO enhancements, but the general challenge of scalable, stable multi-turn RL for GUI agents remains an open problem.
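To make the asynchronous-rollout and streaming-update idea concrete, the following is a minimal Python sketch of the producer/consumer decoupling: rollout workers generate trajectories continuously while a learner consumes them in mini-batches, so actors may lag the learner by several policy versions. The environment, episode, and update logic are hypothetical stand-ins; the report's actual infrastructure and APIs are not reproduced here.

```python
import queue
import random
import threading

# Minimal sketch of asynchronous rollouts with streaming updates (the
# report's actual environments, models, and APIs are not reproduced here;
# run_episode and the update step are hypothetical stand-ins).

policy_version = 0                 # bumped by the learner after each update
version_lock = threading.Lock()
traj_queue = queue.Queue(maxsize=64)
stop = threading.Event()

def run_episode(version):
    """Stand-in for one multi-turn GUI episode; returns a trajectory record."""
    return {"policy_version": version,
            "length": random.randint(5, 50),  # long-horizon, variable length
            "reward": random.random()}        # e.g., a sparse terminal reward

def actor():
    """Rollout worker: generates episodes continuously, never waiting on the learner."""
    while not stop.is_set():
        with version_lock:
            v = policy_version
        traj_queue.put(run_episode(v))

def learner(batch_size=8, num_updates=10):
    """Streams trajectories into mini-batches and updates without a sync barrier."""
    global policy_version
    for step in range(num_updates):
        batch = [traj_queue.get() for _ in range(batch_size)]
        staleness = policy_version - min(t["policy_version"] for t in batch)
        # A PPO-style clipped update would go here; because actors lag the
        # learner, some data is slightly off-policy, which the clipping and
        # importance ratios must absorb.
        with version_lock:
            policy_version += 1
        print(f"update {step}: max staleness {staleness} policy versions")

workers = [threading.Thread(target=actor, daemon=True) for _ in range(4)]
for w in workers:
    w.start()
learner()
stop.set()
```

Decoupling generation from optimization keeps hardware utilized when episode lengths vary wildly, at the cost of mild off-policy drift that the PPO enhancements must tolerate.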

This problem motivates the report's training infrastructure and algorithmic choices (reward shaping, decoupled and length-adaptive GAE, value pretraining), underscoring the need for principled RL methods that can reliably optimize long interactive trajectories.
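As an illustration of how decoupled and length-adaptive GAE might be realized, the sketch below computes standard GAE advantages but uses a different lambda for the policy than for the critic (the decoupling), and chooses the policy's lambda per trajectory so that lambda^T stays constant across episodes of different lengths. This is one plausible length-adaptive rule under stated assumptions, not the report's exact formula.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Standard GAE: A_t = sum_l (gamma*lam)^l * delta_{t+l},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0  # terminal state bootstraps to 0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def decoupled_length_adaptive_gae(rewards, values, gamma=0.99,
                                  lam_critic=0.95, alpha=0.5):
    """Assumed form of decoupled, length-adaptive GAE (illustrative only):
    the critic target keeps a fixed lam_critic, while the policy advantage
    uses a per-trajectory lam_policy with lam_policy ** T == alpha, keeping
    the effective credit-assignment horizon comparable across episodes of
    very different lengths."""
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    lam_policy = alpha ** (1.0 / T)  # longer episodes push lam_policy toward 1
    adv = gae(rewards, values, gamma, lam_policy)                     # policy update signal
    value_targets = gae(rewards, values, gamma, lam_critic) + values  # critic regression target
    return adv, value_targets

# Example: a 40-step episode with a single sparse terminal reward.
rewards = [0.0] * 39 + [1.0]
values = np.linspace(0.1, 0.9, 40)
adv, targets = decoupled_length_adaptive_gae(rewards, values)
print(adv[:3], targets[-1])
```

In the same spirit, value pretraining would fit the critic to targets like `value_targets` above before any policy updates begin, so that early PPO steps do not chase a randomly initialized value baseline.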

References

While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning (arXiv:2509.02544, Wang et al., 2 Sep 2025), Abstract (Page 1)