ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay (2505.16282v1)

Published 22 May 2025 in cs.CV

Abstract: Training LLMs as interactive agents for controlling graphical user interfaces (GUIs) presents a unique challenge to optimize long-horizon action sequences with multimodal feedback from complex environments. While recent works have advanced multi-turn reinforcement learning (RL) for reasoning and tool-using capabilities in LLMs, their application to GUI-based agents remains relatively underexplored due to the difficulty of sparse rewards, delayed feedback, and high rollout costs. In this paper, we investigate end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks. We propose Agentic Replay Policy Optimization (ARPO), an end-to-end RL approach that augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse the successful experience across training iterations. To further stabilize the training process, we propose a task selection strategy that filters tasks based on baseline agent performance, allowing the agent to focus on learning from informative interactions. Additionally, we compare ARPO with offline preference optimization approaches, highlighting the advantages of policy-based methods in GUI environments. Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results, establishing a new performance baseline for LLM-based GUI agents trained via reinforcement learning. Our findings underscore the effectiveness of reinforcement learning for training multi-turn, vision-language GUI agents capable of managing complex real-world UI interactions. Codes and models: https://github.com/dvlab-research/ARPO.git.

Summary

  • The paper introduces ARPO, an RL method that enhances GUI agent training through valuable task selection and experience replay.
  • It achieves significant gains, reaching a 29.9% success rate on the standard OSWorld benchmark with improved sample efficiency.
  • A distributed rollout system drastically reduces training time, underscoring ARPO's practicality for real-world GUI interaction.

This paper, "ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay" (2505.16282), addresses the challenge of training LLMs to act as agents capable of interacting with graphical user interfaces (GUIs) using reinforcement learning (RL). Training such agents is difficult due to the long-horizon nature of tasks, sparse and delayed rewards from GUI environments, multimodal feedback (screenshots and text), and high computational costs associated with rolling out trajectories in real desktop environments.

The core contribution is Agentic Replay Policy Optimization (ARPO), an end-to-end RL approach designed specifically for vision-language-based GUI agents. ARPO builds upon Group Relative Policy Optimization (GRPO) (2402.03300) and enhances it with two key components to improve training stability and sample efficiency in the challenging GUI domain:

  1. Valuable Task Selection: A strategy to filter training tasks based on the success rate of a baseline agent. This curates a subset of tasks that are more likely to yield successful rollouts, ensuring meaningful reward variance within GRPO training groups from early stages.
  2. Experience Replay Buffer: A buffer that stores successful trajectories on a per-task basis. When a GRPO training group consists entirely of failed trajectories (zero reward), one is replaced with a successful one from the buffer for the same task. This guarantees at least one positive reward signal in groups that would otherwise have uniform zero rewards, preventing vanishing gradients and stabilizing training.
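
A minimal sketch of these two mechanisms, assuming binary per-trajectory rewards and a buffer keyed by task ID; names such as `ReplayBuffer`, `patch_group`, `select_valuable_tasks`, and the filtering thresholds are illustrative assumptions, not taken from the released code:

```python
import random
from collections import defaultdict

def select_valuable_tasks(baseline_success_rates, low=0.0, high=1.0):
    """Keep tasks the baseline agent sometimes (but not always) solves,
    so GRPO groups have non-trivial reward variance. The thresholds here
    are assumptions for illustration only."""
    return [t for t, rate in baseline_success_rates.items() if low < rate < high]

class ReplayBuffer:
    """Stores successful trajectories on a per-task basis."""
    def __init__(self, max_per_task=4):
        self.max_per_task = max_per_task
        self._store = defaultdict(list)

    def add(self, task_id, trajectory, reward):
        if reward > 0:  # only keep trajectories that solved the task
            self._store[task_id].append(trajectory)
            self._store[task_id] = self._store[task_id][-self.max_per_task:]

    def patch_group(self, task_id, group):
        """group: list of (trajectory, reward) rollouts for one task.
        If every rollout failed, swap one for a stored success so the
        group is not uniformly zero-reward (which would zero out the
        GRPO advantages)."""
        if group and all(r == 0 for _, r in group) and self._store[task_id]:
            idx = random.randrange(len(group))
            group[idx] = (random.choice(self._store[task_id]), 1.0)
        return group
```

In a training loop of this shape, `patch_group` would be applied to each task's rollout group right before the group-relative advantages are computed.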

The agent architecture is based on vision-language models (VLMs), specifically adapting the UI-Tars (2501.12326) framework and Qwen2.5-VL (2502.13923). The model is enhanced with a longer context window (up to 64K tokens) and the ability to process sequences of images (up to 15) to capture the full history of GUI interactions. The agent generates actions in a Chain-of-Thought (CoT) format [wei2022chain], comprising a reasoning part and an executable solution part, using a predefined action space that includes clicks, scrolls, typing, hotkeys, and meta-actions such as WAIT, FINISH, FAIL, and CALL_USER.
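
For concreteness, here is the kind of CoT-formatted turn such an agent produces, paired with a minimal well-formedness check. The exact prompt template and action grammar follow UI-Tars, so the field names, action spellings, and coordinates below are illustrative assumptions only:

```python
import re

# Hypothetical agent turn: free-form reasoning followed by one executable action.
example_turn = (
    "Thought: The Save dialog is open, so I should confirm the filename field first.\n"
    "Action: click(start_box='(512, 384)')"
)

# Illustrative action grammar covering the action space described above
# (clicks, scrolls, typing, hotkeys, and meta-actions); not the verbatim schema.
ACTION_RE = re.compile(
    r"^Action:\s*(click|scroll|type|hotkey|drag|wait|finish|fail|call_user)\(.*\)\s*$",
    re.MULTILINE | re.IGNORECASE,
)

def is_parsable(turn: str) -> bool:
    """Return True if the turn contains exactly one well-formed action line."""
    return len(ACTION_RE.findall(turn)) == 1

print(is_parsable(example_turn))  # True
```

A turn that fails this kind of parse is exactly what the action-format penalty described below is meant to discourage.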

For efficient training, the authors employ a distributed trajectory rollout system. This system uses parallel virtual environments (such as OSWorld [xie2024osworld]) and a centralized VLM inference server. Screenshots are captured from multiple environments and batched for parallel processing by the vLLM server [kwon2023efficient], reducing per-step latency and maximizing GPU utilization, which is critical given the inherent delays in real OS interactions.
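
A schematic of one synchronized rollout step across parallel environments, assuming each environment exposes `screenshot()` and `step()` and the central server exposes a batched `generate_batch()` call; these names are placeholders, and the actual system additionally handles environment resets, timeouts, and asynchronous failures:

```python
from concurrent.futures import ThreadPoolExecutor

def rollout_step(envs, policy_server):
    """One synchronized step across parallel GUI environments (illustrative).

    Screenshots from all environments are gathered and sent to the
    centralized inference server as a single batch, amortizing per-step latency."""
    # 1. Capture the current screenshot from every environment in parallel.
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        observations = list(pool.map(lambda env: env.screenshot(), envs))

    # 2. One batched forward pass on the shared VLM server (e.g. served by vLLM).
    actions = policy_server.generate_batch(observations)

    # 3. Execute each predicted action in its own environment, again in parallel.
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        results = list(pool.map(lambda ea: ea[0].step(ea[1]), zip(envs, actions)))
    return results
```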

The reward design for policy optimization uses a combination of:

  • Trajectory Reward: A binary reward (1 for successful task completion, 0 otherwise) provided by the OSWorld environment.
  • Action Format Reward: A penalty (-1) if the agent's generated action cannot be parsed according to the defined schema, encouraging syntactically correct outputs.

The objective is to maximize the sum of these rewards over tasks and trajectories using the GRPO objective, which computes token-level advantages based on the mean and standard deviation of rewards within a group of rollouts.
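
Putting the two reward terms and the group-relative normalization together, a compact sketch (the scalar advantage is broadcast to every token of its rollout, which is omitted here for brevity):

```python
import numpy as np

def total_reward(task_success: bool, action_parsable: bool) -> float:
    # Trajectory reward from the environment (1 on success, 0 otherwise)
    # plus the action-format penalty (-1) when the output cannot be parsed.
    return (1.0 if task_success else 0.0) + (0.0 if action_parsable else -1.0)

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # GRPO-style normalization: each rollout's advantage is its reward
    # standardized by the mean and std of rewards within its group; the
    # same scalar is then applied to every token of that rollout.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example group of 4 rollouts for one task:
# one success, two well-formed failures, one unparsable failure.
rewards = np.array([total_reward(True, True), total_reward(False, True),
                    total_reward(False, True), total_reward(False, False)])
print(group_relative_advantages(rewards))  # roughly [ 1.41, 0.0, 0.0, -1.41]
```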

Implementation Details:

  • Base Model: UI-Tars-1.5 7B model.
  • Training Framework: VERL (2409.19256).
  • Environments: 256 parallel virtual environments using OSWorld.
  • Rollouts: 8 rollouts per task, batch size 32, temperature 1.0 for exploration.
  • Tasks: 128 tasks selected from OSWorld using the filtering strategy.
  • Training: 15 epochs.
  • Optimizer: AdamW [2017decoupled] with learning rate $1 \times 10^{-6}$.
  • Mini-batch size: 8 per device, gradient accumulation 4.
  • GRPO parameters: $\epsilon_{\text{low}} = 0.2$, $\epsilon_{\text{high}} = 0.3$. KL divergence loss is removed.
  • Evaluation: Temperature 0.6, maximum 15 steps per trajectory. Two metrics: OSWorld (standard) and OSWorld Hard (stricter, no final FAIL action replacement).
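
For convenience, the hyperparameters above can be collected into a single illustrative configuration; the released code uses VERL's own configuration format, so this dictionary is only a summary of the reported values, not the actual config file:

```python
arpo_config = {
    "base_model": "UI-Tars-1.5-7B",
    "num_parallel_envs": 256,          # OSWorld virtual environments
    "rollouts_per_task": 8,
    "train_batch_size": 32,
    "rollout_temperature": 1.0,        # exploration during training
    "eval_temperature": 0.6,
    "max_steps_per_trajectory": 15,    # evaluation budget
    "num_tasks": 128,                  # after valuable-task filtering
    "epochs": 15,
    "optimizer": "AdamW",
    "learning_rate": 1e-6,
    "mini_batch_size_per_device": 8,
    "gradient_accumulation": 4,
    "grpo_clip_eps_low": 0.2,
    "grpo_clip_eps_high": 0.3,
    "kl_loss": None,                   # KL divergence term removed
    "context_window_tokens": 65536,    # up to 64K tokens
    "max_history_images": 15,
}
```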

Experimental Results:

  • ARPO significantly improves the success rate of the base UI-Tars-1.5 model on OSWorld, achieving 29.9% in the standard setting and 23.8% on OSWorld Hard, absolute improvements of 6.4 and 5.6 percentage points, respectively, over the original model.
  • The ablation study shows that the experience replay buffer substantially increases the average trajectory reward during training (0.75 vs. 0.65) and boosts the in-domain success rate (81.25% with ARPO vs. 68.8% with GRPO).
  • Valuable task selection is crucial for training stability and reward variance, leading to higher average rewards and faster convergence compared to training on the full task set.
  • RL training (both GRPO and ARPO) achieves significant gains on in-domain tasks but shows only moderate improvements on out-of-domain tasks, suggesting that strong generalization requires more data diversity or other techniques.
  • Compared to offline preference optimization methods (Reject Sampling, DPO [2023direct], KTO [2024kto]), ARPO achieves the highest success rate (27.3%), outperforming GRPO (26.0%) and preference-based methods (KTO 24.6%, DPO 22.4%, Reject Sampling 21.8%). This indicates that direct policy optimization with rule-based rewards is more effective in this domain.
  • The distributed rollout system demonstrates significant efficiency gains, reducing epoch time from over 6 hours to around 1.2 hours by scaling to 256 parallel environments.
  • Qualitative analysis shows ARPO-trained agents exhibiting self-correction behaviors based on visual feedback.

Practical Implications:

ARPO provides a practical end-to-end RL framework for training GUI agents, addressing key challenges like sparse rewards and high rollout costs. The distributed rollout system is essential for making RL feasible in interactive desktop environments. The task selection and experience replay mechanisms offer concrete ways to improve sample efficiency and training stability in sparse-reward settings common in real-world applications. This research suggests that VLM agents can learn complex multi-turn GUI tasks directly from environment feedback, moving beyond pure imitation learning.

Implementation Considerations:

  • Requires a powerful VLM architecture capable of processing long sequences of images and text.
  • Demands a sophisticated distributed system for parallel environment rollouts and centralized vLLM inference.
  • Relies on a reward function (even if sparse) that can be obtained programmatically from the environment.
  • The task selection strategy requires an initial evaluation phase with a baseline agent.
  • The experience replay buffer needs careful management (size limits, eviction policies) to prevent divergence from the current policy.
  • Generalization to truly novel or complex out-of-domain tasks remains a challenge, suggesting the need for broader task coverage or different training strategies.

Overall, the paper presents a significant step towards building more capable and adaptive GUI agents using reinforcement learning, providing practical techniques to overcome common obstacles in this domain.