
GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents (2504.10458v3)

Published 14 Apr 2025 in cs.CV, cs.CL, and cs.HC

Abstract: Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of LLMs in real-world settings, we propose GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.

Summary

  • The paper introduces GUI-R1, a reinforcement fine-tuning framework that leverages rule-based rewards to improve LVLM performance on GUI tasks.
  • It unifies diverse platforms with a common action space and evaluates outputs using a combined format and accuracy reward function.
  • Experiments on eight benchmarks demonstrate significant gains in task success and grounding accuracy using only 3K curated examples.

This paper introduces GUI-R1, a novel framework for training GUI (Graphical User Interface) agents using reinforcement learning, specifically targeting the limitations of existing methods that rely on supervised fine-tuning (SFT). The authors argue that SFT requires large datasets and struggles with generalization to unseen interfaces and high-level tasks.

Core Idea:

GUI-R1 applies Reinforcement Fine-Tuning (RFT), inspired by recent successes in LLMs like DeepSeek-R1, to enhance the capabilities of Large Vision-Language Models (LVLMs) for GUI interaction tasks. Instead of learning directly from labeled action sequences (SFT), GUI-R1 uses a reward mechanism to guide the LVLM's learning process.

Methodology:

  1. RFT Paradigm: The framework takes the current GUI screenshot ($I$), a high-level task instruction ($Q$), and the action history ($H$) as input. The LVLM (policy model) generates multiple candidate responses ($O$), each containing reasoning steps and a predicted action.
  2. Unified Action Space: To handle diverse platforms (Windows, Linux, MacOS, Android, Web), a unified set of atomic actions is defined (e.g., click, type, scroll, complete, press_back). This allows consistent training across different data sources.
  3. Verifiable Reward Function: A rule-based reward function evaluates each generated response. The total reward ($R_o$) combines:
    • Format Reward ($R_f$): Checks whether the output adheres to the specified format (e.g., wrapping the reasoning and the final action in <think> and <answer> tags).
    • Accuracy Reward ($R_{acc}$): The sum of:
      • Action Type Reward ($R_{act}$): Binary reward (1/0) for matching the ground-truth action type.
      • Click Point Reward ($R_{point}$): Binary reward (1/0) if the predicted click coordinates fall within the ground-truth bounding box.
      • Input Text Reward ($R_{text}$): Binary reward (1/0) based on whether the semantic F1 score between the predicted and ground-truth text exceeds 0.5.

    The final reward is $R_o = \alpha R_f + \beta R_{acc}$ (a minimal sketch of the action schema and reward computation appears at the end of this summary).

  4. Policy Optimization: The Group Relative Policy Optimization (GRPO) algorithm is used. It computes the relative advantage ($A_i$) of each response by normalizing its reward against the mean and standard deviation of the rewards of all responses generated for that input, and this advantage guides the policy-model updates (a sketch of the advantage computation also appears at the end of this summary).

  5. Data Curation: A high-quality dataset, GUI-R1-3K, was curated. Starting from roughly 14M examples drawn from various sources, instances were filtered using a Qwen2.5VL-7B model and the reward function to select challenging yet learnable cases, yielding 1.5K high-level and 140K low-level examples. A balanced set of 3K examples (1.5K high-level + 1.5K sampled low-level) was then assembled for efficient RFT.

Implementation and Experiments:

  • Models: QwenVL2.5-3B and QwenVL2.5-7B were used as base LVLMs.
  • Training: RFT was performed with the EasyR1 framework for 9 epochs; SFT comparisons used LLaMA Factory for 1 epoch.
  • Evaluation: Models were tested on 8 benchmarks covering mobile, desktop, and web platforms, assessing grounding, low-level tasks, and high-level tasks with metrics such as action type accuracy (Type), grounding accuracy (GR), and step success rate (SR).
  • Results:
    • GUI-R1 significantly outperformed previous SOTA methods (e.g., OS-Atlas) and SFT baselines across all benchmarks, despite using only 3K training examples (0.02% of the data used by OS-Atlas).
    • For instance, on low-level tasks, GUI-R1-3B achieved an overall success rate of 80.88, compared to 55.65 for the base QwenVL2.5-3B and 65.79 for SFT on the same 3K examples. Similar gains were observed for grounding and high-level tasks.
    • Ablation studies confirmed the benefits of high-quality filtered data, higher image resolution, and weighting accuracy rewards more heavily than format rewards.

Contributions:

  1. The first framework to apply rule-based RFT to enhance LVLMs for high-level GUI agent tasks.
  2. A unified action space and corresponding rule-based reward function for cross-platform GUI evaluation.
  3. The creation of GUI-R1-3K, a small but effective dataset for RFT.
  4. Strong benchmark results demonstrating the data efficiency and performance benefits of the RFT approach for GUI agents.

In conclusion, GUI-R1 presents a data-efficient and effective alternative to SFT for training capable GUI agents, leveraging reinforcement learning guided by rule-based rewards within a unified action space.
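
To make the unified action space (step 2) and the rule-based reward (step 3) more concrete, here is a minimal Python sketch of how such an action schema and reward could be wired together. It is an illustration under stated assumptions, not the paper's implementation: the Action fields, the <think>/<answer> regex check, the `semantic_f1` stand-in, and the default weights `alpha`/`beta` are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical unified action record; field names are illustrative, not the paper's exact schema.
@dataclass
class Action:
    type: str                   # e.g., "click", "type", "scroll", "complete", "press_back"
    point: tuple | None = None  # (x, y) click coordinates, if applicable
    text: str | None = None     # typed text, if applicable

def format_reward(response: str) -> float:
    """R_f: 1 if the response wraps reasoning and action in <think>/<answer> tags, else 0."""
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL)
    return 1.0 if ok else 0.0

def semantic_f1(pred: str, gt: str) -> float:
    """Stand-in for a semantic F1 score; a simple token-level F1 for illustration."""
    p, g = set(pred.lower().split()), set(gt.lower().split())
    if not p or not g:
        return 0.0
    overlap = len(p & g)
    prec, rec = overlap / len(p), overlap / len(g)
    return 0.0 if overlap == 0 else 2 * prec * rec / (prec + rec)

def accuracy_reward(pred: Action, gt: Action, gt_box: tuple | None) -> float:
    """R_acc = R_act + R_point + R_text, each a binary (1/0) component."""
    r_act = 1.0 if pred.type == gt.type else 0.0
    r_point = 0.0
    if pred.point is not None and gt_box is not None:
        x, y = pred.point
        x1, y1, x2, y2 = gt_box
        r_point = 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0
    r_text = 0.0
    if pred.text is not None and gt.text is not None:
        r_text = 1.0 if semantic_f1(pred.text, gt.text) > 0.5 else 0.0
    return r_act + r_point + r_text

def total_reward(response: str, pred: Action, gt: Action, gt_box, alpha=0.1, beta=1.0) -> float:
    """R_o = alpha * R_f + beta * R_acc (alpha/beta values here are placeholders)."""
    return alpha * format_reward(response) + beta * accuracy_reward(pred, gt, gt_box)
```

Consistent with the ablation finding that accuracy rewards should outweigh format rewards, the placeholder weights set `beta` larger than `alpha`.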
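
Step 4's group-relative advantage can be sketched just as compactly: each sampled response's reward is standardized against the mean and standard deviation of the rewards of all responses generated for the same input. The `eps` term and function name below are assumptions, and the rest of the GRPO update (clipped policy ratios, KL regularization) is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """A_i = (R_i - mean(R)) / (std(R) + eps), computed within one group of
    responses sampled for the same (screenshot, instruction, history) input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rule-based rewards for 8 candidate responses to one input.
rewards = [2.1, 0.0, 1.1, 2.1, 0.1, 1.0, 2.1, 0.0]
advantages = group_relative_advantages(rewards)
# Responses with above-average reward receive positive advantages and are reinforced;
# below-average responses are pushed down during the policy update.
```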