Analysis of GUI-G1: Advancements in R1-Zero-Like Training for Visual Grounding in GUI Agents
The paper presents a systematic analysis of R1-Zero-like training for GUI agents, a paradigm that couples reinforcement learning (RL) with explicit reasoning for visual grounding tasks. It examines three components of the training pipeline in detail: input design, output evaluation, and policy update. Each is identified as a potential source of performance loss when general-purpose RL methods are applied to GUI grounding without adaptation.
Key Insights and Methodological Innovations
The authors contend that the prevailing emphasis on chain-of-thought reasoning in input design is counterproductive for GUI grounding. They show empirically that longer reasoning chains degrade performance, largely by diverting attention from the visual evidence needed for precise localization. This motivates the Fast Thinking Template, a simplified prompt that asks the model to output the answer directly, bypassing unnecessary intermediate reasoning. The finding echoes prior work suggesting that explicit reasoning does not benefit all tasks equally.
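To make the input-design change concrete, the Python sketch below contrasts a chain-of-thought grounding prompt with a fast-thinking prompt that asks for the answer directly. The template wording and the build_prompt helper are illustrative assumptions, not the exact prompts used in GUI-G1.

```python
# Hypothetical contrast between a chain-of-thought grounding prompt and a
# "fast thinking" prompt that requests the answer directly. The exact wording
# used in GUI-G1 may differ; this only illustrates the idea.

SLOW_THINKING_TEMPLATE = (
    "You are a GUI agent. Given the screenshot and the instruction: '{instruction}', "
    "first reason step by step inside <think>...</think>, then output the target "
    "bounding box inside <answer>[x1, y1, x2, y2]</answer>."
)

FAST_THINKING_TEMPLATE = (
    "You are a GUI agent. Given the screenshot and the instruction: '{instruction}', "
    "directly output the target bounding box as <answer>[x1, y1, x2, y2]</answer> "
    "with no additional reasoning."
)

def build_prompt(instruction: str, fast: bool = True) -> str:
    """Fill the chosen template with the grounding instruction."""
    template = FAST_THINKING_TEMPLATE if fast else SLOW_THINKING_TEMPLATE
    return template.format(instruction=instruction)
```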
Output Evaluation: Addressing Reward Hacking
In the output evaluation phase, the paper shows that hit-based and IoU-based reward functions invite different forms of reward hacking: models learn to produce disproportionately small or large bounding boxes, trading accuracy against overlap quality. To mitigate this, the authors add a box-size constraint to the reward function, regularizing predicted box sizes and balancing the two reward signals.
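The sketch below illustrates how a box-size constraint can be folded into a grounding reward that combines hit and IoU signals. The specific penalty form, the size_tol threshold, and the equal weighting of the two terms are assumptions for illustration; the paper's exact reward formulation may differ.

```python
# Boxes are (x1, y1, x2, y2) in pixels.

def iou(pred, gt):
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

def hit(pred, gt):
    """1.0 if the predicted box's center falls inside the ground-truth box."""
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return float(gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3])

def grounding_reward(pred, gt, size_tol=2.0):
    """Hit + IoU reward, zeroed when the predicted box is much larger or
    smaller than the ground truth (a simple box-size constraint)."""
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    ratio = area_pred / area_gt if area_gt > 0 else float("inf")
    size_ok = (1.0 / size_tol) <= ratio <= size_tol
    return (hit(pred, gt) + iou(pred, gt)) * float(size_ok)
```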
Policy Update: Mitigating Bias in GRPO
The authors then scrutinize the GRPO policy-update algorithm and identify two biases: a response-level length bias and a query-level difficulty bias. The length bias causes the update to favor longer responses, particularly incorrect ones, and is addressed by normalizing the token-level contribution to the loss. The difficulty bias shifts optimization toward easier samples, weakening learning on challenging cases; the authors counter it with a difficulty-aware scaling factor that upweights harder samples.
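The following NumPy sketch shows the two GRPO adjustments in isolation: group-relative advantages rescaled by a difficulty weight, and a policy loss averaged over a fixed token budget rather than each response's own length. The functions grpo_advantages and policy_loss, the weight (eps + difficulty) ** difficulty_gamma, and the fixed max_tokens normalizer are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def grpo_advantages(rewards, difficulty_gamma=1.0, eps=1e-6):
    """Group-relative advantages with a difficulty-aware scaling factor.

    rewards: one scalar reward per sampled response to the same query.
    Assumes rewards lie in [0, 1], so a low group mean indicates a hard query.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()                 # group-relative baseline
    advantages = rewards - baseline
    difficulty = 1.0 - baseline               # harder query -> larger value
    weight = (eps + difficulty) ** difficulty_gamma
    return weight * advantages                # upweight hard queries

def policy_loss(per_token_terms, advantages, max_tokens):
    """Policy-gradient surrogate averaged over a fixed token budget.

    per_token_terms[i] is a 1-D array of per-token surrogate terms (e.g.
    importance ratios) for response i. Dividing by a constant max_tokens,
    rather than by each response's own length, removes the length bias.
    """
    total = 0.0
    for terms, adv in zip(per_token_terms, advantages):
        total += -(adv * float(np.sum(terms))) / max_tokens
    return total / len(advantages)
```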
With these changes in place, the resulting GUI-G1-3B model achieves 90.3% accuracy on the ScreenSpot benchmark and 37.1% on ScreenSpot-Pro, surpassing many prior models, including larger ones and those trained with conventional R1-style RL. Notably, it reaches these results with only 17K training samples, underscoring the efficiency of the revised training recipe.
Implications and Future Directions
The implications are both theoretical and practical. Theoretically, the work challenges the assumed benefit of chain-of-thought reasoning in visual grounding and proposes targeted optimization fixes for common biases in RL training pipelines. Practically, these improvements can make GUI agents more robust in real-world applications that demand precise visual grounding, without the cost of lengthy reasoning traces.
Future research could extend these findings to other domains that utilize GUI agents, exploring the universality of these methods across diverse environments and interaction contexts. Additionally, expanding the scope of RL analysis beyond online reinforcement learning could yield further insights into training regimes and hyperparameter settings, ultimately advancing the generalization and robustness of multimodal learning models.
In conclusion, GUI-G1 represents a significant advancement in GUI agent training, providing a refined approach to R1-Zero-like learning that mitigates previous performance limitations through thoughtful methodological enhancements and empirical validation.