Insights into ProgRM: Progress Reward Model for GUI Agents in Online RL
The paper "ProgRM: Build Better GUI Agents with Progress Rewards" presents an innovative approach targeting the improvement of GUI agents through reinforcement learning (RL) by utilizing a Progress Reward Model (ProgRM). The authors conceptualize a framework that aims to overcome the challenges associated with training LLM-based GUI agents, particularly focusing on the scarcity of high-quality training data and the inefficiencies of existing reward models in capturing granular task progress.
Key Contributions and Methodology
At the heart of the proposed framework is ProgRM, which provides dense intermediate rewards by estimating task progress at each step rather than only at the trajectory's conclusion. This contrasts with traditional Outcome Reward Models (ORMs), which consider only the final success or failure of a task. The authors argue that an ORM over-penalizes partial progress in trajectories that ultimately fail, which hurts exploration efficiency in long-horizon tasks. ProgRM addresses this by predicting a task-completion value at each step, improving learning stability and efficiency.
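As a concrete illustration, here is a minimal Python sketch of how dense rewards could be derived from per-step progress estimates. It assumes a reward-shaping formulation in which each step's reward is the change in estimated progress, plus an optional bonus on success; the paper's exact reward definition may differ, and all names below are illustrative.

```python
from typing import List, Sequence


def progress_rewards(
    progress_estimates: Sequence[float],
    final_success: bool,
    success_bonus: float = 1.0,
) -> List[float]:
    """Turn per-step progress estimates (each in [0, 1]) into dense rewards.

    Each intermediate reward is the *change* in estimated progress, so the
    agent is still credited for moving a task forward even when the episode
    ultimately fails. A bonus is added at the end of successful trajectories.
    """
    rewards: List[float] = []
    prev = 0.0
    for p in progress_estimates:
        rewards.append(p - prev)  # credit (or penalize) the progress delta
        prev = p
    if rewards and final_success:
        rewards[-1] += success_bonus
    return rewards


# A failed trajectory still yields useful shaping signal early on.
estimates = [0.1, 0.35, 0.6, 0.55]  # e.g. ProgRM outputs at each step
print(progress_rewards(estimates, final_success=False))
# ~ [0.1, 0.25, 0.25, -0.05] (up to floating-point rounding)
```

Under an outcome-only reward, the same failed trajectory would receive no positive signal anywhere, which is exactly the over-penalization of partial progress the authors criticize.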
To annotate progress labels efficiently, the authors introduce a self-annotation algorithm based on the Longest Common Subsequence (LCS). The algorithm extracts common action patterns, termed recipes, from successful trajectories and uses them to identify key steps and assign progress labels to unseen trajectories. This sidesteps both costly expert annotation and inefficient Monte Carlo search.
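The sketch below illustrates the general idea under simple assumptions: the recipe is approximated as the LCS of successful action sequences, and a step's progress label is the fraction of recipe (key) steps matched so far. The matching scheme, function names, and toy actions are hypothetical and only approximate the paper's actual algorithm.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def lcs(a: Sequence[T], b: Sequence[T]) -> List[T]:
    """Longest common subsequence via standard dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
            )
    # Backtrack to recover one LCS.
    out: List[T] = []
    i, j = m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def progress_labels(trajectory: Sequence[T], recipe: Sequence[T]) -> List[float]:
    """Label each step with the fraction of recipe (key) steps matched so far."""
    labels: List[float] = []
    matched = 0
    for step in trajectory:
        if matched < len(recipe) and step == recipe[matched]:
            matched += 1  # this step is the next key step in the recipe
        labels.append(matched / len(recipe))
    return labels


# Toy example with abstracted GUI actions.
success_a = ["open_app", "search", "click_result", "scroll", "tap_button"]
success_b = ["open_app", "search", "click_result", "tap_button"]
recipe = lcs(success_a, success_b)  # shared key steps of the two successes
new_traj = ["open_app", "misclick", "search", "click_result"]
print(progress_labels(new_traj, recipe))  # [0.25, 0.25, 0.5, 0.75]
```

In this toy setting an incomplete trajectory still receives graded labels, which is precisely the kind of dense supervision ProgRM is trained on.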
Experimental Results and Implications
The approach is validated extensively on the WikiHow task set, a real-world GUI interaction benchmark, where ProgRM-trained actors outperform both proprietary models and ORM-based RL baselines. In particular, actors trained with ProgRM achieve higher success rates, especially on challenging in-page tasks where fine-grained actions are crucial. These results underline ProgRM's ability to provide meaningful feedback that improves RL training efficiency.
Further analysis shows that the reward model estimates partial task completion reliably, which is instrumental for navigating complex environments. This capability not only supports more robust training but also opens pathways for future work on online RL for GUI tasks.
Considerations and Potential Developments
While ProgRM demonstrates clear improvements in training GUI agents, several aspects merit further exploration. The gap between LCS-based and environment-reward-based progress labels suggests room for refinement, particularly in automatic key-step discovery. Extending ProgRM to a broader range of GUI environments and interaction tasks would also help establish its robustness and adaptability.
Moreover, the paper raises significant considerations regarding the responsible deployment of enhanced GUI agents. The automation capabilities afforded by improved agents entail risks of misuse or unintended consequences, emphasizing the importance of ethical guidelines and security measures in real-world implementations.
Conclusion
The development of ProgRM represents a significant advance in the learning efficiency and effectiveness of LLM-based GUI agents trained with reinforcement learning. By supplying fine-grained progress signals, ProgRM improves exploration and task learning in dynamic environments. The work has practical implications for GUI-agent training pipelines and contributes to the broader understanding of reward design in RL. As the field progresses, further innovations in reward modeling and GUI agent applications are likely to build on the insights and methods established in this paper.