UItron: Foundational GUI Agent with Advanced Perception and Planning

Published 29 Aug 2025 in cs.CV | (2508.21767v1)

Abstract: GUI agents aim to enable automated operations on Mobile/PC devices, which is an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories, the availability of interactive infrastructure, and the limitation of initial capabilities in foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systemic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to enhance training effects, but also establishes an interactive environment connecting both Mobile and PC devices. In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration for online environments. As a result, UItron achieves superior performance in benchmarks of GUI perception, grounding, and planning. In particular, UItron highlights the interaction proficiency with top-tier Chinese mobile APPs, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build the offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.


Authors (10)

Summary

The paper introduces UItron, a foundational GUI agent that integrates advanced perception and planning for both Mobile and PC platforms.
It employs a three-stage training paradigm combining perception tasks, planning with backtracking, and curriculum reinforcement learning to optimize performance.
Experimental results demonstrate UItron’s robust performance in grounding and complex planning scenarios, particularly in Chinese mobile applications.

UItron: Foundational GUI Agent with Advanced Perception and Planning

UItron is introduced as a comprehensive and open-source foundational model designed to advance the capabilities of GUI agents across various interactive environments, particularly focusing on Mobile and PC platforms. This essay examines the detailed methodologies, experimental results, and implications of UItron’s development.

Core Capabilities and System Architecture

UItron integrates advanced capabilities in GUI perception, grounding, and planning. As presented in the paper, these features enable UItron to understand complex user interfaces, accurately locate the interface elements a task refers to, and execute action sequences across diverse scenarios. A highlight of UItron's system architecture is its reliance on a robust data engineering pipeline and a unified interactive infrastructure.

Figure 1: The core capabilities of GUI agent, including GUI perception, grounding, offline planning, and online planning.

UItron's architecture underscores the importance of systemic data engineering and an interactive environment, both of which the paper treats as foundational components for GUI agent development. The infrastructure facilitates scalable data collection, enabling dynamic training and testing of UItron in real-world scenarios.

Data Engineering and Interactive Infrastructure

A critical challenge in developing GUI agents is the scarcity of annotated operation trajectories and the need for an interactive infrastructure. UItron addresses these challenges through an extensive data engineering approach:

Perception Data: Consolidates multi-turn conversations and integrates multi-task UI-related perception data to enhance understanding capabilities.
Planning Data: Introduces multi-level reasoning (L1-L3) and backtracking for enhanced action prediction, enabling the model to learn both forward planning and backtracking methodologies effectively.
Distillation Data: Implements an automated trajectory collection process for real scenarios, reducing manual annotation costs.
Manual Annotation: Focuses on expanding capabilities in Chinese application scenarios through extensive manual data collection from top-tier Chinese mobile applications.
Figure 2: Overall introduction of data engineering.
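
The planning data described above can be pictured as two kinds of training samples cut from the same recorded trajectory. The sketch below is only an illustration under stated assumptions: the `Step` fields, the action strings, the `with_thought` flag (a stand-in for the reasoning levels, whose exact definitions are not given here), and the reading of backtracking as "recover the action that led to the current screen" are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    screenshot: str                 # identifier of the screen observed before acting
    action: str                     # hypothetical action string, e.g. "click(search_box)"
    thought: Optional[str] = None   # optional reasoning text attached to the step


def forward_sample(task: str, steps: List[Step], t: int, with_thought: bool) -> Dict:
    """Forward planning: predict the action at step t from the task and history."""
    target = steps[t].action
    if with_thought and steps[t].thought:
        # Richer reasoning level (assumed): emit the thought before the action.
        target = f"Thought: {steps[t].thought}\nAction: {target}"
    return {
        "task": task,
        "screenshot": steps[t].screenshot,
        "history": [s.action for s in steps[:t]],
        "target": target,
    }


def backtracking_sample(task: str, steps: List[Step], t: int) -> Dict:
    """Backtracking (assumed form): recover the action that led to screen t."""
    if t < 1:
        raise ValueError("backtracking needs at least one earlier step")
    return {
        "task": task,
        "screenshot": steps[t].screenshot,
        "previous_screenshot": steps[t - 1].screenshot,
        "target": steps[t - 1].action,
    }


# A two-step toy trajectory, invented for illustration.
trajectory = [
    Step("screen_0", "click(search_box)", thought="Open the search field first."),
    Step("screen_1", "type('coffee')"),
]
samples = [
    forward_sample("Search for coffee", trajectory, 1, with_thought=False),
    backtracking_sample("Search for coffee", trajectory, 1),
]
```

Cutting both sample types from one trajectory means the same annotated steps supervise the model in two directions, which is one plausible way to get more training signal out of scarce trajectories.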

The interactive infrastructure connects both Mobile and PC devices, simplifying data collection and offering a realistic evaluation environment. The platform records the agent's action outputs during training and evaluation, forming a basis for online reinforcement learning.

Figure 3: Overall introduction of interactive infrastructure.
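
To make the role of the infrastructure concrete, the following minimal sketch shows a rollout loop that records the agent's actions against a device-like environment. Everything here is hypothetical: the `DeviceEnv` interface, the `ScriptedEnv` toy device, and the action strings are assumptions made for illustration and do not describe UItron's actual infrastructure.

```python
from typing import Callable, List, Protocol, Tuple


class DeviceEnv(Protocol):
    """Hypothetical interface a Mobile or PC backend could expose."""

    def reset(self, task: str) -> str: ...
    def step(self, action: str) -> Tuple[str, bool]: ...


class ScriptedEnv:
    """Toy stand-in for a real device: the episode ends on 'done()'."""

    def __init__(self) -> None:
        self.t = 0

    def reset(self, task: str) -> str:
        self.t = 0
        return "screen_0"

    def step(self, action: str) -> Tuple[str, bool]:
        self.t += 1
        return f"screen_{self.t}", action == "done()"


def rollout(
    env: DeviceEnv,
    policy: Callable[[str, str], str],
    task: str,
    max_steps: int = 10,
) -> List[Tuple[str, str]]:
    """Run one episode and record (observation, action) pairs."""
    obs = env.reset(task)
    trajectory: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        action = policy(task, obs)
        next_obs, done = env.step(action)
        trajectory.append((obs, action))
        obs = next_obs
        if done:
            break
    return trajectory


def toy_policy(task: str, obs: str) -> str:
    # Click until the third screen is reached, then stop.
    return "done()" if obs == "screen_2" else "click(1, 1)"


episode = rollout(ScriptedEnv(), toy_policy, "Open settings")
```

Because the loop depends only on the `reset`/`step` interface, the same recording code could in principle sit in front of either a Mobile or a PC backend, and the recorded pairs could serve both as evaluation logs and as input to online reinforcement learning.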

Training Paradigm and Methodologies

UItron employs a three-stage training strategy:

Perception Tasks: Enhances GUI perception through tasks such as grounding, captioning, VQA, and OCR.
Planning Tasks: Trains on formatted outputs with an auto-regressive generative loss, covering both next-action and historical-action prediction, and integrates reasoning capabilities through backtracking.
Curriculum Reinforcement Learning: Facilitates complex reasoning and exploration via group relative policy optimization (GRPO), using dense rewards in offline environments and task-level rewards in online environments.
Figure 4: The overall architecture of Mobile infra.
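
The reinforcement learning stage above relies on GRPO, whose central step is to score each sampled response relative to the other responses in its group. The sketch below shows that normalization together with two illustrative reward functions. The reward definitions are assumptions: the field-matching dense reward and the binary task reward are hypothetical stand-ins, not the rewards used in the paper, and the use of the population standard deviation is a choice made for this sketch.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def dense_step_reward(pred: Dict[str, str], gold: Dict[str, str]) -> float:
    """Hypothetical offline reward: fraction of gold action fields reproduced."""
    if not gold:
        return 0.0
    matched = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return matched / len(gold)


def task_level_reward(task_succeeded: bool) -> float:
    """Hypothetical online reward: one sparse signal for the whole episode."""
    return 1.0 if task_succeeded else 0.0


# Offline example: four sampled actions for the same step, scored densely.
gold_action = {"type": "click", "target": "search_button"}
sampled = [
    {"type": "click", "target": "search_button"},
    {"type": "click", "target": "back_button"},
    {"type": "scroll", "target": "list"},
    {"type": "click", "target": "search_button"},
]
rewards = [dense_step_reward(p, gold_action) for p in sampled]
advantages = group_relative_advantages(rewards)
```

When every sample in a group receives the same reward, the advantages are all zero and the group contributes no gradient signal. That is one reason a dense reward can be useful early on: it is more likely to separate samples than a single success flag.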

Experimental Results

UItron's performance was validated across a range of benchmarks, demonstrating significant advancements in GUI perception, grounding, and planning capabilities:

GUI Perception: Outperformed state-of-the-art models in grounding tasks, establishing a strong base for subsequent planning stages.
Offline Planning: Demonstrated superior results in complex AndroidControl and GUI-Odyssey tasks, emphasizing its robustness in scenario generalization.
Online Planning: Achieved competitive results in OSWorld, a benchmark evaluating agent performance in real computer environments.
Chinese Scenario: Excelled in both offline and online environments, highlighting the efficacy of comprehensive data expansion in Chinese mobile applications.
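
For readers unfamiliar with how grounding results are typically scored, the sketch below implements a commonly used convention: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. This is an assumption for illustration, since the text above does not state the exact metric behind the reported results, and the function names and sample coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: List[Point], targets: List[Box]) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(predictions)


# Invented sample: two of the three predicted points hit their element.
predicted_points = [(50.0, 20.0), (300.0, 410.0), (10.0, 10.0)]
target_boxes = [(40.0, 10.0, 120.0, 40.0), (280.0, 400.0, 360.0, 440.0), (100.0, 100.0, 200.0, 150.0)]
accuracy = grounding_accuracy(predicted_points, target_boxes)
```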

UItron’s scalability and adaptability were apparent, with larger model versions performing consistently better across tasks due to enhanced data engineering.

Conclusion and Future Directions

UItron represents a significant step forward in GUI agent research, offering a powerful, open-source foundation for future development. By emphasizing data-driven methodologies and interactive training environments, UItron narrows existing gaps in GUI agent capabilities, particularly for underrepresented languages and platforms. Future research could explore intrinsic thinking patterns in UItron’s decision-making and the potential for multi-agent cooperation, addressing limitations in handling both visual and textual elements comprehensively. Furthermore, integration with advanced tool-use and coding capabilities might extend UItron’s application beyond traditional GUI environments.
