GUI-Owl Automation Model
- GUI-Owl is a comprehensive GUI automation model that unifies visual perception, natural language grounding, and procedural planning across diverse environments.
- It employs self-evolving trajectory data generation and trajectory-aware reinforcement learning to optimize performance on mobile and desktop benchmarks.
- The open-source model supports both standalone and multi-agent frameworks, enabling advances in automated testing, robotic process automation, and dynamic task planning.
The GUI-Owl model is a foundational large-scale agent for graphical user interface (GUI) automation. It unifies visual perception, natural language grounding, procedural planning, advanced reasoning, and action execution within one architecture and is designed for general-purpose automation across desktop and mobile platforms. GUI-Owl reports state-of-the-art results among open-source models on ten public benchmarks and can be used either as a standalone agent or as a module inside multi-agent frameworks. Its core innovations include large-scale environment infrastructure for self-evolving trajectory data generation, diverse foundational agent capabilities, and scalable reinforcement learning via trajectory-aware optimization. GUI-Owl, together with frameworks like Mobile-Agent-v3, is open-sourced for further research and deployment.
1. Architecture and Core Innovations
GUI-Owl is architected as an end-to-end interactive agent operating over GUI environments. It unifies several foundational capabilities:
- Visual Perception: Processes raw screenshots, element layouts, and context information from Android, Ubuntu, macOS, and Windows virtual environments.
- Natural Language Grounding: Incorporates advanced instruction parsing and semantic linking to UI elements.
- Planning and Action Semantics: Decomposes complex GUI tasks into subgoals and integrates procedural knowledge about before–after state transitions.
- Reasoning Patterns: Utilizes hint-guided rejection sampling and multi-agent distillation for multi-turn reflection and error correction through specialized agents (e.g., Reflector, Notetaker).
A significant innovation is the Self-Evolving GUI Trajectory Production framework. This system automates high-quality data generation via a cloud-based infrastructure covering diverse operating systems. It supports automated query generation (using annotated navigation graphs and LLM-based instruction synthesis), self-rollout of the agent in dynamic virtual environments, and two-tiered trajectory correctness judgment (step-level and trajectory-level). These mechanisms enable iterative refinement of interaction trajectories, resulting in a self-reinforcing data generation loop with reduced manual annotation requirements.
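The two-tiered correctness judgment can be pictured as a filter applied to each rolled-out trajectory. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Step` and `Trajectory` types, the `step_ok` and `trajectory_ok` critics, and the `min_step_ratio` threshold are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    action: str        # e.g. "tap(search_box)"
    progressed: bool   # hypothetical verdict from a step-level critic


@dataclass
class Trajectory:
    query: str
    steps: List[Step] = field(default_factory=list)
    task_completed: bool = False  # hypothetical trajectory-level verdict


def judge(
    traj: Trajectory,
    step_ok: Callable[[Step], bool],
    trajectory_ok: Callable[[Trajectory], bool],
    min_step_ratio: float = 0.8,  # assumed threshold, not from the source
) -> bool:
    """Accept a trajectory only if both tiers consider it usable."""
    if not traj.steps:
        return False
    # Tier 1: fraction of individual steps the step-level critic accepts.
    good = sum(1 for s in traj.steps if step_ok(s))
    step_pass = good / len(traj.steps) >= min_step_ratio
    # Tier 2: whole-trajectory verdict (did the task actually finish?).
    return step_pass and trajectory_ok(traj)
```

Requiring both tiers to agree is one possible design choice: a trajectory that reaches the goal through mostly unhelpful steps, or one with clean steps that never finishes the task, is kept out of the training set.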
2. Performance Metrics and Benchmarks
GUI-Owl sets state-of-the-art records among open-source models on ten public benchmarks spanning mobile and desktop domains.
| Model | Parameters | AndroidWorld | OSWorld | OSWorld after RL (TRPO) | OSWorld-Verified |
|---|---|---|---|---|---|
| GUI-Owl-7B | 7B | 66.4 | 29.4 | 34.9 | - |
| Mobile-Agent-v3 | 7B | 73.3 | 37.7 | - | 37.7 |
GUI-Owl demonstrates robust performance with 66.4 on AndroidWorld and 29.4 on OSWorld for the 7B model. RL-tuned variants reach 34.9 on OSWorld. When integrated into the multi-agent Mobile-Agent-v3 framework, scores increase to 73.3 on AndroidWorld and 37.7 on OSWorld. These benchmarks encompass grounding (UI element identification), question answering, planning, decision-making, and procedural knowledge assessment.
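The improvements quoted above are plain differences in benchmark points. As a small worked example, the snippet below recomputes them from the reported scores; the dictionary keys (including the label `GUI-Owl-7B (RL)`) and the `gain` helper are illustrative names, not identifiers from the source.

```python
# Scores as reported in the text above.
scores = {
    "GUI-Owl-7B": {"AndroidWorld": 66.4, "OSWorld": 29.4},
    "GUI-Owl-7B (RL)": {"OSWorld": 34.9},
    "Mobile-Agent-v3": {"AndroidWorld": 73.3, "OSWorld": 37.7},
}


def gain(baseline: str, variant: str, benchmark: str) -> float:
    """Absolute gain in points of `variant` over `baseline` on one benchmark."""
    return round(scores[variant][benchmark] - scores[baseline][benchmark], 1)


print(gain("GUI-Owl-7B", "GUI-Owl-7B (RL)", "OSWorld"))       # 5.5
print(gain("GUI-Owl-7B", "Mobile-Agent-v3", "AndroidWorld"))  # 6.9
print(gain("GUI-Owl-7B", "Mobile-Agent-v3", "OSWorld"))       # 8.3
```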
3. Scalable Reinforcement Learning: Trajectory-aware Relative Policy Optimization
GUI-Owl uses a scalable RL framework with a fully asynchronous rollout and update design. A technical centerpiece is Trajectory-aware Relative Policy Optimization (abbreviated TRPO in the source; not to be confused with Trust Region Policy Optimization, which shares the acronym), which stabilizes multi-turn policy learning in long-horizon, variable-length GUI sequences.
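- Advantage per trajectory:
$$\hat{A}_{\tau} = \frac{R(\tau) - \bar{R}}{\sigma_R}$$
where $R(\tau)$ is the cumulative trajectory reward, $\bar{R}$ is the mean over the batch, and $\sigma_R$ is the standard deviation.
- Policy loss across the trajectory tokens, written in standard clipped-surrogate form:
$$\mathcal{L}_{\mathrm{TRPO}} = -\frac{1}{N} \sum_{i=1}^{G} \sum_{t=1}^{T_i} \min\left[ r_{i,t}(\theta)\,\hat{A}_{\tau_i},\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{\tau_i} \right]$$
where $G$ is the number of trajectories in the batch, $T_i$ is the number of tokens in trajectory $\tau_i$, $N = \sum_i T_i$ is the total token count, $r_{i,t}(\theta)$ is the probability ratio between the current policy and the rollout policy for token $t$, and $\epsilon$ is the clipping range.
This formulation applies the normalized trajectory-level advantage to every token of that trajectory and uses a clipped ratio to stabilize off-policy learning.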
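To make the trajectory-level normalization and the clipped ratio concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions rather than the released training code: the function names, the small `eps` added to the denominator for numerical safety, and the default `clip=0.2` are not taken from the source.

```python
import math
from typing import List


def trajectory_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each trajectory's cumulative reward against the batch."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_token_loss(
    logp_new: List[List[float]],  # per trajectory: token log-probs, current policy
    logp_old: List[List[float]],  # per trajectory: token log-probs, rollout policy
    advantages: List[float],      # one advantage per trajectory
    clip: float = 0.2,            # assumed clipping range
) -> float:
    """Clipped surrogate loss, averaged over all tokens in the batch."""
    total, n_tokens = 0.0, 0
    for new_t, old_t, adv in zip(logp_new, logp_old, advantages):
        for ln, lo in zip(new_t, old_t):
            ratio = math.exp(ln - lo)
            clipped = max(1.0 - clip, min(1.0 + clip, ratio))
            # Every token reuses the advantage of the trajectory it belongs to.
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    return -total / n_tokens
```

Because the advantage is shared by all tokens in a trajectory, a long successful trajectory contributes more terms to the sum than a short one; averaging over the total token count keeps the loss scale comparable across batches with different trajectory lengths.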
4. Modular Multi-Agent Integration and System Design
GUI-Owl can function as a standalone agent or be embedded within multi-agent frameworks. The Mobile-Agent-v3 integration organizes roles as:
- Manager Agent: Instruction decomposition into subgoals.
- Worker Agent: Execution of concrete GUI actions.
- Reflector Agent: Trajectory assessment and correction feedback.
- Notetaker Agent: Transient information recording for long-term coordination.
A formalized state representation includes the device state, subgoal lists, action tuples, reflection feedback, and note records. The agents collaborate, iterating through plan updates and action executions until the task is completed or a timeout is reached.
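The role split and the iterate-until-done-or-timeout control flow can be sketched as a small loop. This is a hypothetical illustration, not the Mobile-Agent-v3 API: `AgentState`, `run_episode`, the callable signatures, and the toy agents are all assumed names, and real agents would call a model and a device instead of returning strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AgentState:
    device_state: str
    subgoals: List[str] = field(default_factory=list)
    last_action: Optional[Tuple[str, str]] = None  # hypothetical (action, target)
    feedback: str = ""
    notes: List[str] = field(default_factory=list)


def run_episode(instruction, manager, worker, reflector, notetaker, max_steps=10):
    """Loop over plan update -> action -> reflection -> note until done or timeout."""
    state = AgentState(device_state="initial")
    for _ in range(max_steps):
        state.subgoals = manager(instruction, state)   # Manager: (re)plan
        if not state.subgoals:                         # nothing left: task done
            return state, True
        state.last_action, state.device_state = worker(state.subgoals[0], state)
        state.feedback = reflector(state)              # Reflector: assess the step
        note = notetaker(state)                        # Notetaker: keep key info
        if note:
            state.notes.append(note)
    return state, False                                # timed out


# Toy agents for demonstration only.
PLAN = ["open_settings", "toggle_wifi"]


def toy_manager(instruction, state):
    return [g for g in PLAN if g not in state.notes]


def toy_worker(subgoal, state):
    return ("tap", subgoal), f"after:{subgoal}"


def toy_reflector(state):
    return "ok"


def toy_notetaker(state):
    return state.last_action[1]


final_state, finished = run_episode(
    "turn on wifi", toy_manager, toy_worker, toy_reflector, toy_notetaker
)
print(finished, final_state.notes)
```

Passing the agents in as callables keeps the loop independent of how each role is implemented, which mirrors the idea that one underlying model can serve several roles.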
5. Self-Evolving Data Generation and Validation
GUI-Owl’s data pipeline is anchored on a self-evolving loop, producing validated high-quality interaction data:
- Query Generation: Synthesizes navigation tasks using expert graphs and automated instruction expansion.
- Trajectory Rollout: Agent performs interactions in virtualized environments per generated queries.
- Correctness Validation: Dual-level step and trajectory validators assess progress toward completion; guidance modules offer actionable improvements.
- Multi-Domain Coverage: Data spans Android, Ubuntu, macOS, and Windows, supporting robust cross-platform generalization.
This automated trajectory production approach allows for continuous improvement, reduces annotation effort, and supports real-world environment alignment.
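The self-reinforcing character of the loop (roll out, validate, retrain, repeat) can be shown with a toy simulation. Everything in this sketch is an assumption made for illustration: the `self_evolve` function, its callable arguments, and the toy setup in which a policy is just an integer skill level and a query is a `(name, difficulty)` pair.

```python
from typing import Callable, List, Tuple

Query = Tuple[str, int]         # (name, difficulty), toy representation
Rollout = Tuple[Query, bool]    # (query, success flag)


def self_evolve(
    queries: List[Query],
    rollout: Callable[[int, Query], Rollout],
    validate: Callable[[Rollout], bool],
    train: Callable[[int, List[Rollout]], int],
    policy: int,
    rounds: int = 3,
):
    """Each round: roll out every query, keep validated data, retrain."""
    dataset: List[Rollout] = []
    for _ in range(rounds):
        for q in queries:
            traj = rollout(policy, q)
            if validate(traj):            # only validated trajectories are kept
                dataset.append(traj)
        policy = train(policy, dataset)   # the improved policy drives the next round
    return policy, dataset


# Toy components: a policy solves a query if its skill covers the difficulty.
def toy_rollout(policy: int, q: Query) -> Rollout:
    return q, policy >= q[1]


def toy_validate(traj: Rollout) -> bool:
    return traj[1]


def toy_train(policy: int, dataset: List[Rollout]) -> int:
    # Skill grows with the number of distinct queries solved so far.
    return 1 + len({traj[0] for traj in dataset})


queries = [("easy", 1), ("medium", 2), ("hard", 3)]
final_policy, data = self_evolve(queries, toy_rollout, toy_validate, toy_train, policy=1)
print(final_policy, len(data))
```

In this toy run the policy solves one more query each round, so the validated dataset grows without any manual labeling, which is the effect the pipeline aims for at scale.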
6. Applications and Use Cases
GUI-Owl’s capabilities enable a range of automation tasks:
- Automated Testing: End-to-end validation and regression testing across platforms.
- Robotic Process Automation: Bulk data entry, routine GUI operations, and customer support interactions.
- Task Planning: Multi-step plan synthesis, dynamic adaptation to GUI state transitions, long-horizon task completion.
- Distributed Autonomous Systems: Integration into intelligent assistants using multi-agent role sharing across devices.
Multi-agent modularity facilitates coordinated multi-turn operations and error correction, enhancing performance on complex, long-horizon interactions.
7. Open Source Availability and Future Directions
GUI-Owl and Mobile-Agent-v3 are released open-source via the X-PLUG/MobileAgent repository (Ye et al., 21 Aug 2025), supporting reproducibility and further research.
The source material suggests that future work will likely focus on scaling the environment infrastructure, refining asynchronous RL, expanding multi-agent modularity, and continuing to improve the model through self-evolving data loops. Plausible implications include broader adoption for industrial-scale automation, more research into efficient trajectory validation, and further advances in distributed GUI agents.
Editor’s term: “GUI-Owl Model” is used here as a unifying reference for foundational, multi-capability GUI agents designed for desktop and mobile automation, as defined and evaluated in (Ye et al., 21 Aug 2025).