
GUI-Owl: Unified GUI Automation

Updated 23 August 2025
  • GUI-Owl is a multimodal agent for GUI automation that integrates vision-language perception, contextual grounding, and multi-turn planning in a unified policy network.
  • It employs a self-evolving data synthesis pipeline and asynchronous reinforcement learning to generate and validate diverse interaction trajectories across desktop and mobile platforms.
  • GUI-Owl establishes new benchmarks in both AndroidWorld and OSWorld, enabling scalable, robust automation and testing for complex GUI environments.

GUI-Owl is a foundational multimodal agent model purpose-built for end-to-end Graphical User Interface (GUI) automation across desktop and mobile environments. It consolidates perception, grounding, planning, procedural reasoning, and action execution into a unified policy network. GUI-Owl’s architecture and data infrastructure enable high proficiency in grounding, question answering, multi-turn planning, and real-world execution, with demonstrated state-of-the-art results among open-source systems on established GUI agent benchmarks (Ye et al., 21 Aug 2025).

1. System Architecture and Design Principles

GUI-Owl employs a native, fully end-to-end policy network constructed atop large-scale vision-language models (notably Qwen2.5-VL derivatives), post-trained on a diverse GUI-centric interaction corpus. The agent-centric architecture jointly encodes the GUI state, multimodal context (screenshots and UI descriptors), planning traces, intermediate reasoning steps, and action execution semantics.

The system encompasses three tightly integrated components:

  • Large-scale Environment Infrastructure: A cloud-based virtual environment supporting Android, Ubuntu, macOS, and Windows underpins a Self-Evolving GUI Trajectory Production pipeline. This framework continuously generates, validates, and refines high-quality demonstration and interaction data with minimal manual effort.
  • Unified Foundational Agent Capability: GUI-Owl supports UI element grounding, procedural and reactive planning, trajectory-aware reasoning, and action prediction in an integrated fashion, enabling both solo agent operation and participation as modules within larger multi-agent workflows.
  • Scalable Asynchronous Environment RL: Reinforcement learning is realized via decoupled, fully asynchronous policy optimization pipelines, allowing large-scale, continuous policy updating.

This design allows the model to directly learn from GUI state transitions, screenshots (vision), and historical interaction sequences, unifying perception and procedural planning.
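As a rough sketch of this unified design, a single policy call can map the current screenshot and interaction history to reasoning plus a grounded action. The names below (`PolicyStep`, `AgentState`, `gui_owl_step`) are illustrative and not the released API; the fixed return value stands in for a model forward pass:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PolicyStep:
    reasoning: str               # intermediate reasoning trace
    action: str                  # e.g. "click", "type", "scroll"
    target: Tuple[int, int]      # grounded screen coordinates

@dataclass
class AgentState:
    screenshot: bytes                                   # raw pixels of the current GUI
    history: List[PolicyStep] = field(default_factory=list)

def gui_owl_step(state: AgentState) -> PolicyStep:
    """One unified policy call: perception, grounding, reasoning, and
    action prediction happen in a single forward pass. The hard-coded
    return value below is a placeholder for the model's output."""
    step = PolicyStep(reasoning="locate the search box",
                      action="click", target=(120, 48))
    state.history.append(step)   # history feeds the next step's context
    return step
```

The key design point reflected here is that no separate perception, planner, or grounding module is invoked: one call consumes raw state plus history and emits a fully grounded action.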

2. Data Synthesis Pipeline and Self-Evolving Infrastructure

A pivotal innovation is the Self-Evolving GUI Trajectory Production infrastructure. This system autonomously operates within the multi-OS virtual environment, generating a wide range of interaction demonstrations:

  • Automated query generation mimics realistic user intents and task prompts.
  • GUI-Owl iteratively produces and executes action plans, recording before-and-after states (screenshots, accessibility tree, UI metadata).
  • Automated correctness validation is performed at both the individual action and trajectory levels, ensuring high-quality, validated training data.
  • Feedback from validation and deployment is integrated to iteratively refine the demonstration set, forming a self-improving, closed-loop data enhancement cycle.

This infrastructure supports diverse data pipelines—including grounding, question answering, multi-turn planning, and higher-level reasoning—substantially reducing manual annotation costs and capturing realistic, environment-aligned trajectories.
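The closed loop above can be sketched as a simple production function: generate a query, roll out a trajectory, validate at both the step and trajectory level, and keep only data that passes. The function and the stand-in validators below are hypothetical, not the paper's implementation:

```python
def produce_trajectories(generate_query, rollout, step_ok, traj_ok, n=10):
    """Closed-loop data production: a demonstration enters the training
    set only if every step and the whole trajectory pass validation."""
    dataset = []
    for _ in range(n):
        query = generate_query()              # synthetic user intent
        traj = rollout(query)                 # [(state_before, action, state_after), ...]
        if all(step_ok(step) for step in traj) and traj_ok(query, traj):
            dataset.append((query, traj))     # validated demonstration
    return dataset

# Toy run with trivial stand-in components:
data = produce_trajectories(
    generate_query=lambda: "open settings",
    rollout=lambda q: [("home", "tap settings icon", "settings")],
    step_ok=lambda step: step[2] is not None,
    traj_ok=lambda q, traj: traj[-1][2] == "settings",
    n=3,
)
```

In the real pipeline the validators are themselves model- and environment-driven, and rejected trajectories feed back into query generation, which is what makes the loop self-evolving.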

3. Model Capabilities: Grounding, Planning, and Decision-Making

GUI-Owl’s foundational abilities include:

  • UI Grounding: The model employs large-scale annotated datasets and active data pipelines to ground UI elements (text, icons, widgets) amidst densely packed and visually heterogeneous layouts.
  • Procedural and Decision Planning: Leveraging historical GUI trajectory data together with knowledge distilled from pretrained LLMs, the planner decomposes complex, long-horizon goals into robust, executable action plans.
  • Action Semantics: By modeling the mapping from (state, action) to (next state), GUI-Owl learns action-effect relationships and can reason about how sequences yield state changes.
  • Intermediate Reasoning: Advanced agent reasoning patterns, including multi-agent distillation and hint-guided rejection sampling, support robust multi-turn interactions and collaborative agent roles.
  • Modular Integration: GUI-Owl is natively deployable as a standalone agent or as a module (e.g., planner, executor, reader, notetaker) within multi-agent collaborative systems.

This enables not only direct automation (planning and execution) but also novel agent specialization and collaboration scenarios.
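The modular deployment described above amounts to one model serving several roles behind a thin dispatch layer. The role names follow the list above, but the prompt registry and dispatch mechanism here are illustrative placeholders:

```python
from typing import Dict

# One underlying model serves every role; only the instruction prefix differs.
# The prompts and dispatch mechanism here are illustrative placeholders.
ROLE_PROMPTS: Dict[str, str] = {
    "planner":   "Decompose the task into executable steps: ",
    "executor":  "Perform the next GUI action for: ",
    "notetaker": "Record the salient observation: ",
}

def call_model(prompt: str) -> str:
    # Stand-in for a GUI-Owl inference call.
    return f"<model output for: {prompt}>"

def run_role(role: str, payload: str) -> str:
    """Dispatch a payload to the shared model under a role-specific prompt."""
    return call_model(ROLE_PROMPTS[role] + payload)
```

Because the roles share one policy network, improvements from post-training and RL propagate to every role at once, rather than requiring separately trained specialist models.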

4. Reinforcement Learning and Trajectory-Aware Policy Optimization

Scalable real-world alignment is achieved using a custom reinforcement learning scheme:

  • The RL framework fully decouples experience generation from model (policy) optimization, facilitating asynchronous, high-throughput training.
  • Trajectory-Aware Relative Policy Optimization (TRPO): The key innovation here is the assignment of trajectory-level rewards, which are normalized and then uniformly attributed to all actions within a trajectory:

$$\Delta A_t = \frac{R(\tau) - \bar{R}}{\sigma_R + \epsilon}$$

The loss is given by

$$\mathcal{L}_{\mathrm{TRPO}} = -\frac{1}{N} \sum_{i=1}^{G} \sum_{s=1}^{S_i} \sum_{t=1}^{|o_{i,s}|} \min\Big[ r_t(\theta)\,\Delta A_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\Delta A_t \Big]$$

where $r_t(\theta)$ is the token-level importance ratio between the current and behavior policies and $N$ normalizes over the total number of action tokens.

  • This approach effectively mitigates the sparse reward problem inherent in long-horizon GUI tasks, leading to high environment- and task-alignment.
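The advantage computation above is straightforward to express in code: normalize scalar trajectory rewards across the batch, then broadcast each trajectory's advantage to all of its actions. This is a minimal sketch of that step only (not the full clipped loss), with illustrative function names:

```python
import numpy as np

def trajectory_advantages(rewards, eps=1e-8):
    """Normalize scalar trajectory rewards R(tau) across the batch:
    Delta A = (R(tau) - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def per_action_advantages(rewards, traj_lengths, eps=1e-8):
    """Every action in a trajectory inherits the trajectory's advantage,
    densifying the sparse terminal reward over the whole horizon."""
    adv = trajectory_advantages(rewards, eps)
    return [np.full(n, a) for a, n in zip(adv, traj_lengths)]

# Three trajectories: one success (reward 1.0), two failures (0.0),
# of lengths 4, 2, and 3 actions respectively.
advs = per_action_advantages([1.0, 0.0, 0.0], traj_lengths=[4, 2, 3])
```

Uniform attribution means every action in a successful trajectory receives the same positive learning signal, which is what counters the sparse terminal rewards of long-horizon GUI tasks.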

Empirically, the GUI-Owl-7B RL-optimized variant achieves a score of 34.9 on the OSWorld benchmark, with further improvement (37.7) using the multi-agent 32B setting.

5. Benchmark Performance, Evaluation, and Open Source Contribution

GUI-Owl sets state-of-the-art scores among open-source GUI automation models:

  • AndroidWorld: 66.4 (GUI-Owl-7B), 73.3 (Mobile-Agent-v3 integration).
  • OSWorld: 29.4 (standard), 34.9 (RL-tuned), 37.7 (32B multi-agent).
  • These results establish new baselines for foundational agents in GUI automation, surpassing prior large-scale models in both accuracy and robustness.

Both GUI-Owl and Mobile-Agent-v3 are open sourced, including full models, environment orchestration code, trajectory data pipelines, and RL training frameworks (https://github.com/X-PLUG/MobileAgent), supporting rapid community research and real-world deployment.

Model / Framework           AndroidWorld   OSWorld
GUI-Owl-7B                  66.4           29.4
GUI-Owl-7B (TRPO-RL)        n/a            34.9
GUI-Owl-32B (multi-agent)   n/a            37.7
Mobile-Agent-v3             73.3           37.7

6. Applications and Future Research Directions

GUI-Owl serves as a general-purpose agent for GUI automation across a spectrum of platforms and applications, including:

  • Automated multi-turn GUI testing, question answering, and instruction following on Android, desktop (Windows, Ubuntu, macOS), and web.
  • Integration as specialized modules to coordinate with other agents in complex interactive systems.
  • Flexible adaptation through multi-agent task decomposition, allowing for robust handling of long-horizon and collaborative workflows.

The model and infrastructure support future expansion in several dimensions:

  • Enriching the diversity and realism of synthesized GUI trajectories for highly complex or novel user scenarios.
  • Incorporation of additional modalities or external knowledge sources to address atypical GUI environments.
  • Scaling RL and agent collaboration architectures for increasingly elaborate tasks, longer trajectories, and heterogeneous contexts.
  • Facilitating rapid baseline creation and innovation for next-generation GUI automation research.

GUI-Owl represents a convergence of large-scale data infrastructure, unified foundational agent design, and scalable RL, establishing a robust open-source standard and a fertile foundation for continued advances in GUI-centric intelligent agents (Ye et al., 21 Aug 2025).
