
Mobile-Agent-v3: Foundational Agents for GUI Automation (2508.15144v1)

Published 21 Aug 2025 in cs.AI

Abstract: This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.


Summary

  • The paper presents a modular multi-agent framework that unifies perception, reasoning, planning, and action execution for enhanced GUI automation.
  • It introduces a self-evolving trajectory data production pipeline and scalable reinforcement learning to improve performance across diverse benchmarks.
  • Empirical results demonstrate state-of-the-art scores on both desktop and mobile environments, validating the effectiveness of the GUI-Owl model.

Mobile-Agent-v3: Foundational Agents for GUI Automation

Overview and Motivation

Mobile-Agent-v3 introduces a comprehensive agentic framework for GUI automation, centered around the GUI-Owl model. The work addresses key limitations of prior approaches: weak generalization to unseen tasks, poor adaptability to dynamic environments, and insufficient integration with multi-agent frameworks. GUI-Owl is designed as a native end-to-end multimodal agent, unifying perception, grounding, reasoning, planning, and action execution within a single policy network. The system is evaluated across ten benchmarks spanning desktop and mobile environments, demonstrating robust performance in grounding, question answering, planning, decision-making, and procedural knowledge (Figure 1).

Figure 1: Performance overview on mainstream GUI-automation benchmarks.

System Architecture and Multi-Agent Framework

Mobile-Agent-v3 is built upon GUI-Owl and extends its capabilities through a modular multi-agent architecture. The framework supports both desktop and mobile platforms, leveraging a cloud-based virtual environment infrastructure for scalable data collection and training. The agentic system comprises specialized modules, whose coordination is sketched in code after the figures below:

  • Manager Agent: Strategic planner, decomposes high-level instructions into subgoals and dynamically updates plans based on feedback.
  • Worker Agent: Executes actionable subgoals, interacts with the GUI, and records reasoning and intent.
  • Reflector Agent: Evaluates outcomes, provides diagnostic feedback, and enables self-correction.
  • Notetaker Agent: Maintains persistent contextual memory, storing critical information for long-horizon tasks.
  • RAG Module: Retrieves external world knowledge to inform planning and execution.

Figure 2: Overview of Mobile-Agent-v3, illustrating multi-platform support and core capabilities.

Figure 3: Mobile-Agent-v3 architecture, detailing the six core modules and their interactions.
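A minimal sketch of how these modules might coordinate over one task episode is given below. It is illustrative only: the class and method names (manager.decompose, reflector.evaluate, and so on) are assumptions made for exposition, not the framework's published interface.

```python
# Illustrative coordination loop for Mobile-Agent-v3's agents.
# All interfaces here are hypothetical stand-ins, not the released API.
from dataclasses import dataclass, field

@dataclass
class Memory:
    notes: list = field(default_factory=list)  # Notetaker's persistent context

def run_task(instruction, env, manager, worker, reflector, notetaker, rag,
             max_steps=30):
    knowledge = rag.retrieve(instruction)             # external world knowledge
    plan = manager.decompose(instruction, knowledge)  # high-level subgoals
    memory = Memory()
    for _ in range(max_steps):
        subgoal = plan.current()
        obs = env.observe()                           # screenshot + UI state
        action, reasoning = worker.act(subgoal, obs, memory.notes)
        outcome = env.execute(action)
        verdict = reflector.evaluate(subgoal, obs, action, outcome)
        memory.notes.extend(notetaker.extract(obs, outcome))
        if verdict.success:
            plan.advance()
        else:
            plan = manager.replan(plan, verdict.feedback)  # dynamic plan update
        if plan.done():
            return True
    return False
```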

End-to-End GUI Interaction and Reasoning

GUI-Owl models GUI interaction as a multi-turn decision process, where the agent selects actions based on current observations and historical context. The interaction flow is structured to maximize reasoning transparency and adaptability:

  • System Message: Defines available action space.
  • User Message: Contains task instructions, compressed histories, and current observations.
  • Response Message: Includes agent's reasoning, action summaries, and final action output.

The agent is required to output explicit reasoning before executing actions; its conclusions are summarized and stored to manage context-length constraints (Figure 4). A sketch of this message assembly appears after the figure caption.

Figure 4: Illustration of the interaction flow of GUI-Owl, showing the structured message exchange.
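A minimal sketch of how such a three-part exchange might be assembled for a multimodal model call is shown below, assuming an OpenAI-style chat message format; the actual prompt schema used by GUI-Owl is not specified in this summary, so the field names and layout are assumptions.

```python
# Hypothetical assembly of GUI-Owl's multi-turn message structure.
# The exact schema is not published here; fields are illustrative.
def build_messages(action_space, task, history_summaries, screenshot_b64):
    return [
        {"role": "system",  # defines the available action space
         "content": f"You are a GUI agent. Available actions:\n{action_space}"},
        {"role": "user",    # task + compressed history + current observation
         "content": [
             {"type": "text",
              "text": f"Task: {task}\nPrevious steps: {'; '.join(history_summaries)}"},
             {"type": "image_url",
              "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
         ]},
    ]
# The response would carry explicit reasoning, a one-line action summary
# (stored as the compressed history for later turns), and the final action.
```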

Self-Evolving Trajectory Data Production

A key innovation is the self-evolving trajectory data production pipeline, which automates the generation and validation of high-quality interaction data. The pipeline leverages GUI-Owl's own capabilities to roll out trajectories, assess their correctness, and iteratively improve the model. The process comprises the following stages (a loop sketch follows Figure 5):

  • High-Quality Query Generation: DAG-based sampling for mobile apps, deep-search chains for desktop applications, and LLM-assisted synthesis.
  • Trajectory Correctness Judgment: Step-level and trajectory-level critics, combining textual and multimodal reasoning channels.
  • Query-Specific Guidance Generation: VLM-based action outcome descriptions and LLM-based guidance synthesis.

Figure 5: Illustration of the self-evolving trajectory data production pipeline.
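One iteration of this loop might look like the following sketch. Every component here (query_generator, step_critic, traj_critic) is a stand-in introduced for illustration; the released pipeline's actual interfaces may differ.

```python
# Sketch of one iteration of the self-evolving trajectory loop.
# All components are hypothetical stand-ins, not the released pipeline.
def self_evolve(model, envs, query_generator, step_critic, traj_critic,
                n_queries=1000):
    accepted = []
    for query in query_generator.sample(n_queries):   # DAG / deep-search queries
        traj = model.rollout(envs.allocate(), query)  # roll out current policy
        step_ok = all(step_critic.judge(s) for s in traj.steps)  # step-level check
        if step_ok and traj_critic.judge(traj):       # trajectory-level check
            accepted.append(traj)
    model.finetune(accepted)                          # close the self-improving loop
    return model
```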

Diverse Data Synthesis for Foundational Capabilities

The framework constructs specialized datasets for grounding, planning, and action semantics (an illustrative record format follows Figure 6):

  • Grounding: Combines open-source datasets, accessibility tree extraction, and dense region segmentation for precise UI element localization.
  • Task Planning: Distills procedural knowledge from historical trajectories and large-scale LLMs.
  • Action Semantics: Annotates pre- and post-action screenshots, requiring the model to predict actions and describe their effects.

Figure 6: Overview of the grounding data construction pipeline.
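As a concrete, hypothetical illustration, an action-semantics training record could be shaped as follows; the field names and action encoding are assumptions rather than the paper's schema.

```python
# Hypothetical shape of an action-semantics training example: the model sees
# the pre-action screenshot, must predict the action, and must describe the
# effect visible in the post-action screenshot. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class ActionSemanticsExample:
    pre_screenshot: bytes   # screen before the action
    post_screenshot: bytes  # screen after the action
    action: dict            # e.g. {"type": "click", "x": 540, "y": 1210}
    effect_description: str # e.g. "The settings menu opened, revealing ..."
```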

Scalable Reinforcement Learning Infrastructure

Mobile-Agent-v3 employs a scalable RL infrastructure, supporting asynchronous training and decoupled rollout–update processes. The RL framework unifies single-turn reasoning and multi-turn agentic training, enabling high-throughput experience generation and policy optimization. Trajectory-aware Relative Policy Optimization (TRPO) addresses the challenge of sparse, delayed rewards in long-horizon GUI tasks by distributing normalized trajectory-level advantages across all steps (Figure 7; a numerical sketch follows the caption).

Figure 7: Overview of the scalable RL infrastructure, highlighting parallelism and unified interfaces.
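The core idea of spreading a single trajectory-level reward across steps can be sketched in a few lines of Python. This is a minimal reconstruction from the description above, assuming GRPO-style normalization over a group of rollouts for the same query; the paper's exact estimator and any per-step weighting are not reproduced here.

```python
import numpy as np

def trajectory_advantages(rewards, step_counts, eps=1e-8):
    """Broadcast group-normalized trajectory rewards to per-step advantages.

    rewards:     one scalar reward per trajectory (sparse, end-of-episode)
    step_counts: number of steps in each corresponding trajectory
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # Normalize within the group of rollouts for the same query.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Assign each trajectory's advantage uniformly to all of its steps.
    return [np.full(n, a) for a, n in zip(adv, step_counts)]

# Example: three rollouts of the same query with sparse success rewards.
per_step = trajectory_advantages(rewards=[1.0, 0.0, 1.0],
                                 step_counts=[12, 7, 20])
print([a[0] for a in per_step])  # one value repeated across each trajectory
```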

Training Paradigm and Data Management

GUI-Owl is initialized from Qwen2.5-VL and trained in three stages:

  1. Pre-training: Large-scale corpus for UI understanding and reasoning.
  2. Iterative Tuning: Real-world deployment, trajectory cleaning, and offline reasoning data synthesis.
  3. Reinforcement Learning: Asynchronous RL for direct environment interaction, focusing on execution consistency and success rate.

Figure 8: Training dynamics of GUI-Owl-7B on OSWorld-Verified, showing the impact of different data filtering and experience management strategies.

Empirical Results and Analysis

GUI-Owl-7B and GUI-Owl-32B achieve state-of-the-art results across grounding, comprehensive GUI understanding, and end-to-end agentic benchmarks. Notably:

  • AndroidWorld: GUI-Owl-7B scores 66.4, Mobile-Agent-v3 reaches 73.3.
  • OSWorld: GUI-Owl-7B scores 34.9 (RL-tuned), Mobile-Agent-v3 achieves 37.7.
  • MMBench-GUI-L2: GUI-Owl-7B scores 80.49, GUI-Owl-32B reaches 82.97, outperforming proprietary models including GPT-4o and Claude 3.7.

Performance scales with increased historical context and interaction-step budgets, indicating strong long-horizon reasoning and correction capabilities (Figure 9).

Figure 9: Performance of GUI-Owl-7B on OSWorld-Verified with varying historical images and step budgets.

Figure 10: Effect of reasoning data synthesis on AndroidWorld, demonstrating incremental gains from diverse reasoning sources.

Agentic Workflow and Case Study

The integrated workflow of Mobile-Agent-v3 is formalized as a cyclical process, with agents coordinating to decompose tasks, execute actions, reflect on outcomes, and persist critical information. The system demonstrates robust self-correction and adaptability in complex desktop and mobile scenarios (Figure 11).

Figure 11: A case of a complete Mobile-Agent-v3 operation process on a desktop platform, highlighting successful reflection and correction.

Implications and Future Directions

Mobile-Agent-v3 establishes a new paradigm for GUI automation, combining self-evolving data pipelines, modular multi-agent architectures, and scalable RL. The demonstrated generalization across platforms and benchmarks suggests strong potential for real-world deployment in productivity, accessibility, and autonomous digital assistants. The explicit reasoning and reflection mechanisms provide a foundation for further research into explainable agentic systems and robust long-horizon planning.

The trajectory-level RL and experience management strategies highlight the importance of data efficiency and dynamic adaptation in agentic training. Future work may explore more granular credit assignment, hierarchical planning, and integration with external tool-use and retrieval systems.

Conclusion

Mobile-Agent-v3 and GUI-Owl represent a significant advancement in foundational agents for GUI automation, achieving robust cross-platform performance and demonstrating effective integration of perception, reasoning, planning, and action. The modular multi-agent framework, self-evolving data production, and scalable RL infrastructure collectively enable state-of-the-art results and provide a blueprint for future agentic systems in dynamic digital environments.
