
Mobile-Agent-v3: Multi-Agent GUI Automation

Updated 23 August 2025
  • Mobile-Agent-v3 is a comprehensive multi-agent GUI automation framework that integrates cloud infrastructure, modular agents, and trajectory-aware reinforcement learning to achieve robust long-horizon task execution.
  • It raises the AndroidWorld score from 66.4 to 73.3 and the OSWorld score from 29.4 to 37.7 over the GUI-Owl baseline through trajectory-level policy optimization and multi-step validation.
  • Its modular architecture, featuring specialized agents for UI grounding, task planning, and collaborative reasoning, enables scalable self-improving loops and reliable automation across diverse platforms.

Mobile-Agent-v3 is a general-purpose, open-source multi-agent GUI automation framework that builds on and extends the capabilities of GUI-Owl, achieving state-of-the-art results on a spectrum of benchmark environments spanning mobile (AndroidWorld) and desktop (OSWorld) platforms. Its architecture fuses large-scale, cross-platform environment infrastructure, modular multi-agent designs, advanced UI-grounding and planning, and a scalable reinforcement learning mechanism optimized for trajectory-level policy refinement. Mobile-Agent-v3 not only addresses the challenges of reliable, long-horizon GUI automation but also establishes itself as a foundation for further research and deployment in GUI, digital assistant, and multimodal automation scenarios (Ye et al., 21 Aug 2025).

1. Performance and Benchmarking

Mobile-Agent-v3 attains leading performance metrics across benchmark suites for GUI automation, most notably:

Platform            Mobile-Agent-v3   GUI-Owl Baseline   OpenCUA Baseline
AndroidWorld        73.3              66.4               <66.4
OSWorld-Verified    37.7              29.4               Lower/NA
  • AndroidWorld, a mobile interaction benchmark, improved from 66.4 (GUI-Owl) to 73.3 (Mobile-Agent-v3), underlining robust mobile-specific gains.
  • OSWorld, targeting desktop environments, rose from 29.4 to 37.7, surpassing prior open-source models such as OpenCUA.
  • These gains reflect not only improved single-action decision-making but also heightened reliability in executing long-horizon, multi-step tasks.

The system incorporates step-level critics and trajectory-level multimodal critics, ensuring that each trajectory (sequence of screen states and actions) is rigorously validated, and the feedback loop drives both data cleaning and model improvement.
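A minimal sketch of this two-tier validation, assuming simple callable critics and an illustrative acceptance threshold (in the actual system the critics are multimodal models, and no threshold is specified in the source):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    screen: bytes       # screenshot before the action
    action: str         # e.g. "click(142, 880)" or "type('hello')"
    next_screen: bytes  # screenshot after the action

# Hypothetical critic callables; in practice these wrap multimodal models.
StepCritic = Callable[[Step], float]        # scores a single transition
TrajCritic = Callable[[List[Step]], float]  # scores the whole rollout

def validate_trajectory(steps: List[Step],
                        step_critic: StepCritic,
                        traj_critic: TrajCritic,
                        threshold: float = 0.5) -> bool:
    """Accept a trajectory only if every step and the overall
    rollout pass their respective critics (assumed threshold)."""
    if any(step_critic(s) < threshold for s in steps):
        return False                             # prune on any bad transition
    return traj_critic(steps) >= threshold       # then judge end-to-end success
```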

2. Large-Scale, Cross-Platform Environment Infrastructure

Mobile-Agent-v3 is underpinned by a cloud-based virtual infrastructure that spans Android, Ubuntu, macOS, and Windows.

  • Self-Evolving GUI Trajectory Production Framework: Synthetic and real interaction trajectories are generated using automated, high-quality query generation emulating realistic user instructions. Each action step is automatically validated—incorrect or suboptimal trajectory segments are flagged, pruned, and recycled to refine the data.
  • The iterative data production forms a self-improving loop: GUI-Owl bootstraps new data, which is then further validated and used for model retraining (a schematic of this loop is sketched after this list).
  • This continuous, closed-loop pipeline allows for scalable buildout of diverse, high-fidelity training corpora for both mobile and desktop workflows, dramatically reducing dependence on manual annotation.
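A schematic of that closed loop follows; every helper name (rollout, finetune, critic, make_queries) is an illustrative placeholder, not the released interface:

```python
def self_evolving_loop(model, env, make_queries, critic, rounds: int = 3):
    """Schematic of the closed data-production loop: the current model
    (e.g. GUI-Owl) bootstraps rollouts, critics filter them, and the
    surviving trajectories feed the next round of training."""
    dataset = []
    for _ in range(rounds):
        for query in make_queries():          # synthetic user instructions
            traj = model.rollout(env, query)  # interact with the virtual device
            if critic(traj):                  # flag and prune bad trajectories
                dataset.append(traj)
        model = model.finetune(dataset)       # retrain on validated data
    return model, dataset
```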

3. Modular Agent Capabilities and Multi-Agent System Design

Mobile-Agent-v3 systematically decomposes automation challenges into specialized, collaborative agent modules, serving both as unified standalone agents and as interoperable components in larger agentic systems:

  • UI Grounding: Robust localization of screen elements using segmentation and detection techniques.
  • Task Planning and Action Semantics: Extraction and generalization of procedural knowledge—breaking down high-level tasks into actionable subgoals, semantically mapping both primitive actions (click, drag, type) and their consequences (as seen in before/after state pairs).
  • Diverse Reasoning Patterns: The system leverages offline hint-guided rejection sampling, collaborative multi-agent distillation, and multimodal critics. This diversity allows the agent to reason adaptively, not only following fixed scripts but flexibly adjusting in response to real-time feedback.
  • Multi-Agent Roles: Distinct roles are implemented, including Manager (task decomposition), Worker (action execution), Reflector (critic/judgment), and Notetaker (context preservation).

This modular architecture supports composability—agents can interact, share context, and collaborate on complex, heterogeneous instruction sequences involving multiple GUIs or devices.
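A minimal sketch of how these four roles might be orchestrated in a control loop; the class and method names are illustrative assumptions, not the released API:

```python
class MobileAgentV3:
    """Illustrative orchestration of the four roles described above."""

    def __init__(self, manager, worker, reflector, notetaker):
        self.manager, self.worker = manager, worker
        self.reflector, self.notetaker = reflector, notetaker

    def run(self, instruction: str, env, max_steps: int = 30) -> bool:
        plan = self.manager.decompose(instruction)         # subgoal list
        notes = []                                         # shared context
        for _ in range(max_steps):
            subgoal = plan.current()
            action = self.worker.act(env.screenshot(), subgoal, notes)
            env.execute(action)
            verdict = self.reflector.judge(env.screenshot(), subgoal)
            notes = self.notetaker.update(notes, action, verdict)
            if verdict.success:
                if not plan.advance():                     # all subgoals done
                    return True
            else:
                plan = self.manager.replan(instruction, notes)
        return False
```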

4. Reinforcement Learning and Trajectory-Aware Policy Optimization

Mobile-Agent-v3 introduces a scalable, asynchronous reinforcement learning pipeline specifically adapted for GUI environments:

  • TRPO (Trajectory-aware Relative Policy Optimization) optimizes agent policies at the trajectory level rather than at individual steps:

    $$\hat{A}_\tau = \frac{R(\tau) - \bar{R}}{\sigma_R + \varepsilon}$$

    Here, $R(\tau)$ denotes the holistic reward for trajectory $\tau$, while $\bar{R}$ and $\sigma_R$ are its running mean and standard deviation.

  • Rewards are assigned globally per trajectory and then uniformly distributed across its constituent actions, a method designed to address the credit assignment problem endemic to long-horizon GUI tasks with sparse rewards.
  • The policy objective is given by the TRPO loss:

    $$L_{\mathrm{TRPO}} = -\frac{1}{N} \sum_{i=1}^{G} \sum_{s=1}^{S^i} \sum_{t=1}^{|o_{i,s}|} \min\left[ r_t(\theta)\,\hat{A}_{\tau_i},\ \mathrm{clip}\big(r_t(\theta),\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon\big)\,\hat{A}_{\tau_i} \right]$$

    where $r_t(\theta)$ is the ratio of new-to-old policy probabilities, effectively regularizing updates to ensure stable learning.

This approach stabilizes training and improves convergence, outperforming standard RL methods on long-horizon GUI tasks. For example, with TRPO, OSWorld scores reach 34.9 (surpassing prior baselines).
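A compact sketch of the two formulas above, assuming flat arrays of per-action log-probabilities and one holistic return per trajectory; this illustrates the math, not the paper's training code (averaging here is over actions, a common simplification of the per-trajectory normalization):

```python
import numpy as np

def trajectory_advantages(returns: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize holistic trajectory rewards R(tau) against batch
    statistics, per the advantage formula above."""
    return (returns - returns.mean()) / (returns.std() + eps)

def trpo_loss(logp_new: np.ndarray, logp_old: np.ndarray,
              traj_ids: np.ndarray, returns: np.ndarray,
              clip_eps: float = 0.2) -> float:
    """Clipped, trajectory-level policy loss. Every action inherits the
    advantage of its parent trajectory (uniform credit assignment).
    logp_* and traj_ids are flat over all logged actions; returns holds
    one entry per trajectory."""
    adv = trajectory_advantages(returns)[traj_ids]  # broadcast A_tau to steps
    ratio = np.exp(logp_new - logp_old)             # r_t(theta)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    return float(-np.mean(np.minimum(ratio * adv, clipped * adv)))

# Two trajectories (ids 0 and 1) with three logged actions in total:
loss = trpo_loss(np.array([-1.0, -0.9, -2.0]),  # new-policy log-probs
                 np.array([-1.1, -1.0, -1.8]),  # old-policy log-probs
                 np.array([0, 0, 1]),           # trajectory id per action
                 np.array([1.0, 0.0]))          # holistic return per trajectory
```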

5. Data Pipelines and Self-Improving Loops

The training regime continuously fuses human-like query generation, extensive environment rollouts, and automated judgment:

  • Query Generation: High-quality, task-diverse DAGs and GUI metadata are used to synthesize queries that mimic authentic human commands (a toy version is sketched after this list).
  • Automated Critic Modules: Both step-level and trajectory-level automated critics (leveraging multimodal models) assess and filter the generated trajectories for both correctness and diversity.
  • Feedback and Self-Evolution: Cleaned and validated data is reintroduced into the model training loop. This approach produces distributions of exemplars covering grounding, reasoning, and procedural semantics—foundational for robust generalization.
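As a toy illustration of DAG-driven query synthesis, the sketch below turns one path through a task graph into a compound instruction; the graph, phrase table, and composition rule are invented for the example:

```python
import random
from collections import defaultdict

def synthesize_query(dag_edges, phrases):
    """Turn one random root-to-leaf path through a task DAG into a
    human-sounding compound instruction."""
    children, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in dag_edges:
        children[u].append(v)
        indeg[v] += 1
        nodes |= {u, v}
    roots = [n for n in nodes if indeg[n] == 0]
    path, node = [], random.choice(roots)
    while node is not None:
        path.append(node)
        node = random.choice(children[node]) if children[node] else None
    return ", then ".join(phrases[n] for n in path)

# Prints: "open Settings, then enable Wi-Fi, then connect to 'Lab-5G'"
edges = [("open", "wifi"), ("wifi", "connect")]
phrases = {"open": "open Settings", "wifi": "enable Wi-Fi",
           "connect": "connect to 'Lab-5G'"}
print(synthesize_query(edges, phrases))
```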

6. Open Source Release and Community Impact

The Mobile-Agent-v3 system, including GUI-Owl models and trajectory data pipelines, is available as open source at https://github.com/X-PLUG/MobileAgent.

  • This release provides the research and practitioner community with a state-of-the-art multi-agent GUI automation system, facilitating both reproducibility and extensibility.
  • The framework is designed for application in automated digital assistants, wide-scale UI testing, workflow automation, and as a baseline for further multi-agent and reinforcement learning research in GUI contexts.

7. Significance and Future Directions

Mobile-Agent-v3 establishes new standards for open-source, multi-agent GUI automation with substantive empirical advances on challenging real-world benchmarks. It enables robust, generalizable, and efficient long-horizon interaction, integrating scalable cloud infrastructure, modular agent design, diverse data pipelines, and trajectory-aware reinforcement learning. The open-source contribution is poised to accelerate both academic inquiry and practical deployments, fostering further work in autonomous GUI interaction, digital task assistance, and the evolution of agent-based systems across heterogeneous device ecosystems (Ye et al., 21 Aug 2025).

References

  • Ye et al., "Mobile-Agent-v3," 21 August 2025.
