Mobile-Agent-v3.5 (GUI-Owl-1.5)

Updated 5 March 2026

Mobile-Agent-v3.5 is a state-of-the-art GUI agent that automates tasks across desktop, mobile, and browser environments using cloud-edge collaboration.
It features both Instruct and Thinking variants, offering a trade-off between low-latency tool calls and explicit chain-of-thought planning.
The model leverages a hybrid data flywheel and MRPO reinforcement learning, achieving competitive results on over 20 public GUI benchmarks.

Mobile-Agent-v3.5, also presented as GUI-Owl-1.5, is an open-source, multi-platform graphical user interface (GUI) agent model. It targets automation, grounding, tool use, memory, and knowledge reasoning in desktop, mobile, and browser environments through cloud-edge collaboration and real-time interaction. Built on the Qwen3-VL transformer backbone, GUI-Owl-1.5 comes in Instruct and Thinking variants and incorporates three technical components: a hybrid data flywheel pipeline for UI understanding and trajectory generation, a unified thought-synthesis reasoning pipeline, and a reinforcement learning algorithm (MRPO) for scaling across multi-platform environments. It reports leading results on more than 20 public GUI agent benchmarks. All models and an online demo are openly released (Xu et al., 15 Feb 2026).

1. Model Variants and Architectural Features

GUI-Owl-1.5 inherits its backbone from Qwen3-VL and is released at several parameter scales: 2B, 4B, 8B, and 32B (where “B” denotes billions of parameters), as well as a 235B-A22B variant. Each size is available as both an “Instruct” and a “Thinking” model:

Instruct variants emit a concise action conclusion ( $C_t$ ) and tool call ( $A_t$ ). No intermediate “thought” tokens are output, which reduces latency and context length; both properties are desirable for edge deployment.
Thinking variants interleave a chain-of-thought (CoT) segment ( $T_t$ ) prior to $C_t$ , enabling explicit planning, reflection, and error correction, typically yielding superior long-horizon performance at higher computational cost.

All models use standard transformer layers: for input $X \in \mathbb{R}^{N \times d_{\text{model}}}$ , attention operates as $Q = XW^Q$ , $K = XW^K$ , $V = XW^V$ , with $d_k = d_{\text{model}} / H$ . Multi-head self-attention (MHSA) and feed-forward network (FFN) follow conventional formulations. Layer normalization, residual connections, and FFN expansion by a factor of 2 are applied as in baseline transformers. The specific layer counts and hidden dimensions follow Qwen3-VL defaults, though these are not published per scale (Xu et al., 15 Feb 2026).
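For reference, the conventional attention computation implied by these projections can be written out per head. This is the standard transformer formulation rather than a detail specific to GUI-Owl-1.5; the head index $h$ and the output projection $W^O$ are notation assumed here.

$$
\text{head}_h = \operatorname{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}}\right) V_h,
\qquad Q_h = X W^Q_h,\quad K_h = X W^K_h,\quad V_h = X W^V_h
$$

$$
\operatorname{MHSA}(X) = \operatorname{Concat}(\text{head}_1, \dots, \text{head}_H)\, W^O,
\qquad d_k = d_{\text{model}} / H
$$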

Thinking variants reserve additional input tokens to accommodate CoT traces. Instruct variants compress context windows by dropping thoughts, thereby optimizing for execution latency.
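The difference between the two output formats can be illustrated with a minimal sketch. The field names and the tag-based serialization below are hypothetical and are not the released output schema; the only point is that an Instruct step omits $T_t$ while a Thinking step carries it ahead of $C_t$ and $A_t$.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepOutput:
    """One agent step; field names are illustrative, not the released schema."""
    conclusion: str                 # C_t: concise action conclusion
    action: str                     # A_t: tool call, kept as a plain string here
    thought: Optional[str] = None   # T_t: present only for Thinking variants


def render(step: StepOutput) -> str:
    """Serialize a step; the tag names are hypothetical placeholders."""
    parts = []
    if step.thought is not None:
        parts.append(f"<thought>{step.thought}</thought>")
    parts.append(f"<conclusion>{step.conclusion}</conclusion>")
    parts.append(f"<action>{step.action}</action>")
    return "\n".join(parts)


instruct = StepOutput(conclusion="Open Settings.", action="click(x=120, y=640)")
thinking = StepOutput(
    thought="The Wi-Fi toggle lives under Settings, so open Settings first.",
    conclusion="Open Settings.",
    action="click(x=120, y=640)",
)
```

Rendering both steps shows the shorter Instruct output, which is the property the text associates with lower latency and a smaller context footprint.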

2. Hybrid Data Flywheel for Trajectory and Grounding

Data generation for GUI-Owl-1.5 follows a “hybrid data flywheel” approach, combining simulated and real environment rollouts to maximize scale and fidelity:

Simulated (virtual) environments: Web-rendered editors and custom DAG sandboxes yield high-throughput, fine-grained data free of device noise.
Cloud-based real device sandboxes: Physical mobile devices, desktop virtual machines, and browsers capture complex GUIs with real-world anomalies such as CAPTCHAs and unpredictable pop-ups.

For grounding, the strategies described include:

Hard Grounding Synthesis: LLM-guided UI layout samplers create professional-app screenshots; multi-window, high-resolution scenarios are synthesized by constraining single-window pools.
Scalable Extension: Critic models mine high-fidelity pairs from real rollouts; tutorials are parsed to extract question-answer pairs; infeasible queries are generated by random pairing, filtered via multi-model consensus.
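The random-pairing and consensus step for infeasible queries can be sketched as follows. This is an illustration under stated assumptions: the judge callables stand in for the multiple models, and `min_votes` is a hypothetical threshold that the source does not specify.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A judge answers: "is this query infeasible on this screenshot?"
Judge = Callable[[str, str], bool]


def random_pairs(screens: Sequence[str], queries: Sequence[str],
                 n: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Pair queries with randomly chosen screenshots to propose infeasible cases."""
    rng = random.Random(seed)
    return [(rng.choice(screens), rng.choice(queries)) for _ in range(n)]


def consensus_filter(pairs: Sequence[Tuple[str, str]],
                     judges: Sequence[Judge],
                     min_votes: int) -> List[Tuple[str, str]]:
    """Keep a pair only if at least `min_votes` judges call it infeasible."""
    kept = []
    for screen, query in pairs:
        votes = sum(1 for judge in judges if judge(screen, query))
        if votes >= min_votes:
            kept.append((screen, query))
    return kept


# Toy judges (stand-ins for real models): a query is infeasible when the app
# named before the colon does not appear in the screenshot identifier.
def app_mismatch(screen: str, query: str) -> bool:
    return query.split(":")[0] not in screen


judges = [app_mismatch, app_mismatch, lambda screen, query: False]
```

With two of three toy judges agreeing, a mail query paired with a calculator screen is kept as infeasible, while the same query paired with a mail screen is dropped.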

Trajectory data is synthesized via DAG-based tasks. Automated device rollouts are truncated upon failure to complete a subtask, after which subtasks are repaired and the trajectory re-enqueued. Virtual rollouts employ simulation predicates for prefix cleaning. RPA scripts are used where canonical task policies exist.
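A minimal sketch of the truncate, repair, and re-enqueue loop is given below. It assumes a task is a DAG mapping each subtask to its prerequisites, and it treats `run` and `repair` as hypothetical callables supplied by the caller; the retry limit is likewise an assumption, not a reported setting.

```python
from collections import deque
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Optional, Set, Tuple


def rollout(dag: Dict[str, Set[str]],
            run: Callable[[str], bool]) -> Tuple[List[str], Optional[str]]:
    """Execute subtasks in dependency order and stop at the first failure.

    Returns the successful prefix and the failed subtask (None if all passed).
    """
    prefix: List[str] = []
    for subtask in TopologicalSorter(dag).static_order():
        if not run(subtask):
            return prefix, subtask      # truncate the trajectory here
        prefix.append(subtask)
    return prefix, None


def collect(dag: Dict[str, Set[str]],
            run: Callable[[str], bool],
            repair: Callable[[str], None],
            max_attempts: int = 3) -> Optional[List[str]]:
    """Re-enqueue a task after repairing whichever subtask failed."""
    queue = deque([(dag, 0)])
    while queue:
        task, attempt = queue.popleft()
        prefix, failed = rollout(task, run)
        if failed is None:
            return prefix               # complete trajectory
        if attempt + 1 < max_attempts:
            repair(failed)              # hypothetical repair hook
            queue.append((task, attempt + 1))
    return None                         # attempts exhausted


# Toy task: a three-step chain in which "search" is initially broken.
dag = {"open_app": set(), "search": {"open_app"}, "checkout": {"search"}}
broken = {"search"}
trajectory = collect(dag, lambda s: s not in broken, broken.discard)
```

In the toy run, the first attempt is cut after `open_app`, the failing subtask is repaired, and the second attempt yields the full three-step trajectory.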

Dataset construction is iterative: at each round, current data $D_t$ seeds new synthetic rollouts, which are filtered and merged into the dataset until convergence or resource exhaustion (Xu et al., 15 Feb 2026).
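The iteration can be sketched as a loop that stops when a round contributes nothing new or when a round budget is spent. The `generate` and `keep` callables are hypothetical stand-ins for rollout synthesis and filtering, and integers stand in for data items purely for illustration.

```python
from typing import Callable, Iterable, Set


def flywheel(seed: Set[int],
             generate: Callable[[Set[int]], Iterable[int]],
             keep: Callable[[int], bool],
             max_rounds: int) -> Set[int]:
    """Grow a dataset until a round adds nothing new or the budget runs out."""
    data = set(seed)
    for _ in range(max_rounds):
        candidates = generate(data)                        # rollouts seeded by D_t
        accepted = {c for c in candidates if keep(c)} - data
        if not accepted:                                   # convergence
            break
        data |= accepted                                   # D_{t+1} = D_t plus accepted items
    return data


# Toy example: each round proposes successors of existing items,
# and the filter rejects anything above 5.
converged = flywheel({0}, lambda d: {x + 1 for x in d}, lambda x: x <= 5, max_rounds=10)
budgeted = flywheel({0}, lambda d: {x + 1 for x in d}, lambda x: x <= 5, max_rounds=2)
```

The first call ends by convergence and the second by resource exhaustion, matching the two stopping conditions named above.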

3. Unified Thought-Synthesis Reasoning Pipeline

The model’s supervision employs a unified stepwise thought-synthesis, embedding four reasoning components at each trajectory step $t$:

Observation: A VLM generates a description of the initial screen.
Memory update: Relevant entities and information are extracted, updating episodic and semantic memory.
Reflection: A VLM compares before/after screenshots and actions, then detects progress or corrects errors.
Chain-of-Thought and Conclusion ( $T_t$ , $C_t$ ): An LLM integrates the user query, observations, memory, and reflection before emitting the CoT and the final action plan.

The pipeline performs memory, planning, and verification at every step, with intermediate thought tokens explicitly supervised where relevant.
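One way to picture the per-step flow is the sketch below. It is not the authors' pseudocode: the `models` dictionary, its role names, and the `StepRecord` fields are all hypothetical, and the toy lambdas merely stand in for VLM and LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepRecord:
    observation: str
    memory: List[str]
    reflection: str
    thought: str      # T_t
    conclusion: str   # C_t


def synthesize_step(query: str, screen_before: str, screen_after: str,
                    action: str, memory: List[str],
                    models: Dict[str, Callable]) -> StepRecord:
    """Annotate one trajectory step; `models` maps role names to callables."""
    observation = models["observe"](screen_before)
    memory = memory + models["extract"](observation)   # new list, input not mutated
    reflection = models["reflect"](screen_before, action, screen_after)
    thought, conclusion = models["reason"](query, observation, memory, reflection)
    return StepRecord(observation, memory, reflection, thought, conclusion)


# Toy stand-ins for the VLM/LLM calls.
models = {
    "observe": lambda screen: f"screen shows {screen}",
    "extract": lambda obs: [obs.split()[-1]],
    "reflect": lambda before, act, after: "progress" if before != after else "no change",
    "reason": lambda q, o, m, r: (f"{r}; remembered {m}", f"next step toward: {q}"),
}
record = synthesize_step("enable Wi-Fi", "home", "settings", "click(Settings)", [], models)
```

Each call produces one supervised record, so the thought and conclusion fields can be kept or dropped depending on whether a Thinking or Instruct target is being built.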

Multi-agent simulation is structured via four functional roles: a manager plans subgoals, a worker executes, a reflector verifies task transitions, and a notetaker records memory. The agent advances the system state iteratively until all subgoals are exhausted.
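The four-role loop can be sketched as follows, assuming each role is a plain callable. The signatures, the string-valued state, and the step limit are illustrative assumptions rather than details from the source.

```python
from typing import Callable, List, Tuple


def run_episode(goal: str,
                manager: Callable[[str], List[str]],
                worker: Callable[[str, str], str],
                reflector: Callable[[str, str, str], bool],
                notetaker: Callable[[str, str], str],
                max_steps: int = 20) -> Tuple[str, List[str], bool]:
    """Advance the state until all subgoals are exhausted or steps run out."""
    subgoals = list(manager(goal))                        # manager: plan subgoals
    state, notes, steps = "start", [], 0
    while subgoals and steps < max_steps:
        subgoal = subgoals[0]
        new_state = worker(state, subgoal)                # worker: execute
        if reflector(state, new_state, subgoal):          # reflector: verify transition
            notes.append(notetaker(new_state, subgoal))   # notetaker: record memory
            subgoals.pop(0)
        state = new_state                                 # environment moved either way
        steps += 1
    return state, notes, not subgoals


# Toy roles for illustration only.
final_state, notes, done = run_episode(
    "open mail then send reply",
    manager=lambda g: g.split(" then "),
    worker=lambda s, sg: f"{s}>{sg}",
    reflector=lambda old, new, sg: new.endswith(sg),
    notetaker=lambda s, sg: f"done: {sg}",
)
```

A subgoal is only popped once the reflector accepts the transition, so a rejected step is retried from the new state on the next iteration.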

4. Multi-Platform Reinforcement Learning: MRPO Algorithm

The multi-platform RL setup employs MRPO (Multi-platform RL with Policy Optimization), a unified approach for diverse device and environment targets.

Unified Policy: A single policy is shared across platforms and conditioned on the device type (mobile, desktop, web).
Grouped Rollouts & Outcome-Collapse Mitigation: Tasks are executed in pools with subsampling to ensure policy diversity. If the outcomes within a group are degenerate (collapse), the pool is rebalanced or discarded, preserving on-policy expectations.
Token-ID Transport: Training logs probabilities over tokenized sequences corresponding to trajectories.
Alternating Multi-Platform Optimization: Stages cycle through the target devices, with gradients computed against environment-specific rewards.
Surrogate Objective: For each rollout group, the RL objective maximizes a reward-weighted log-likelihood sum, subject in principle to trust-region constraints (not implemented in practice).
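A minimal numeric sketch of a group-level, reward-weighted log-likelihood with a collapse check is shown below. It is an assumption-laden illustration, not the MRPO implementation: in particular, the mean-centred weighting is a choice made for this sketch, and the source does not state the exact weighting.

```python
from typing import Optional, Sequence


def is_collapsed(rewards: Sequence[float], tol: float = 1e-8) -> bool:
    """A group is degenerate when every rollout received the same outcome."""
    return max(rewards) - min(rewards) <= tol


def group_surrogate(rewards: Sequence[float],
                    token_logprobs: Sequence[Sequence[float]]) -> Optional[float]:
    """Reward-weighted log-likelihood for one rollout group.

    rewards[i] is the scalar outcome of rollout i; token_logprobs[i] holds the
    log-probabilities of that rollout's tokens under the current policy.
    ASSUMPTION: weights are mean-centred rewards. Returns None for a collapsed
    group, which the caller would discard or rebalance.
    """
    if is_collapsed(rewards):
        return None
    baseline = sum(rewards) / len(rewards)
    return sum((r - baseline) * sum(lp) for r, lp in zip(rewards, token_logprobs))


mixed = group_surrogate([1.0, 0.0], [[-0.2, -0.3], [-1.0]])
degenerate = group_surrogate([1.0, 1.0], [[-0.1], [-0.2]])
```

Under mean-centring, a group whose rollouts all succeed or all fail has zero weight on every trajectory, which is one way to see why collapsed groups carry no learning signal and need to be handled separately.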

5. Training Paradigm and Evaluation Protocols

Model training proceeds in three stages:

Pre-training: Mixture of UI recognition, world modeling, QA/VQA, and tool invocation data.
Supervised Fine-Tuning: Combines task trajectory data, CoT, grounding, tool call, and browser interactions.
MRPO Reinforcement Learning: Uses rollout groups with oversampling and alternates devices per batch for multi-platform alignment.
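Per-batch device alternation can be sketched with a simple round-robin schedule. The task names, the device order, and the batch size are placeholders chosen for illustration; the source does not specify how tasks are drawn within a device.

```python
from itertools import cycle, islice
from typing import Dict, Iterator, List, Sequence, Tuple


def alternating_batches(tasks_by_device: Dict[str, List[str]],
                        order: Sequence[str],
                        batch_size: int,
                        num_batches: int) -> Iterator[Tuple[str, List[str]]]:
    """Yield (device, batch) pairs, switching device on every batch."""
    # One endless cursor per device, so each device resumes where it left off.
    cursors = {device: cycle(tasks_by_device[device]) for device in order}
    for device in islice(cycle(order), num_batches):
        yield device, list(islice(cursors[device], batch_size))


tasks = {"mobile": ["m1", "m2", "m3"], "desktop": ["d1", "d2"], "web": ["w1"]}
schedule = list(alternating_batches(tasks, ["mobile", "desktop", "web"], 2, 4))
```

Every batch is drawn from a single platform, in contrast to naive mixing where one batch would blend tasks from all three.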

Reported hyperparameters include learning rates (1e−5 for SFT, 5e−6 for RL), batch size (64), and a per-episode rollout horizon (in steps) for RL.

Evaluations span a comprehensive suite:

Desktop: OSWorld, WindowsAA, OSWorld-MCP
Mobile: AndroidWorld, MobileWorld, MMGUI-Bench
Browser: WebArena, VisualWebArena, WebVoyager, Online-Mind2Web

Benchmarks cover automation (OSWorld-Verified, AndroidWorld, VisualWebArena, etc.), grounding (ScreenSpot-Pro, MMBench-GUI-L2), tool-calling (OSWorld-MCP, MobileWorld), memory (MemGUI-Bench), and knowledge (GUI Knowledge Benchmark). Metrics: end-to-end success rate, grounding accuracy, and knowledge QA accuracy.
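The two most common metrics can be computed as in the sketch below. The point-in-box rule for grounding is a widely used scoring convention that is assumed here; the source does not spell out each benchmark's exact criterion.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]   # (left, top, right, bottom)


def success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of episodes that ended in success."""
    return 100.0 * sum(bool(o) for o in outcomes) / len(outcomes)


def point_in_box(point: Point, box: Box) -> bool:
    """True if the predicted (x, y) lies inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predictions that land inside their target box."""
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, boxes))
    return 100.0 * hits / len(boxes)
```

For example, three successes in four episodes give a success rate of 75.0, and one hit in two grounding queries gives an accuracy of 50.0.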

6. Empirical Results, Ablations, and Comparative Analysis

GUI-Owl-1.5 demonstrates state-of-the-art performance across standard benchmarks.

Automation Success Rates (%, selected results; “-” means not reported):

Model	OSWorld	AndroidWorld	OSWorld-MCP	MobileWorld	WebArena
GUI-Owl-1.5-32B-Inst	56.5	69.8	47.6	46.8	-
GUI-Owl-1.5-8B-Think	52.9	71.6	38.8	33.3	46.7
UI-TARS-2	53.1	73.3	-	-	-

Grounding Accuracy:

ScreenSpot-Pro (two-stage zoom-in): GUI-Owl-1.5-32B-Inst achieves 80.3%
MMBench-GUI-L2: GUI-Owl-1.5-32B-Inst achieves 86.8% (for comparison, MAI-UI-32B scores 91.3%)
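The source names a two-stage zoom-in procedure for ScreenSpot-Pro without detailing it. A generic version of the idea, sketched under that caveat, is to predict a coarse point on the full screenshot, crop a window around it, predict again inside the crop, and map the result back to full-image coordinates. The `predict` callable and the crop size are hypothetical, and the sketch assumes the crop is not rescaled and is no larger than the image.

```python
from typing import Callable, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]   # (left, top, right, bottom)


def zoom_box(center: Point, image_size: Tuple[int, int],
             crop_size: Tuple[int, int]) -> Box:
    """Crop window of crop_size centred on `center`, clamped to the image."""
    (cx, cy), (w, h), (cw, ch) = center, image_size, crop_size
    left = min(max(cx - cw // 2, 0), w - cw)
    top = min(max(cy - ch // 2, 0), h - ch)
    return left, top, left + cw, top + ch


def to_full_coords(point_in_crop: Point, crop_box: Box) -> Point:
    """Map a point predicted inside the crop back to full-image pixels."""
    left, top, _, _ = crop_box
    return point_in_crop[0] + left, point_in_crop[1] + top


def two_stage_ground(predict: Callable[[Box], Point],
                     image_size: Tuple[int, int],
                     crop_size: Tuple[int, int]) -> Point:
    """`predict(box)` is a hypothetical model call returning (x, y) relative to box."""
    coarse = predict((0, 0, *image_size))          # stage 1: whole screenshot
    box = zoom_box(coarse, image_size, crop_size)
    return to_full_coords(predict(box), box)       # stage 2: refined inside the crop
```

If the crop were upscaled before the second prediction, the refined point would also need to be divided by the scale factor before the offset is added.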

Knowledge and Memory:

GUI Knowledge Benchmark: GUI-Owl-1.5-32B-Inst, 75.45%
MemGUI-Bench (easy tasks): GUI-Owl-1.5-32B achieves 27.1%, compared to 14.6–22.9% for smaller models or prior versions

Ablation Studies:

Removing virtual environments reduces PC/mobile evaluation scores from 75.4/86.7% to 42.0/50.0%.
Omitting unified thought-synthesis cuts OSWorld and AndroidWorld performance by several points.
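The size of the virtual-environment ablation is easier to judge as absolute and relative drops. The short calculation below uses only the four figures quoted above; the helper name is illustrative.

```python
from typing import Dict, Tuple

full = {"PC": 75.4, "mobile": 86.7}         # with virtual environments
no_virtual = {"PC": 42.0, "mobile": 50.0}   # virtual environments removed


def drop(full_score: float, ablated_score: float) -> Tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent), rounded to 0.1."""
    absolute = round(full_score - ablated_score, 1)
    relative = round(100.0 * (full_score - ablated_score) / full_score, 1)
    return absolute, relative


drops: Dict[str, Tuple[float, float]] = {k: drop(full[k], no_virtual[k]) for k in full}
# drops == {"PC": (33.4, 44.3), "mobile": (36.7, 42.3)}
```

In other words, removing virtual environments costs 33.4 points on PC and 36.7 points on mobile, which is more than 40% of the full score in both cases.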

Multi-platform RL conducted with interleaved device training yields more stable gains than naïve mixing, and task-focused RL converges faster on unstable PC tasks.

7. Significance and Open Research Directions

Mobile-Agent-v3.5 (GUI-Owl-1.5) establishes a new performance baseline for autonomous multi-platform GUI agents, with efficacy demonstrated across more than 20 open benchmarks. Its design draws robustness from hybrid synthetic/real environment data, explicit multi-step reasoning, and reinforcement learning tuned to the challenges of interleaved GUI tasks and environments. The split between Instruct and Thinking variants supports both deployment efficiency and high-level planning. Together, the data flywheel, thought-synthesis, and MRPO components address key challenges in grounding, memory, and policy generalization.

The public release, including model weights and a cloud-sandbox demo, provides a foundation for further study into generalist GUI agent architectures and real-world interactive learning (Xu et al., 15 Feb 2026).

References (1)

Xu et al. (15 Feb 2026). Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents.
