Mobile-Agent-v2: Multi-Agent UI Automation
- Mobile-Agent-v2 is a multi-agent framework that decomposes mobile UI tasks across specialized agents for enhanced scalability and performance.
- It employs a modular design with dedicated Planning, Decision, and Reflection agents to manage task progress, focus content, and error correction.
- The architecture achieves over 30% task completion improvement compared to single-agent systems by integrating hierarchical planning, efficient memory, and multi-modal perception.
Mobile-Agent-v2 is a multi-agent framework for mobile device operation assistance, designed to overcome the navigation and scalability challenges that limit the performance of existing single-agent architectures in multimodal user interface automation. This architecture decomposes high-level mobile tasks—often involving long historical sequences of screenshots and actions—into specialized agent roles, each optimized for progress tracking, content focus retention, and robust error correction. Mobile-Agent-v2 demonstrates over 30% absolute improvement in task completion compared to monolithic single-agent baselines. It integrates hierarchical planning, efficient focus memory, multi-modal perception, and reflection-based correction, laying a foundation for scalable, accurate, and efficient mobile interaction agents (Wang et al., 2024, Zhang et al., 30 Aug 2025, Jiang et al., 24 Oct 2025).
1. Problem Definition and Motivation
Mobile-Agent-v2 addresses two core navigation challenges encountered in mobile device operation tasks using multimodal LLMs (MLLMs):
- Task-progress navigation: Single-agent models struggle to manage and reason over long interleaved histories of screenshots and UI operations, making it difficult to infer completed subtasks and plan the remainder of the workflow.
- Focus-content navigation: Critical content (e.g., scores, text fields) encountered in earlier screens is difficult to retrieve efficiently, due to memory constraints and sequential processing inefficiencies of single-agent, end-to-end MLLMs.
Let $s_{1:t} = (s_1, \ldots, s_t)$ denote the sequence of screenshots and $P_t$ the multi-modal perception outputs at step $t$. A single-agent model must generate each operation from the full interleaved history:

$$o_t = f\big(I,\ s_1, o_1, \ldots, s_{t-1}, o_{t-1},\ s_t, P_t\big),$$

where $I$ is the user instruction. As $t$ grows, the input token sequence becomes prohibitively long, reducing throughput and task-reasoning accuracy. Mobile-Agent-v2 is designed to factorize this process into compact, communicative stages that encapsulate subtask tracking and focused memory, drastically improving operational scalability (Wang et al., 2024).
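The scalability argument can be made concrete with a back-of-the-envelope sketch. The per-item token counts below are hypothetical round numbers, not measurements from the paper; the point is the linear-versus-constant growth:

```python
# Illustrative token-budget comparison: a monolithic single-agent prompt grows
# linearly with step count t, while the factorized (TP/FC summary) context
# stays bounded. All per-item costs are hypothetical round numbers.

TOKENS_PER_SCREENSHOT = 1000   # hypothetical cost of one encoded screenshot
TOKENS_PER_OPERATION = 20      # hypothetical cost of one logged operation
TOKENS_TP_SUMMARY = 150        # hypothetical task-progress (TP) summary
TOKENS_FC_MEMORY = 100         # hypothetical focus-content (FC) memory

def monolithic_context(t: int) -> int:
    """Single agent: full interleaved history of screenshots and operations."""
    return t * (TOKENS_PER_SCREENSHOT + TOKENS_PER_OPERATION)

def factorized_context(t: int) -> int:
    """Mobile-Agent-v2 style: current screenshot plus compact TP/FC summaries."""
    return TOKENS_PER_SCREENSHOT + TOKENS_TP_SUMMARY + TOKENS_FC_MEMORY

for t in (1, 10, 50):
    print(t, monolithic_context(t), factorized_context(t))
```

At 50 steps the monolithic prompt is roughly 40× larger than the factorized one under these assumptions, which is the effect the multi-agent decomposition targets.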
2. Multi-Agent System Architecture
Mobile-Agent-v2 comprises three specialized, iterative agents, structured in a pipeline:
- Planning Agent (PA): Summarizes completed subtasks and maintains a concise "task progress" (TP) summary; formalized as $TP_t = \mathrm{PA}(TP_{t-1}, o_{t-1}, FC_{t-1})$, where $FC_{t-1}$ is the focus memory from the preceding step.
- Decision Agent (DA): Receives the current goal, task progress, focus content, and perception outputs; reasons over the current state with multi-modal context to select the next operation $o_t$ and update the focus content $FC_t$. The operation space is discretized (e.g., Open app, Tap, Swipe, Type, Home, Stop).
- Reflection Agent (RA): Receives operation outcomes, compares pre- and post-action UI states ($s_t^{\text{pre}}$, $s_t^{\text{post}}$), and categorizes the result as
$$r_t = \mathrm{RA}\big(s_t^{\text{pre}}, s_t^{\text{post}}, o_t\big) \in \{\text{Correct}, \text{Ineffective}, \text{Erroneous}\}.$$
Based on $r_t$, the system records, reverts, or retries the decision step as necessary.
Visual Perception Integration
Mobile-Agent-v2 uses a Visual Perception Module (VPM) incorporating OCR (ConvNextViT-document), icon detection (GroundingDINO), and multimodal description (Qwen-VL) to distill $P_t$ from the raw screenshot $s_t$, presenting structured UI information to downstream agents (Wang et al., 2024, Zhang et al., 30 Aug 2025).
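The VPM's fan-out structure can be sketched as below. The three backend functions are hypothetical stubs with canned return values; in the real module they wrap the ConvNextViT-document OCR model, GroundingDINO detection, and Qwen-VL captioning:

```python
# Hypothetical stubs standing in for the three perception backends.
def run_ocr(screenshot: bytes) -> list[dict]:
    return [{"text": "Sign in", "box": (40, 100, 200, 140)}]

def detect_icons(screenshot: bytes) -> list[dict]:
    return [{"label": "settings_icon", "box": (300, 20, 340, 60)}]

def describe_screen(screenshot: bytes) -> str:
    return "A login screen with a text field and a sign-in button."

def perceive(screenshot: bytes) -> dict:
    """Distill structured UI information P_t from a raw screenshot s_t."""
    return {
        "ocr": run_ocr(screenshot),
        "icons": detect_icons(screenshot),
        "caption": describe_screen(screenshot),
    }
```

The structured dictionary, rather than raw pixels, is what the Decision Agent reasons over, which keeps the downstream prompts compact.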
3. Focus Content Memory and Progress Encoding
A short-term textual memory unit maintains "focus content" ($FC_t$), storing semantically critical state information (e.g., recognized labels, scores, highlighted items) required for later subtasks. Only the Decision Agent updates $FC_t$ based on the current perception:

$$FC_t = \mathrm{DA}\big(FC_{t-1}, P_t\big).$$
The memory is kept compact and is designed to be queryable by both the Planning and Decision agents. Summaries are maintained as natural-language lists of relevant items, with updates enforced only when new or changed content is detected on the screen.
Task progress (TP) is tracked by the Planning Agent and captured as concise free-text summaries, which are then fed back into the operating loop to shrink the context required for downstream reasoning.
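The update-only-on-change rule for the focus memory can be sketched as follows; the string-item granularity and the size cap are assumptions for illustration:

```python
def update_focus_content(fc: list[str], perceived_items: list[str],
                         max_items: int = 10) -> list[str]:
    """Append only items not already in the focus memory, keeping it compact.

    Items already present are left untouched, so no update occurs when the
    screen shows nothing new; the oldest entries are dropped past max_items.
    """
    updated = list(fc)
    for item in perceived_items:
        if item not in updated:
            updated.append(item)
    return updated[-max_items:]
```

Keeping the memory as a short natural-language list (rather than the full screenshot history) is what lets both PA and DA query it cheaply at every step.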
4. Reflection, Error Correction, and Execution Loop
After each operation execution, the Reflection Agent assesses the impact by comparing $s_t^{\text{pre}}$ and $s_t^{\text{post}}$ and, if the effect is Correct, commits the operation to the history and advances the pipeline. If Ineffective, no action is recorded and a retry is triggered. If Erroneous, the device state is reverted before issuing a new action proposal.
This process ensures faulty actions are pruned early, task drift is minimized, and the agent avoids infinite loops or regression in progress (Wang et al., 2024).
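The three-way outcome handling reads as the following sketch. Comparing screens by string equality and the notion of an "expected" screen are simplifying assumptions; the real RA is an MLLM judgment over the pre/post screenshots:

```python
from enum import Enum

class Outcome(Enum):
    CORRECT = "Correct"
    INEFFECTIVE = "Ineffective"
    ERRONEOUS = "Erroneous"

def reflect(pre_screen: str, post_screen: str, expected_screen: str) -> Outcome:
    """Toy reflection: classify the effect of the last operation."""
    if post_screen == pre_screen:
        return Outcome.INEFFECTIVE   # nothing changed on screen
    if post_screen == expected_screen:
        return Outcome.CORRECT       # intended transition happened
    return Outcome.ERRONEOUS         # wrong transition

def handle(outcome: Outcome, history: list[str], op: str) -> str:
    """Commit, retry, or revert-and-retry based on the reflection verdict."""
    if outcome is Outcome.CORRECT:
        history.append(op)           # record and advance
        return "advance"
    if outcome is Outcome.INEFFECTIVE:
        return "retry"               # nothing recorded
    return "revert_and_retry"        # undo device state first
```

Because Ineffective and Erroneous operations never enter the history, later PA summaries are built only from verified progress.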
5. Training Paradigm and Data Pipeline
The most effective Mobile-Agent-v2 instantiations employ modular MLLMs (e.g., the MobiMind series in "MobiAgent"), each tailored for planning, micro-level reasoning, or grounding UI targets. The principal pipeline includes:
- Supervised Fine-Tuning (SFT): Standard likelihood-based training on annotated action–reason trajectories, enforcing JSON-structured outputs and strict typing.
- Curriculum Reinforcement (GRPO): Specialized RL algorithms with reward signals for accurate action prediction, bounding-box localization (IoU-based), and content matching. For click actions, the reward combines bounding-box IoU with a center-proximity term.
- Self-Evolution: Failed execution traces are corrected and reincorporated for subsequent fine-tuning or RL stages.
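One plausible form of the click reward is sketched below. The blend weight `alpha`, the exponential center-distance term, and the `scale` constant are assumptions for illustration; the exact weighting used in the GRPO stage is not reproduced here:

```python
import math

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def center(b: Box) -> tuple[float, float]:
    return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

def click_reward(pred: Box, gt: Box, alpha: float = 0.5,
                 scale: float = 100.0) -> float:
    """Hypothetical blend of box overlap and center proximity (in [0, 1])."""
    (px, py), (gx, gy) = center(pred), center(gt)
    dist = math.hypot(px - gx, py - gy)
    return alpha * iou(pred, gt) + (1 - alpha) * math.exp(-dist / scale)
```

A reward of this shape stays informative even when predicted and ground-truth boxes do not overlap, since the center term still provides gradient signal.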
Data curation uses a multi-stage, AI-assisted pipeline with real annotators, VLM-driven reasoning annotation, error filtering, and task complexity augmentation. Compared to full manual annotation, this approach yields a 5–10× reduction in cost (Zhang et al., 30 Aug 2025).
6. Evaluation, Performance, and Ablation
Mobile-Agent-v2 is evaluated on a diverse suite of real-world mobile benchmarks, including device variety (HarmonyOS, Android), system/external apps, and both single- and multi-app workflows.
- Success Rate (SR): The fraction of instructions/tasks fully completed. Across all categories, v2 yields at least a 30% absolute improvement over the single-agent Mobile-Agent baseline.
- Completion Rate (CR): Fraction of correct steps versus ground-truth steps; v2 achieves up to 100% with "knowledge injection."
- Ablation findings:
- Removing the Planning Agent: SR falls from ~90% to ~59% (basic) and ~29% (advanced).
- Removing the Reflection Agent: ~10% drop in CR and decision accuracy.
- Omitting focus memory: multi-app SR falls by ~40%.
- These results underscore the critical role of explicit context decomposition, focused memory, and immediate reflection/error correction (Wang et al., 2024, Zhang et al., 30 Aug 2025).
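Both metrics as defined above are simple ratios; the dictionary field names below are illustrative:

```python
def success_rate(completed: list[bool]) -> float:
    """SR: fraction of tasks/instructions fully completed."""
    return sum(completed) / len(completed)

def completion_rate(correct_steps: int, ground_truth_steps: int) -> float:
    """CR: fraction of correct steps relative to the ground-truth trajectory."""
    return correct_steps / ground_truth_steps
```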
7. Acceleration, Device-Cloud Cooperation, and Future Directions
Recent system frameworks (e.g., AgentRR in "MobiAgent," LightAgent's device–cloud scheduler) further accelerate v2 systems:
- ActTree & AgentRR: Online caching of actions and UI transitions (ActTree) enables high rates of cache replay (30–85%, depending on task distribution) and 2×–3× speedup in end-to-end latency by reusing prior computation (Zhang et al., 30 Aug 2025).
- Device-Cloud Two-Level Scheduling: LightAgent-style architectures introduce on-device agents (3B–6B scale) for baseline reasoning, with a controller that escalates only difficult or failed subtasks to cloud-backend models. Real-time complexity assessment and learnable switching modules balance cost and performance, achieving ~10–15% cloud API cost savings while staying within 9.4 SR points of pure-cloud models (Gemini-2.5-Pro/GPT-5) (Jiang et al., 24 Oct 2025).
- Scalability and Modality: Modular expansion to handle long histories (hierarchical memory, summary prompts), richer modalities (audio, sensor), and open-ended task flows is prioritized for future v2 developments.
Key limitations include ongoing reliance on high-cost large models for some components, simplistic memory when faced with very long or cross-day threads, and residual needs for manual knowledge curation in complex workflows. The anticipated research direction focuses on on-device distillation, hierarchical or windowed memory management, and further automation of knowledge base construction (Wang et al., 2024, Zhang et al., 30 Aug 2025, Jiang et al., 24 Oct 2025).
In summary, Mobile-Agent-v2 exemplifies state-of-the-art architectural decomposition for mobile UI agents, combining modular planning, reasoning, and reflection, advanced memory and perception modules, runtime experience replay, and pragmatic device–cloud scheduling. This yields robust improvements in both accuracy and efficiency, representing a scalable blueprint for real-world multimodal mobile operation assistants (Wang et al., 2024, Zhang et al., 30 Aug 2025, Jiang et al., 24 Oct 2025).