Mobile-Agent-RAG: Dual-Level Mobile Automation
- Mobile-Agent-RAG is a hierarchical multi-agent framework that integrates dual-level retrieval-augmented generation to enhance both planning and execution in mobile automation.
- It employs separate knowledge bases for strategic planning (Manager-RAG) and precise app-specific actions (Operator-RAG), effectively reducing planning errors and UI mistakes.
- Empirical results show notable improvements in task completion and efficiency compared to baselines, validating its robust design for long-horizon mobile tasks.
Mobile-Agent-RAG refers to a class of hierarchical multi-agent frameworks for mobile automation, distinguished by retrieval-augmented generation (RAG) at multiple control levels to drive reliable, long-horizon, cross-application task execution on mobile devices. In contrast with prior approaches that over-rely on the static internal knowledge of multimodal LLMs (MLLMs), Mobile-Agent-RAG explicitly partitions the planning and execution stages and equips each with a targeted, separately retrievable external knowledge base. The architecture, retrieval principles, empirical evaluation, and comparative results are synthesized below, drawing primarily on "Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation" (Zhou et al., 15 Nov 2025).
1. Hierarchical Multi-Agent Architecture with Dual-Level RAG
Mobile-Agent-RAG employs a hierarchical agent loop where, at each timestep $t$:
- The Perceptor module processes the previous screenshot $S_{t-1}$, producing a fine-grained visual state $V_t$ using OCR and icon grounding.
- The Manager agent, augmented by Manager-RAG, decomposes the high-level user instruction $I$ into a plan $P_t$ and the next app-specific subtask $G_t$.
- The Operator agent, augmented by Operator-RAG, translates the current subtask $G_t$ and visual context $V_t$ into an atomic UI action $A_t$, then executes it via the Android Debug Bridge (ADB).
- The Perceptor updates the state to $S_t$, and the Action Reflector module compares $S_{t-1}$ and $S_t$ to generate an outcome, a progress signal, and error logs.
- The Notetaker accumulates extracted contextual information (e.g., phone numbers), maintaining dynamic notes $N_t$.
This division enables the architecture to handle fundamentally distinct knowledge requirements: high-level, validated strategic experience for planning, and granular, app-specific UI actions for execution. Manager-RAG reduces “strategic hallucinations” by leveraging human-verified plans; Operator-RAG mitigates operational UI errors by providing precise, contextually-aligned guidance. Feedback-driven planning and notetaking enhance reliability across multi-step, multi-app scenarios (Zhou et al., 15 Nov 2025).
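The control loop described above can be sketched compactly as pseudocode. The following Python sketch is illustrative only: the module interfaces (`Perceptor.parse`, `Manager.plan`, `Operator.decide`, `Reflector.compare`, `Notetaker.update`), the retrieval calls, and the ADB wrapper are assumed names, not the framework's actual API.

```python
# Hypothetical sketch of the Mobile-Agent-RAG control loop; all class and method
# names are illustrative assumptions rather than the published implementation.
def run_episode(instruction, perceptor, manager, operator, reflector, notetaker,
                manager_rag, operator_rag, adb, max_steps=30):
    screenshot = adb.capture_screenshot()                          # S_0: initial screen
    notes, error_logs = [], []
    for t in range(max_steps):
        visual_state = perceptor.parse(screenshot)                 # OCR + icon grounding -> V_t
        exemplars = manager_rag.retrieve(instruction, k=3)         # top-k human-verified plans
        plan, subtask = manager.plan(instruction, visual_state,
                                     notes, error_logs, exemplars)     # P_t, G_t
        demo = operator_rag.retrieve(subtask, app=plan.current_app)    # closest (g, s, a) triplet
        action = operator.decide(subtask, visual_state, demo)      # atomic UI action A_t
        adb.execute(action)                                        # tap / type / swipe via ADB
        next_screenshot = adb.capture_screenshot()                 # S_t
        outcome = reflector.compare(screenshot, next_screenshot, action)
        if outcome.error:
            error_logs.append(outcome.error)
        notes = notetaker.update(notes, next_screenshot)           # e.g., extracted phone numbers
        if outcome.task_done:
            break
        screenshot = next_screenshot
    return notes, error_logs
```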
2. Retrieval-Augmented Knowledge Base Design
Mobile-Agent-RAG constructs two specialized retrieval-oriented knowledge bases:
- Manager-RAG KB ($\mathcal{K}_M$): Approximately 50 entries, each a tuple $(I_i, P_i)$ pairing a natural language instruction with a human-verified sequence of steps. Entries are indexed as 768-dimensional dense vectors (Contriever-MSMARCO encoder), facilitating vector similarity search.
- Operator-RAG KB ($\mathcal{K}_O$): 17–28 triplets $(g_j, s_j, a_j)$ per app, where $g_j$ is the subtask text (embedded as above), $s_j$ is a reference screenshot, and $a_j$ is the atomic action with its arguments. Each app maintains a dedicated vector index.
These KBs decouple high-level, strategic knowledge (global index) from low-level, app-specific action knowledge (per-app index), enabling precise, context-aware retrieval and action selection (Zhou et al., 15 Nov 2025).
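A minimal sketch of how these two knowledge bases could be represented and indexed is given below. The dataclass and field names and the `search` helper are assumptions for illustration; the paper specifies only the entry contents, the 768-dimensional Contriever-MSMARCO embeddings, and the global-versus-per-app index split.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ManagerEntry:
    instruction: str          # natural-language task instruction I_i
    plan_steps: list[str]     # human-verified sequence of steps P_i

@dataclass
class OperatorEntry:
    subtask: str              # subtask text g_j
    screenshot_path: str      # reference screenshot s_j
    action: dict              # atomic action with arguments a_j, e.g. {"op": "tap", "target": "Search"}

@dataclass
class VectorIndex:
    """Dense index over 768-d embeddings: one global index for the Manager KB,
    one dedicated index per app for the Operator KB."""
    vectors: np.ndarray       # shape (n_entries, 768)
    entries: list             # ManagerEntry or OperatorEntry objects, aligned with `vectors`

    def search(self, query_vec: np.ndarray, k: int = 1) -> list:
        # Cosine similarity between the query embedding and every stored entry.
        sims = (self.vectors @ query_vec) / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(query_vec) + 1e-8)
        top = np.argsort(-sims)[:k]
        return [self.entries[i] for i in top]
```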
3. Retrieval-Augmentation Mechanisms and Pseudocode
Retrieval is conducted through vector similarity in a shared embedding space, with an encoder $E(\cdot)$ mapping text to dense vectors. Key formulas include:
- Manager-RAG Retrieval (Top-$k$): for the user instruction $I$, the $k$ entries of $\mathcal{K}_M$ most similar to it are retrieved,
$$\mathcal{R}_M(I) = \operatorname{top\text{-}k}_{(I_i,\, P_i)\, \in\, \mathcal{K}_M} \cos\!\big(E(I),\, E(I_i)\big),$$
and the top-$k$ results are provided for plan prompting.
- Operator-RAG Retrieval (Top-$1$): for the current subtask $G_t$, the single closest triplet in the active app's index $\mathcal{K}_O^{\mathrm{app}}$ is retrieved,
$$\mathcal{R}_O(G_t) = \operatorname*{arg\,max}_{(g_j,\, s_j,\, a_j)\, \in\, \mathcal{K}_O^{\mathrm{app}}} \cos\!\big(E(G_t),\, E(g_j)\big),$$
and the top match informs atomic action prompting.
These mechanisms are instantiated in workflow pseudocode for both Manager (planning) and Operator (execution) agents, demonstrating how retrieval integrates into contextual prompting for the MLLM (Zhou et al., 15 Nov 2025).
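As a concrete illustration, the retrieval calls can be wired into the agents' prompts as in the sketch below, reusing the hypothetical `VectorIndex` from Section 2. The `embed()` helper and the prompt templates are assumptions; they stand in for the paper's actual encoder and prompting pseudocode.

```python
def retrieve_manager_exemplars(instruction: str, manager_index, embed, k: int = 3):
    """Top-k retrieval over the global Manager KB (instruction -> human-verified plan)."""
    return manager_index.search(embed(instruction), k=k)

def retrieve_operator_demo(subtask: str, per_app_indexes: dict, app: str, embed):
    """Top-1 retrieval over the per-app Operator KB (subtask -> (g, s, a) triplet)."""
    hits = per_app_indexes[app].search(embed(subtask), k=1)
    return hits[0] if hits else None

def build_manager_prompt(instruction, visual_state, exemplars):
    # Retrieved, human-verified plans are serialized as in-context exemplars for the MLLM planner.
    shots = "\n\n".join(
        f"Instruction: {e.instruction}\nVerified plan:\n" + "\n".join(e.plan_steps)
        for e in exemplars)
    return (f"{shots}\n\nCurrent instruction: {instruction}\n"
            f"Current screen: {visual_state}\nNext subtask:")

def build_operator_prompt(subtask, visual_state, demo):
    # The single closest demonstration grounds the atomic-action decision.
    hint = f"Reference subtask: {demo.subtask}\nReference action: {demo.action}\n" if demo else ""
    return f"{hint}Subtask: {subtask}\nCurrent screen: {visual_state}\nAtomic action:"
```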
4. Mobile-Eval-RAG Benchmark and Evaluation Metrics
Mobile-Eval-RAG establishes a task suite of 50 realistic, long-horizon mobile tasks across five categories: 20 simple tasks (information search, trending analysis) and 30 complex tasks (restaurant recommendation, online shopping, travel planning), each involving 2–3 apps with an average of 16.9 steps per task. Evaluation metrics include:
- Success Rate (SR): Percentage of tasks completed within 30 steps, without repetitive actions and with a correct final judgment.
- Completion Rate (CR): Proportion of subgoals completed.
- Operator Accuracy (OA): Fraction of atomic operations executed correctly over total steps.
- Reflector Accuracy (RA): Fraction of steps for which the Action Reflector correctly judges the action outcome.
- Steps: Average number of steps executed per task.
- Efficiency (CR divided by Steps): Measures process economy.
These metrics jointly adjudicate planning quality, execution fidelity, and overall efficiency (Zhou et al., 15 Nov 2025).
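A per-task computation of these metrics might look as follows; the log fields passed in (subgoal counts, correct-operation counts, and so on) are assumptions about how an evaluator could record them, not the benchmark's published harness.

```python
def evaluate_episode(subgoals_done, subgoals_total, correct_ops, total_steps,
                     correct_reflections, task_succeeded, step_limit=30):
    """Illustrative per-task computation of the Mobile-Eval-RAG metrics."""
    cr = 100.0 * subgoals_done / subgoals_total                  # Completion Rate (%)
    oa = 100.0 * correct_ops / total_steps                       # Operator Accuracy (%)
    ra = 100.0 * correct_reflections / total_steps               # Reflector Accuracy (%)
    sr = 100.0 if (task_succeeded and total_steps <= step_limit) else 0.0  # Success (per task)
    efficiency = cr / total_steps                                # CR divided by Steps
    return {"CR": cr, "OA": oa, "RA": ra, "SR": sr,
            "Steps": total_steps, "Efficiency": efficiency}
```

Benchmark-level numbers would then be averages of these per-task values over the 50 tasks.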
5. Empirical Results and Comparative Analysis
On the Mobile-Eval-RAG benchmark, using a Gemini-1.5-Pro backbone, Mobile-Agent-RAG outperforms strong baselines as shown below:
| Method | CR (%) | Steps | Efficiency | SR (%) |
|---|---|---|---|---|
| Mobile-Agent-E | 58.3 | 22.4 | 2.60 | 48.0 |
| Mobile-Agent-E+Evo | 61.2 | 21.8 | 2.81 | 56.0 |
| Mobile-Agent-RAG | 75.7 | 18.8 | 4.03 | 76.0 |
Key gains: +11.0 percentage points in completion rate and +10.2% in efficiency. Ablations show that removing Manager-RAG results in a roughly 14-point CR drop, while removing Operator-RAG incurs losses of more than 20 points in OA, efficiency, and SR. Results are robust across other MLLM backbones (e.g., GPT-4o, Claude-3.5), performance remains stable over repeated trials, and gains are well distributed across all measured dimensions (Zhou et al., 15 Nov 2025).
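As a quick consistency check on the table, the Efficiency column equals CR divided by Steps for every row:

```python
# Consistency check: reported Efficiency = CR / Steps for each method in the table.
rows = {"Mobile-Agent-E": (58.3, 22.4),
        "Mobile-Agent-E+Evo": (61.2, 21.8),
        "Mobile-Agent-RAG": (75.7, 18.8)}
for name, (cr, steps) in rows.items():
    print(f"{name}: {cr / steps:.2f}")   # -> 2.60, 2.81, 4.03
```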
6. Broader Context and Methodological Comparisons
Mobile-Agent-RAG’s design directly addresses strategic hallucination (high-level planning errors) and operational failures (low-level UI mistakes) by aligning retrieval granularity with functional agent roles. The dual-level RAG approach is sharply differentiated from prior agents that rely primarily on monolithic retrieval or static model priors.
Related work in the mobile agent RAG space includes flexible knowledge base documentation via exploration and RAG-driven deployment (AppAgent v2 (Li et al., 5 Aug 2024)), modular RAG architectures (MobileRAG (Loo et al., 4 Sep 2025)), on-device RAG with memory and power optimizations (MobileRAG/EcoVector (Park et al., 1 Jul 2025)), and distributed/edge-collaborative hybrid deployments (EACO-RAG (Li et al., 27 Oct 2024)). Knowledge-augmented fine-tuning further improves agent factuality and robustness by aligning runtime knowledge exposure with training conditions (Cai et al., 28 Jun 2025), but Mobile-Agent-RAG is distinctive in its hierarchical separation of strategic and operational knowledge and explicit dual-level retrieval logic.
7. Significance and Future Directions
Mobile-Agent-RAG establishes a robust, context-aware paradigm for multi-agent mobile automation, providing systematic gains in task completion and process efficiency while addressing generalization limitations of vanilla MLLM-based agents. Empirical evidence demonstrates balanced improvements across planning, execution, reliability, and efficiency. Remaining challenges include scaling specialized retrieval KBs, adapting to on-device resource constraints, and handling highly dynamic or previously unseen UI/app states. Explorations into more sophisticated edge/cloud hybrid deployment, memory-efficient local retrieval caches, and end-to-end co-training of retrieval and language modules represent anticipated avenues to further mature Mobile-Agent-RAG systems (Zhou et al., 15 Nov 2025, Lin et al., 4 Nov 2025, Li et al., 27 Oct 2024).