Mobile-Agent-RAG: Dual-Level Mobile Automation

Updated 22 November 2025
  • Mobile-Agent-RAG is a hierarchical multi-agent framework that integrates dual-level retrieval-augmented generation to enhance both planning and execution in mobile automation.
  • It employs separate knowledge bases for strategic planning (Manager-RAG) and precise app-specific actions (Operator-RAG), effectively reducing planning errors and UI mistakes.
  • Empirical results show notable improvements in task completion and efficiency compared to baselines, validating its robust design for long-horizon mobile tasks.

Mobile-Agent-RAG refers to a class of hierarchical multi-agent frameworks for mobile automation, distinguished by retrieval-augmented generation (RAG) at multiple control levels to drive reliable, long-horizon, cross-application task execution on mobile devices. In contrast with prior approaches that over-rely on the static internal knowledge of multimodal LLMs (MLLMs), Mobile-Agent-RAG explicitly partitions the planning and execution stages and equips each with a targeted, separately retrievable external knowledge base. The architecture, retrieval principles, empirical evaluation, and comparative results are synthesized below, drawing primarily on "Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation" (Zhou et al., 15 Nov 2025).

1. Hierarchical Multi-Agent Architecture with Dual-Level RAG

Mobile-Agent-RAG employs a hierarchical agent loop where, at each timestep $t$:

  • The Perceptor module processes the previous screenshot $S_{t-1}$, producing a fine-grained visual state $V_{t-1}$ using OCR and icon grounding.
  • The Manager agent $M$, augmented by Manager-RAG, decomposes a high-level user instruction $I$ into a plan $P_t$ and the next app-specific subtask $T_t^{app}$.
  • The Operator agent $O$, augmented by Operator-RAG, translates the current subtask and visual context into an atomic UI action $A_t$, then executes it via Android Debug Bridge (ADB).
  • The Perceptor updates the state to $V_t$, and the Action Reflector module compares $(S_{t-1}, V_{t-1})$ with $(S_t, V_t)$ to generate the outcome, a progress signal $G_t$, and error logs $L^e$.
  • The Notetaker accumulates extracted contextual information (e.g., phone numbers), maintaining dynamic notes $N_t$.

This division enables the architecture to handle fundamentally distinct knowledge requirements: high-level, validated strategic experience for planning, and granular, app-specific UI actions for execution. Manager-RAG reduces “strategic hallucinations” by leveraging human-verified plans; Operator-RAG mitigates operational UI errors by providing precise, contextually-aligned guidance. Feedback-driven planning and notetaking enhance reliability across multi-step, multi-app scenarios (Zhou et al., 15 Nov 2025).
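
A minimal Python sketch of this loop is given below. The control flow follows the description above, but all class interfaces (perceptor.capture, manager.plan, operator.decide, reflector.compare, the retrieve methods, and the 30-step cap taken from the benchmark protocol) are illustrative assumptions rather than the authors' released implementation.

```python
# Illustrative sketch of the Mobile-Agent-RAG loop; class and method names are
# hypothetical placeholders, not the authors' released API.
def run_task(instruction, perceptor, manager, operator, reflector, notetaker,
             manager_rag, operator_rag, max_steps=30):
    screenshot, visual_state = perceptor.capture()          # S_0, V_0 via OCR + icon grounding
    plan, notes, error_log = None, [], []
    for t in range(1, max_steps + 1):
        # Manager: retrieve k=3 human-verified plans, then (re)plan and emit a subtask.
        retrieved_plans = manager_rag.retrieve(instruction, k=3)
        plan, subtask = manager.plan(instruction, visual_state, plan,
                                     notes, error_log, retrieved_plans)
        # Operator: retrieve the single closest app-specific demonstration.
        demo = operator_rag.retrieve(subtask.text, app=subtask.app, k=1)
        action = operator.decide(subtask, visual_state, demo)   # atomic UI action A_t
        action.execute_via_adb()
        # Perceive the new state and reflect on the outcome.
        prev_screenshot, prev_state = screenshot, visual_state
        screenshot, visual_state = perceptor.capture()           # S_t, V_t
        outcome, progress, errors = reflector.compare(
            (prev_screenshot, prev_state), (screenshot, visual_state))  # G_t, L^e
        error_log.extend(errors)
        notes = notetaker.update(notes, visual_state)             # dynamic notes N_t
        if progress.task_complete:
            return True
    return False
```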

2. Retrieval-Augmented Knowledge Base Design

Mobile-Agent-RAG constructs two specialized retrieval-oriented knowledge bases:

  • Manager-RAG KB ($K_{MR}$): Approximately 50 entries, each a tuple $(I_{MR}, H_{MR})$ consisting of a natural-language instruction and a human-verified sequence of steps. Entries are indexed as 768-dimensional dense vectors (Contriever-MSMARCO encoder), enabling vector similarity search.
  • Operator-RAG KB ($K_{OR}^{app}$): For each app, 17–28 triplets $(T_{OR}, S_{OR}, A_{OR})$, where $T_{OR}$ is the subtask text (embedded as above), $S_{OR}$ is a reference screenshot, and $A_{OR}$ is the atomic action with its arguments. Each app maintains a dedicated vector index.

These KBs decouple high-level, strategic knowledge (global index) from low-level, app-specific action knowledge (per-app index), enabling precise, context-aware retrieval and action selection (Zhou et al., 15 Nov 2025).
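
The sketch below shows one way to represent these entries in Python. The tuple contents come from the paper, but the field names, dataclass layout, and in-memory index structure are assumptions made for illustration.

```python
# Illustrative data model for the two knowledge bases (schema is hypothetical).
from dataclasses import dataclass
import numpy as np

@dataclass
class ManagerEntry:            # (I_MR, H_MR) in the Manager-RAG KB
    instruction: str           # natural-language instruction
    steps: list[str]           # human-verified plan steps
    embedding: np.ndarray      # 768-d Contriever-MSMARCO vector of `instruction`

@dataclass
class OperatorEntry:           # (T_OR, S_OR, A_OR) in a per-app Operator-RAG KB
    subtask: str               # subtask text, embedded with the same encoder
    embedding: np.ndarray      # 768-d vector of `subtask`
    screenshot_path: str       # reference screenshot
    action: dict               # atomic action name plus arguments

# One global index for Manager-RAG, one dedicated index per app for Operator-RAG.
manager_index: list[ManagerEntry] = []
operator_indexes: dict[str, list[OperatorEntry]] = {}   # app name -> entries
```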

3. Retrieval-Augmentation Mechanisms and Pseudocode

Retrieval is conducted through vector similarity in a shared embedding space. Key formulas and algorithms include:

  • Manager-RAG Retrieval (top-$k$ with $k=3$):

$$ v_{\mathrm{query}} = f(I_{\mathrm{query}}), \qquad \mathrm{sim}_{MR}^{(i)} = \cos\!\big(v_{\mathrm{query}},\, f(I_{MR}^{(i)})\big) $$

The top-$k$ results $R_M$ are provided as context for plan prompting.

  • Operator-RAG Retrieval ($k=1$):

$$ v_{\mathrm{query}} = f(T_{\mathrm{query}}^{app}), \qquad \mathrm{sim}_{OR}^{(i)} = \cos\!\big(v_{\mathrm{query}},\, f(T_{OR}^{(i)})\big) $$

The top match $R_O$ informs atomic action prompting.

These mechanisms are instantiated in workflow pseudocode for both Manager (planning) and Operator (execution) agents, demonstrating how retrieval integrates into contextual prompting for the MLLM (Zhou et al., 15 Nov 2025).
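
A compact sketch of this retrieval step is shown below, using numpy cosine similarity over Contriever-MSMARCO embeddings and the illustrative index structures from Section 2. Loading the encoder through sentence-transformers' generic wrapper is an assumption; the paper specifies the encoder and the top-$k$ values but not a particular library.

```python
# Dense retrieval sketch matching the formulas above: Contriever-MSMARCO
# embeddings and cosine-similarity top-k search.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("facebook/contriever-msmarco")  # 768-d embeddings

def retrieve(query: str, entries, k: int):
    """Return the k entries whose embedded key text is most cosine-similar to the query."""
    v_query = encoder.encode(query)
    v_query = v_query / np.linalg.norm(v_query)
    scored = []
    for e in entries:
        v = e.embedding / np.linalg.norm(e.embedding)
        scored.append((float(v_query @ v), e))        # cos(v_query, v_i)
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:k]]

# Manager-RAG: top-3 verified plans for the user instruction (global index).
plans = retrieve(user_instruction, manager_index, k=3)
# Operator-RAG: single best demonstration from the current app's dedicated index.
demo = retrieve(current_subtask, operator_indexes[current_app], k=1)[0]
```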

4. Mobile-Eval-RAG Benchmark and Evaluation Metrics

Mobile-Eval-RAG establishes a task suite of 50 realistic, long-horizon mobile tasks across five categories: 20 simple tasks (information search, trending analysis) and 30 complex tasks (restaurant recommendation, online shopping, travel planning). Each task involves 2–3 apps, with an average of 16.9 steps. Evaluation metrics include:

  • Success Rate (SR): Percentage of tasks completed within 30 steps, without repetitive actions, and judged correct.
  • Completion Rate (CR): Proportion of subgoals completed.
  • Operator Accuracy (OA): Correct atomic operations over total steps.
  • Reflector Accuracy (RA): Reflection accuracy per step.
  • Steps and Efficiency (CR divided by average steps): Measures of process economy.

These metrics jointly adjudicate planning quality, execution fidelity, and overall efficiency (Zhou et al., 15 Nov 2025).
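
As an illustration of how these metrics combine, the sketch below computes them from hypothetical per-task logs; the aggregation scheme (macro-averaging over tasks for SR/CR, micro-averaging over steps for OA/RA) is an assumption, since the paper defines the metrics but not the exact bookkeeping.

```python
# Hypothetical per-task log: {"success": bool (done within 30 steps, judged correct),
#  "subgoals_done": int, "subgoals_total": int, "steps": int,
#  "correct_ops": int, "correct_reflections": int}
def evaluate(tasks):
    n = len(tasks)
    total_steps = sum(t["steps"] for t in tasks)
    sr = 100 * sum(t["success"] for t in tasks) / n                              # Success Rate
    cr = 100 * sum(t["subgoals_done"] / t["subgoals_total"] for t in tasks) / n  # Completion Rate
    oa = 100 * sum(t["correct_ops"] for t in tasks) / total_steps                # Operator Accuracy
    ra = 100 * sum(t["correct_reflections"] for t in tasks) / total_steps        # Reflector Accuracy
    avg_steps = total_steps / n
    return {"SR": sr, "CR": cr, "OA": oa, "RA": ra,
            "Steps": avg_steps, "Efficiency": cr / avg_steps}                    # Efficiency = CR per step
```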

5. Empirical Results and Comparative Analysis

On the Mobile-Eval-RAG benchmark, using a Gemini-1.5-Pro backbone, Mobile-Agent-RAG outperforms strong baselines, as shown below:

Method                 CR (%)   Steps   Efficiency   SR (%)
Mobile-Agent-E         58.3     22.4    2.60         48.0
Mobile-Agent-E + Evo   61.2     21.8    2.81         56.0
Mobile-Agent-RAG       75.7     18.8    4.03         76.0

Key gains: +11.0 percentage points in completion rate (64.7% → 75.7%) and +10.2% in efficiency (2.60 → 4.03). Ablations show that removing Manager-RAG results in a ~14 point CR drop, while removing Operator-RAG incurs a >20 point loss in OA, efficiency, and SR. Results are robust across other MLLMs (e.g., GPT-4o, Claude-3.5). Performance remains stable over repeated trials, and gains are well-distributed across all measured dimensions (Zhou et al., 15 Nov 2025).

6. Broader Context and Methodological Comparisons

Mobile-Agent-RAG’s design directly addresses strategic hallucination (high-level planning errors) and operational failures (low-level UI mistakes) by aligning retrieval granularity with functional agent roles. The dual-level RAG approach is sharply differentiated from prior agents that rely primarily on monolithic retrieval or static model priors.

Related work in the mobile agent RAG space includes flexible knowledge base documentation via exploration and RAG-driven deployment (AppAgent v2 (Li et al., 5 Aug 2024)), modular RAG architectures (MobileRAG (Loo et al., 4 Sep 2025)), on-device RAG with memory and power optimizations (MobileRAG/EcoVector (Park et al., 1 Jul 2025)), and distributed/edge-collaborative hybrid deployments (EACO-RAG (Li et al., 27 Oct 2024)). Knowledge-augmented fine-tuning further improves agent factuality and robustness by aligning runtime knowledge exposure with training conditions (Cai et al., 28 Jun 2025), but Mobile-Agent-RAG is distinctive in its hierarchical separation of strategic and operational knowledge and explicit dual-level retrieval logic.

7. Significance and Future Directions

Mobile-Agent-RAG establishes a robust, context-aware paradigm for multi-agent mobile automation, providing systematic gains in task completion and process efficiency while addressing generalization limitations of vanilla MLLM-based agents. Empirical evidence demonstrates balanced improvements across planning, execution, reliability, and efficiency. Remaining challenges include scaling specialized retrieval KBs, adapting to on-device resource constraints, and handling highly dynamic or previously unseen UI/app states. Explorations into more sophisticated edge/cloud hybrid deployment, memory-efficient local retrieval caches, and end-to-end co-training of retrieval and language modules represent anticipated avenues to further mature Mobile-Agent-RAG systems (Zhou et al., 15 Nov 2025, Lin et al., 4 Nov 2025, Li et al., 27 Oct 2024).
