Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks (2501.11733v2)

Published 20 Jan 2025 in cs.CL and cs.CV

Abstract: Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.

Summary

  • The paper introduces Mobile-Agent-E, a hierarchical framework that improves mobile agent reasoning and task planning through self-evolution.
  • It employs subordinate agents—Perceptor, Operator, Action Reflector, and Notetaker—to decompose tasks and update strategies via persistent memory.
  • Mobile-Agent-E outperforms prior approaches by 22.1% absolute, and enabling self-evolution yields a further 6.5% absolute gain along with improved efficiency.

The paper introduces Mobile-Agent-E, a hierarchical multi-agent framework designed to enhance the performance and efficiency of mobile agents in complex, real-world tasks. The key innovation lies in its capacity for self-evolution through the utilization of past experiences, addressing limitations of existing mobile agents that struggle with reasoning-intensive, long-horizon tasks and lack mechanisms for continuous learning.

The Mobile-Agent-E framework consists of a Manager and four subordinate agents: Perceptor ($\mathcal{A}_P$), Operator, Action Reflector, and Notetaker. The Manager is responsible for high-level planning, decomposing complex tasks into subgoals. The Perceptor handles fine-grained visual perception, using OCR, icon grounding, and icon captioning models to generate a list of texts and icons with corresponding coordinates, $W_V^t$, from the screenshot $s_t$. The Operator executes immediate actions. The Action Reflector verifies action outcomes, and the Notetaker aggregates information. This hierarchical structure facilitates improved long-term planning and error recovery.
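
A minimal sketch of how one task episode could flow through these roles, assuming hypothetical `manager`, `perceptor`, `operator`, `reflector`, and `notetaker` callables and an `env` device interface (all names are illustrative, not the authors' implementation):

```python
# Illustrative sketch of one Mobile-Agent-E task episode (names are hypothetical).
def run_task(task_query, env, manager, perceptor, operator, reflector, notetaker,
             tips, shortcuts, max_steps=40):
    plan = subgoal = progress = notes = None
    actions, errors = [], []               # action history W_A and error history W_E
    screenshot = env.screenshot()          # initial phone state s_0
    for t in range(max_steps):
        # Manager: high-level planning; replans with recent errors on escalation.
        plan, subgoal = manager(task_query, screenshot, plan, subgoal,
                                progress, notes, shortcuts, errors)
        if subgoal is None:                # hypothetical convention: task finished
            break
        # Perceptor: fine-grained visual perception W_V^t of the current screen.
        percepts = perceptor(screenshot)
        # Operator: choose the next atomic operation (or Shortcut) and execute it.
        action = operator(task_query, screenshot, percepts, plan, subgoal,
                          progress, notes, actions[-5:], errors[-5:],
                          shortcuts, tips)
        next_screenshot = env.execute(action)
        # Action Reflector: verify the outcome against the post-action screen.
        next_percepts = perceptor(next_screenshot)
        outcome, progress = reflector(task_query, screenshot, percepts,
                                      next_screenshot, next_percepts,
                                      action, subgoal, progress)
        (actions if outcome["status"] == "success" else errors).append(outcome)
        # Notetaker: aggregate task-relevant information for later subgoals.
        notes = notetaker(task_query, next_screenshot, next_percepts,
                          plan, subgoal, progress, notes)
        screenshot = next_screenshot
    return progress, actions, errors, notes
```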

The framework incorporates a self-evolution module that maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidelines derived from previous interactions, akin to lessons encoded in episodic memory. Shortcuts are executable sequences of atomic operations tailored for specific subroutines, analogous to procedural knowledge. After each task, Experience Reflectors ($\mathcal{A}_{ES}$ for Shortcuts and $\mathcal{A}_{ET}$ for Tips) update these components based on the interaction history. This allows the Manager and Operator to refine their planning and action decisions over time.
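
A rough illustration of how this persistent memory could be structured; the field names below are assumptions for the sketch, not the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class Shortcut:
    """A reusable, executable sequence of atomic operations for a subroutine."""
    name: str                      # e.g. "search_in_app" (hypothetical example)
    arguments: list[str]           # parameters the subroutine expects
    precondition: str              # when it is valid to invoke this Shortcut
    atomic_operations: list[dict]  # ordered atomic actions (tap, type, swipe, ...)

@dataclass
class LongTermMemory:
    """Persistent memory updated by the Experience Reflectors after each task."""
    tips: list[str] = field(default_factory=list)                 # L_T: general guidance
    shortcuts: dict[str, Shortcut] = field(default_factory=dict)  # L_S: named subroutines
```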

The paper introduces Mobile-Eval-E, a new benchmark designed to evaluate complex, real-world mobile tasks. It includes more than twice the number of expected operations per task compared to existing benchmarks. The Satisfaction Score (SS) metric is introduced to address the lack of binary success flags in real-world tasks, computed from human-written rubrics that account for milestone completion and exploratory behaviors. The Satisfaction Score vs Steps (SSS) curve is proposed to evaluate the efficiency of mobile agents.
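
As a hedged sketch, a rubric-based Satisfaction Score can be read as the fraction of rubric items a trajectory fulfills; the equal weighting below is an assumption for illustration:

```python
def satisfaction_score(rubric_items, satisfied):
    """Fraction of human-written rubric items the trajectory fulfills.

    rubric_items: list of rubric strings (milestones and exploratory behaviors).
    satisfied:    set of rubric items judged as met for this trajectory.
    """
    if not rubric_items:
        return 0.0
    return sum(item in satisfied for item in rubric_items) / len(rubric_items)

# Example: 3 of 4 rubric items met gives SS = 0.75; plotting SS against the
# number of steps taken yields the Satisfaction Score vs Steps (SSS) curve.
```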

The Manager's planning process is mathematically represented as:

$$W_P^t, W_S^t = \mathcal{A}_M(I, s_t, W_P^{t-1}, W_S^{t-1}, W_G^{t-1}, W_N^{t-1}, L_S) \quad \text{if } t \geq 0 \text{ and } W_{EF}^{t-1} = \text{False}$$

$$W_P^t, W_S^t = \mathcal{A}_M(I, s_t, W_P^{t-1}, W_S^{t-1}, W_G^{t-1}, W_N^{t-1}, L_S, \mathbf{W_E}[-k:]) \quad \text{if } t \geq k \text{ and } W_{EF}^{t-1} = \text{True}$$

Where:

  • $W_P^t$ is the overall plan at time $t$
  • $W_S^t$ is the current subgoal at time $t$
  • $\mathcal{A}_M$ is the Manager agent
  • $I$ is the input task query
  • $s_t$ is the phone state (screenshot) at time $t$
  • $W_P^{t-1}$ is the previous overall plan
  • $W_S^{t-1}$ is the previous subgoal
  • $W_G^{t-1}$ is the progress status at time $t-1$
  • $W_N^{t-1}$ are the important notes at time $t-1$
  • $L_S$ are the Shortcuts
  • $W_{EF}^{t-1}$ is the Error Escalation Flag at time $t-1$
  • $\mathbf{W_E}[-k:]$ represents the $k$ most recent errors
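
In code, the two cases above reduce to including the recent error history only when the Error Escalation Flag was raised on the previous step; a minimal sketch with an injected, hypothetical `manager_lmm` callable:

```python
def manager_step(manager_lmm, task_query, screenshot, prev_plan, prev_subgoal,
                 prev_progress, prev_notes, shortcuts, error_history,
                 error_escalated, t, k=2):
    """Return (W_P^t, W_S^t); include recent errors only on error escalation."""
    if error_escalated and t >= k:
        # W_EF^{t-1} == True: also show the Manager the last k errors, W_E[-k:].
        return manager_lmm(task_query, screenshot, prev_plan, prev_subgoal,
                           prev_progress, prev_notes, shortcuts,
                           recent_errors=error_history[-k:])
    # W_EF^{t-1} == False: plan from the current state and prior context only.
    return manager_lmm(task_query, screenshot, prev_plan, prev_subgoal,
                       prev_progress, prev_notes, shortcuts, recent_errors=None)
```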

The Operator's action decision is formalized as:

$$a_t = \mathcal{A}_O(I, s_t, W_V^t, W_P^t, W_S^t, W_G^t, W_N^t, \mathbf{W_A}[-m:], \mathbf{W_E}[-m:], L_S, L_T)$$

Where:

  • $a_t$ is the action at time $t$
  • $\mathcal{A}_O$ is the Operator agent
  • $I$ is the input task query
  • $s_t$ is the phone state (screenshot) at time $t$
  • $W_V^t$ is the visual perception result at time $t$
  • $W_P^t$ is the overall plan at time $t$
  • $W_S^t$ is the current subgoal at time $t$
  • $W_G^t$ is the progress status at time $t$
  • $W_N^t$ are the important notes at time $t$
  • $\mathbf{W_A}[-m:]$ is a history of the latest $m$ actions
  • $\mathbf{W_E}[-m:]$ is a history of the latest $m$ errors
  • $L_S$ are the Shortcuts
  • $L_T$ are the Tips
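
Because the Operator can emit either an atomic operation or a stored Shortcut, executing its decision amounts to expanding Shortcuts into their underlying operation sequences; a sketch reusing the illustrative `Shortcut` structure above (the action schema is an assumption):

```python
def execute_action(env, action, shortcuts):
    """Execute an atomic operation or a named Shortcut (illustrative schema)."""
    if action["type"] == "shortcut":
        # Expand the Shortcut into its stored sequence of atomic operations.
        for op in shortcuts[action["name"]].atomic_operations:
            env.execute(op)
    else:
        env.execute(action)   # single atomic operation (tap, type, swipe, ...)
    return env.screenshot()   # post-action state s_{t+1} for the Action Reflector
```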

The Perceptor's visual perception process is:

$$W_V^t = \mathcal{A}_P(s_t)$$

Where:

  • $W_V^t$ is the visual perception result at time $t$
  • $\mathcal{A}_P$ is the Perceptor agent
  • $s_t$ is the phone state (screenshot) at time $t$
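
A schematic of that perception step, assuming generic OCR, icon-grounding, and icon-captioning components; the component interfaces are placeholders, not the paper's implementation:

```python
def perceive(screenshot, ocr_model, icon_detector, icon_captioner):
    """Return W_V^t: texts and icons with screen coordinates (illustrative)."""
    percepts = []
    for text, box in ocr_model(screenshot):        # OCR: on-screen text regions
        percepts.append({"kind": "text", "content": text, "box": box})
    for box in icon_detector(screenshot):          # icon grounding: locate icons
        caption = icon_captioner(screenshot, box)  # icon captioning: describe them
        percepts.append({"kind": "icon", "content": caption, "box": box})
    return percepts
```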

The Action Reflector's outcome verification and update process is:

$$W_V^{t+1} = \mathcal{A}_P(s_{t+1})$$

$$\mathbf{W_A}[t], \mathbf{W_E}[t], W_G^t = \mathcal{A}_R(I, s_t, W_V^t, s_{t+1}, W_V^{t+1}, a_t, W_S^t, W_G^{t-1})$$

Where:

  • $W_V^{t+1}$ is the visual perception result at time $t+1$
  • $\mathcal{A}_P$ is the Perceptor agent
  • $s_{t+1}$ is the phone state (screenshot) at time $t+1$
  • $\mathbf{W_A}[t]$ is the updated action history at time $t$
  • $\mathbf{W_E}[t]$ is the updated error history at time $t$
  • $W_G^t$ is the progress status at time $t$
  • $\mathcal{A}_R$ is the Action Reflector agent
  • $I$ is the input task query
  • $s_t$ is the phone state (screenshot) at time $t$
  • $W_V^t$ is the visual perception result at time $t$
  • $a_t$ is the action at time $t$
  • $W_S^t$ is the current subgoal at time $t$
  • $W_G^{t-1}$ is the previous progress status
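
Conceptually, the Action Reflector compares perception before and after the action and routes the result into either the action history or the error history; a minimal sketch (the outcome schema is an assumption):

```python
def reflect(reflector_lmm, task_query, screenshot, percepts, next_screenshot,
            next_percepts, action, subgoal, prev_progress,
            action_history, error_history):
    """Classify the action outcome and update W_A, W_E, and W_G (illustrative)."""
    outcome, progress = reflector_lmm(task_query, screenshot, percepts,
                                      next_screenshot, next_percepts,
                                      action, subgoal, prev_progress)
    if outcome["status"] == "success":
        action_history.append({"action": action, "outcome": outcome})
    else:
        # Failed or ineffective actions are recorded so the Operator (and, on
        # escalation, the Manager) can avoid repeating the same mistake.
        error_history.append({"action": action, "outcome": outcome})
    return progress
```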

The Notetaker's information aggregation is:

$$W_N^t = \mathcal{A}_N(I, s_{t+1}, W_V^{t+1}, W_P^t, W_S^t, W_G^t, W_N^{t-1})$$

Where:

  • $W_N^t$ are the important notes at time $t$
  • $\mathcal{A}_N$ is the Notetaker agent
  • $I$ is the input task query
  • $s_{t+1}$ is the phone state (screenshot) at time $t+1$
  • $W_V^{t+1}$ is the visual perception result at time $t+1$
  • $W_P^t$ is the overall plan at time $t$
  • $W_S^t$ is the current subgoal at time $t$
  • $W_G^t$ is the progress status at time $t$
  • $W_N^{t-1}$ are the important notes at time $t-1$
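
A thin sketch of the Notetaker call; the paper specifies only its inputs and output, so the comments below are illustrative:

```python
def take_notes(notetaker_lmm, task_query, next_screenshot, next_percepts,
               plan, subgoal, progress, prev_notes):
    """Return W_N^t: task-relevant information aggregated so far (illustrative)."""
    # Typical content might be copied text, prices, or confirmation details that
    # appear on the new screen and that later subgoals will need.
    return notetaker_lmm(task_query, next_screenshot, next_percepts,
                         plan, subgoal, progress, prev_notes)
```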

The Experience Reflector's update of Tips and Shortcuts is:

$$L_T = \mathcal{A}_{ET}(I, W_P^\tau, W_G^\tau, \mathbf{W_A}, \mathbf{W_E}, T_F, L_T)$$

$$L_S = \mathcal{A}_{ES}(I, W_P^\tau, W_G^\tau, \mathbf{W_A}, \mathbf{W_E}, T_F, L_S)$$

Where:

  • $L_T$ are the Tips
  • $\mathcal{A}_{ET}$ is the Experience Reflector for Tips
  • $L_S$ are the Shortcuts
  • $\mathcal{A}_{ES}$ is the Experience Reflector for Shortcuts
  • $I$ is the input task query
  • $W_P^\tau$ is the final overall plan
  • $W_G^\tau$ is the final progress status
  • $\mathbf{W_A}$ is the action history
  • $\mathbf{W_E}$ is the error history
  • $T_F$ is a list of future tasks
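
A hedged sketch of this post-task update, reusing the illustrative `LongTermMemory` structure above (function names are hypothetical):

```python
def evolve_memory(tip_reflector, shortcut_reflector, task_query, final_plan,
                  final_progress, action_history, error_history, future_tasks,
                  memory):
    """Update persistent Tips (L_T) and Shortcuts (L_S) after a task (sketch)."""
    memory.tips = tip_reflector(task_query, final_plan, final_progress,
                                action_history, error_history, future_tasks,
                                memory.tips)
    memory.shortcuts = shortcut_reflector(task_query, final_plan, final_progress,
                                          action_history, error_history,
                                          future_tasks, memory.shortcuts)
    return memory  # carried over to subsequent tasks, enabling self-evolution
```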

Empirical results demonstrate that Mobile-Agent-E achieves a 22.1% average absolute gain over previous state-of-the-art approaches across three different foundation model backbones, including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro. Enabling self-evolution (Mobile-Agent-E + Evo) results in a 6.5% absolute improvement compared to no evolution, along with a reduction in computational overhead due to the incorporation of Shortcuts. The progressive impact of self-evolution is shown through the increased benefits observed in later tasks. The use of evolved Tips also contributes distinct benefits to the model's performance.
