Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks (2501.11733v2)

Published 20 Jan 2025 in cs.CL and cs.CV

Abstract: Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.

Summary

  • The paper introduces Mobile-Agent-E, a hierarchical framework that improves mobile agent reasoning and task planning through self-evolution.
  • It employs subordinate agents—Perceptor, Operator, Action Reflector, and Notetaker—to decompose tasks and update strategies via persistent memory.
  • Mobile-Agent-E outperforms prior approaches by 22.1% absolute, and enabling self-evolution yields a further 6.5% absolute gain along with improved efficiency.

The paper introduces Mobile-Agent-E, a hierarchical multi-agent framework designed to enhance the performance and efficiency of mobile agents in complex, real-world tasks. The key innovation lies in its capacity for self-evolution through the utilization of past experiences, addressing limitations of existing mobile agents that struggle with reasoning-intensive, long-horizon tasks and lack mechanisms for continuous learning.

The Mobile-Agent-E framework consists of a Manager and four subordinate agents: Perceptor ($\mathcal{A}_P$), Operator, Action Reflector, and Notetaker. The Manager is responsible for high-level planning, decomposing complex tasks into subgoals. The Perceptor handles fine-grained visual perception, using OCR, icon grounding, and icon captioning models to generate a list of texts and icons with corresponding coordinates, $W_V^t$, from the screenshot $s_t$. The Operator executes immediate actions. The Action Reflector verifies action outcomes, and the Notetaker aggregates information. This hierarchical structure facilitates improved long-term planning and error recovery.
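
A minimal sketch of how one task episode could flow through these roles, assuming hypothetical `manager`, `perceptor`, `operator`, `reflector`, and `notetaker` callables and an `env` device interface (all names are illustrative, not the authors' implementation):

```python
# Illustrative sketch of one Mobile-Agent-E task episode (names are hypothetical).
def run_task(task_query, env, manager, perceptor, operator, reflector, notetaker,
             tips, shortcuts, max_steps=40):
    plan = subgoal = progress = notes = None
    actions, errors = [], []               # action history W_A and error history W_E
    screenshot = env.screenshot()          # initial phone state s_0
    for t in range(max_steps):
        # Manager: high-level planning; replans with recent errors on escalation.
        plan, subgoal = manager(task_query, screenshot, plan, subgoal,
                                progress, notes, shortcuts, errors)
        if subgoal is None:                # hypothetical convention: task finished
            break
        # Perceptor: fine-grained visual perception W_V^t of the current screen.
        percepts = perceptor(screenshot)
        # Operator: choose the next atomic operation (or Shortcut) and execute it.
        action = operator(task_query, screenshot, percepts, plan, subgoal,
                          progress, notes, actions[-5:], errors[-5:],
                          shortcuts, tips)
        next_screenshot = env.execute(action)
        # Action Reflector: verify the outcome against the post-action screen.
        next_percepts = perceptor(next_screenshot)
        outcome, progress = reflector(task_query, screenshot, percepts,
                                      next_screenshot, next_percepts,
                                      action, subgoal, progress)
        (actions if outcome["status"] == "success" else errors).append(outcome)
        # Notetaker: aggregate task-relevant information for later subgoals.
        notes = notetaker(task_query, next_screenshot, next_percepts,
                          plan, subgoal, progress, notes)
        screenshot = next_screenshot
    return progress, actions, errors, notes
```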

The framework incorporates a self-evolution module that maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidelines derived from previous interactions, akin to lessons encoded in episodic memory. Shortcuts are executable sequences of atomic operations tailored for specific subroutines, analogous to procedural knowledge. After each task, Experience Reflectors ($\mathcal{A}_{ES}$ for Shortcuts and $\mathcal{A}_{ET}$ for Tips) update these components based on the interaction history. This allows the Manager and Operator to refine their planning and action decisions over time.
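
A rough illustration of how this persistent memory could be structured; the field names below are assumptions for the sketch, not the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class Shortcut:
    """A reusable, executable sequence of atomic operations for a subroutine."""
    name: str                      # e.g. "search_in_app" (hypothetical example)
    arguments: list[str]           # parameters the subroutine expects
    precondition: str              # when it is valid to invoke this Shortcut
    atomic_operations: list[dict]  # ordered atomic actions (tap, type, swipe, ...)

@dataclass
class LongTermMemory:
    """Persistent memory updated by the Experience Reflectors after each task."""
    tips: list[str] = field(default_factory=list)                 # L_T: general guidance
    shortcuts: dict[str, Shortcut] = field(default_factory=dict)  # L_S: named subroutines
```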

The paper introduces Mobile-Eval-E, a new benchmark designed to evaluate complex, real-world mobile tasks. It includes more than twice the number of expected operations per task compared to existing benchmarks. The Satisfaction Score (SS) metric is introduced to address the lack of binary success flags in real-world tasks, computed from human-written rubrics that account for milestone completion and exploratory behaviors. The Satisfaction Score vs Steps (SSS) curve is proposed to evaluate the efficiency of mobile agents.
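
As a hedged sketch, a rubric-based Satisfaction Score can be read as the fraction of rubric items a trajectory fulfills; the equal weighting below is an assumption for illustration:

```python
def satisfaction_score(rubric_items, satisfied):
    """Fraction of human-written rubric items the trajectory fulfills.

    rubric_items: list of rubric strings (milestones and exploratory behaviors).
    satisfied:    set of rubric items judged as met for this trajectory.
    """
    if not rubric_items:
        return 0.0
    return sum(item in satisfied for item in rubric_items) / len(rubric_items)

# Example: 3 of 4 rubric items met gives SS = 0.75; plotting SS against the
# number of steps taken yields the Satisfaction Score vs Steps (SSS) curve.
```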

The Manager's planning process is mathematically represented as:

$$W_P^t, W_S^t = \mathcal{A}_M(I, s_t, W_P^{t-1}, W_S^{t-1}, W_G^{t-1}, W_N^{t-1}, L_S) \quad \text{if } t \geq 0 \text{ and } W_{EF}^{t-1} = \text{False}$$

$$W_P^t, W_S^t = \mathcal{A}_M(I, s_t, W_P^{t-1}, W_S^{t-1}, W_G^{t-1}, W_N^{t-1}, L_S, \mathbf{W_E}[-k:]) \quad \text{if } t \geq k \text{ and } W_{EF}^{t-1} = \text{True}$$

Where:

  • $W_P^t$ is the overall plan at time $t$
  • $W_S^t$ is the current subgoal at time $t$
  • $\mathcal{A}_M$ is the Manager agent
  • $I$ is the input task query
  • $s_t$ is the phone state (screenshot) at time $t$
  • $W_P^{t-1}$ is the previous overall plan
  • $W_S^{t-1}$ is the previous subgoal
  • $W_G^{t-1}$ is the progress status at time $t-1$
  • $W_N^{t-1}$ are the important notes at time $t-1$
  • $L_S$ are the Shortcuts
  • $W_{EF}^{t-1}$ is the Error Escalation Flag at time $t-1$
  • $\mathbf{W_E}[-k:]$ represents the $k$ most recent errors
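
In code, the two cases above reduce to including the recent error history only when the Error Escalation Flag was raised on the previous step; a minimal sketch with an injected, hypothetical `manager_lmm` callable:

```python
def manager_step(manager_lmm, task_query, screenshot, prev_plan, prev_subgoal,
                 prev_progress, prev_notes, shortcuts, error_history,
                 error_escalated, t, k=2):
    """Return (W_P^t, W_S^t); include recent errors only on error escalation."""
    if error_escalated and t >= k:
        # W_EF^{t-1} == True: also show the Manager the last k errors, W_E[-k:].
        return manager_lmm(task_query, screenshot, prev_plan, prev_subgoal,
                           prev_progress, prev_notes, shortcuts,
                           recent_errors=error_history[-k:])
    # W_EF^{t-1} == False: plan from the current state and prior context only.
    return manager_lmm(task_query, screenshot, prev_plan, prev_subgoal,
                       prev_progress, prev_notes, shortcuts, recent_errors=None)
```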

The Operator's action decision is formalized as:

$$a_t = \mathcal{A}_O(I, s_t, W_V^t, W_P^t, W_S^t, W_G^t, W_N^t, \mathbf{W_A}[-m:], \mathbf{W_E}[-m:], L_S, L_T)$$

Where:

  • $a_t$ is the action at time $t$
  • $\mathcal{A}_O$ is the Operator agent
  • $I$ is the input task query
  • $s_t$ is the phone state (screenshot) at time $t$
  • $W_V^t$ is the visual perception result at time $t$
  • $W_P^t$ is the overall plan at time $t$
  • $W_S^t$ is the current subgoal at time $t$
  • $W_G^t$ is the progress status at time $t$
  • $W_N^t$ are the important notes at time $t$
  • $\mathbf{W_A}[-m:]$ is a history of the latest $m$ actions
  • $\mathbf{W_E}[-m:]$ is a history of the latest $m$ errors
  • $L_S$ are the Shortcuts
  • $L_T$ are the Tips
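
Because the Operator can emit either an atomic operation or a stored Shortcut, executing its decision amounts to expanding Shortcuts into their underlying operation sequences; a sketch reusing the illustrative `Shortcut` structure above (the action schema is an assumption):

```python
def execute_action(env, action, shortcuts):
    """Execute an atomic operation or a named Shortcut (illustrative schema)."""
    if action["type"] == "shortcut":
        # Expand the Shortcut into its stored sequence of atomic operations.
        for op in shortcuts[action["name"]].atomic_operations:
            env.execute(op)
    else:
        env.execute(action)   # single atomic operation (tap, type, swipe, ...)
    return env.screenshot()   # post-action state s_{t+1} for the Action Reflector
```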

The Perceptor's visual perception process is:

$$W_V^t = \mathcal{A}_P(s_t)$$

Where:

  • $W_V^t$ is the visual perception result at time $t$
  • $\mathcal{A}_P$ is the Perceptor agent
  • $s_t$ is the phone state (screenshot) at time $t$
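
A schematic of that perception step, assuming generic OCR, icon-grounding, and icon-captioning components; the component interfaces are placeholders, not the paper's implementation:

```python
def perceive(screenshot, ocr_model, icon_detector, icon_captioner):
    """Return W_V^t: texts and icons with screen coordinates (illustrative)."""
    percepts = []
    for text, box in ocr_model(screenshot):        # OCR: on-screen text regions
        percepts.append({"kind": "text", "content": text, "box": box})
    for box in icon_detector(screenshot):          # icon grounding: locate icons
        caption = icon_captioner(screenshot, box)  # icon captioning: describe them
        percepts.append({"kind": "icon", "content": caption, "box": box})
    return percepts
```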

The Action Reflector's outcome verification and update process is:

$$W_V^{t+1} = \mathcal{A}_P(s_{t+1})$$

$$\mathbf{W_A}[t], \mathbf{W_E}[t], W_G^t = \mathcal{A}_R(I, s_t, W_V^t, s_{t+1}, W_V^{t+1}, a_t, W_S^t, W_G^{t-1})$$

Where:

  • $W_V^{t+1}$ is the visual perception result at time $t+1$
  • $\mathcal{A}_P$ is the Perceptor agent
  • $s_{t+1}$ is the phone state (screenshot) at time $t+1$
  • $\mathbf{W_A}[t]$ is the updated action history at time $t$
  • $\mathbf{W_E}[t]$ is the updated error history at time $t$
  • $W_G^t$ is the progress status at time $t$
  • $\mathcal{A}_R$ is the Action Reflector agent
  • $I$ is the input task query
  • $s_t$ is the phone state (screenshot) at time $t$
  • $W_V^t$ is the visual perception result at time $t$
  • $a_t$ is the action at time $t$
  • $W_S^t$ is the current subgoal at time $t$
  • $W_G^{t-1}$ is the previous progress status
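
Conceptually, the Action Reflector compares perception before and after the action and routes the result into either the action history or the error history; a minimal sketch (the outcome schema is an assumption):

```python
def reflect(reflector_lmm, task_query, screenshot, percepts, next_screenshot,
            next_percepts, action, subgoal, prev_progress,
            action_history, error_history):
    """Classify the action outcome and update W_A, W_E, and W_G (illustrative)."""
    outcome, progress = reflector_lmm(task_query, screenshot, percepts,
                                      next_screenshot, next_percepts,
                                      action, subgoal, prev_progress)
    if outcome["status"] == "success":
        action_history.append({"action": action, "outcome": outcome})
    else:
        # Failed or ineffective actions are recorded so the Operator (and, on
        # escalation, the Manager) can avoid repeating the same mistake.
        error_history.append({"action": action, "outcome": outcome})
    return progress
```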

The Notetaker's information aggregation is:

$$W_N^t = \mathcal{A}_N(I, s_{t+1}, W_V^{t+1}, W_P^t, W_S^t, W_G^t, W_N^{t-1})$$

Where:

  • $W_N^t$ are the important notes at time $t$
  • $\mathcal{A}_N$ is the Notetaker agent
  • $I$ is the input task query
  • $s_{t+1}$ is the phone state (screenshot) at time $t+1$
  • $W_V^{t+1}$ is the visual perception result at time $t+1$
  • $W_P^t$ is the overall plan at time $t$
  • $W_S^t$ is the current subgoal at time $t$
  • $W_G^t$ is the progress status at time $t$
  • $W_N^{t-1}$ are the important notes at time $t-1$
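
A thin sketch of the Notetaker call; the paper specifies only its inputs and output, so the comments below are illustrative:

```python
def take_notes(notetaker_lmm, task_query, next_screenshot, next_percepts,
               plan, subgoal, progress, prev_notes):
    """Return W_N^t: task-relevant information aggregated so far (illustrative)."""
    # Typical content might be copied text, prices, or confirmation details that
    # appear on the new screen and that later subgoals will need.
    return notetaker_lmm(task_query, next_screenshot, next_percepts,
                         plan, subgoal, progress, prev_notes)
```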

The Experience Reflector's update of Tips and Shortcuts is:

$$L_T = \mathcal{A}_{ET}(I, W_P^\tau, W_G^\tau, \mathbf{W_A}, \mathbf{W_E}, T_F, L_T)$$

$$L_S = \mathcal{A}_{ES}(I, W_P^\tau, W_G^\tau, \mathbf{W_A}, \mathbf{W_E}, T_F, L_S)$$

Where:

  • $L_T$ are the Tips
  • $\mathcal{A}_{ET}$ is the Experience Reflector for Tips
  • $L_S$ are the Shortcuts
  • $\mathcal{A}_{ES}$ is the Experience Reflector for Shortcuts
  • $I$ is the input task query
  • $W_P^\tau$ is the final overall plan
  • $W_G^\tau$ is the final progress status
  • $\mathbf{W_A}$ is the action history
  • $\mathbf{W_E}$ is the error history
  • $T_F$ is a list of future tasks
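
A hedged sketch of this post-task update, reusing the illustrative `LongTermMemory` structure above (function names are hypothetical):

```python
def evolve_memory(tip_reflector, shortcut_reflector, task_query, final_plan,
                  final_progress, action_history, error_history, future_tasks,
                  memory):
    """Update persistent Tips (L_T) and Shortcuts (L_S) after a task (sketch)."""
    memory.tips = tip_reflector(task_query, final_plan, final_progress,
                                action_history, error_history, future_tasks,
                                memory.tips)
    memory.shortcuts = shortcut_reflector(task_query, final_plan, final_progress,
                                          action_history, error_history,
                                          future_tasks, memory.shortcuts)
    return memory  # carried over to subsequent tasks, enabling self-evolution
```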

Empirical results demonstrate that Mobile-Agent-E achieves a 22.1% average absolute gain over previous state-of-the-art approaches across three different foundation model backbones, including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro. Enabling self-evolution (Mobile-Agent-E + Evo) results in a 6.5% absolute improvement compared to no evolution, along with a reduction in computational overhead due to the incorporation of Shortcuts. The progressive impact of self-evolution is shown through the increased benefits observed in later tasks. The use of evolved Tips also contributes distinct benefits to the model's performance.
