- The paper introduces Mobile-Agent-E, a hierarchical framework that improves mobile agent reasoning and task planning through self-evolution.
- It employs subordinate agents—Perceptor, Operator, Action Reflector, and Notetaker—to decompose tasks and update strategies via persistent memory.
- Mobile-Agent-E outperforms prior state-of-the-art approaches by a 22.1% absolute gain, and enabling self-evolution adds a further 6.5% absolute improvement while reducing computational overhead.
The paper introduces Mobile-Agent-E, a hierarchical multi-agent framework designed to enhance the performance and efficiency of mobile agents in complex, real-world tasks. The key innovation lies in its capacity for self-evolution through the utilization of past experiences, addressing limitations of existing mobile agents that struggle with reasoning-intensive, long-horizon tasks and lack mechanisms for continuous learning.
The Mobile-Agent-E framework consists of a Manager and four subordinate agents: the Perceptor ($\mathcal{A}_P$), Operator ($\mathcal{A}_O$), Action Reflector ($\mathcal{A}_R$), and Notetaker ($\mathcal{A}_N$). The Manager ($\mathcal{A}_M$) is responsible for high-level planning, decomposing complex tasks into subgoals. The Perceptor handles fine-grained visual perception, using OCR, icon grounding, and icon captioning models to produce a list of texts and icons with corresponding coordinates, $W_V^t$, from the screenshot $s_t$. The Operator decides and executes immediate actions, the Action Reflector verifies action outcomes, and the Notetaker aggregates important information. This hierarchical structure facilitates improved long-term planning and error recovery.
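The division of labor above can be pictured as a single control loop: the Manager replans at the subgoal level while the subordinate agents perceive, act, verify, and take notes at each step. The sketch below is illustrative only; the agent callables, the `execute_action` device hook, and the `WorkingMemory` fields are hypothetical stand-ins, not the paper's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical per-step working memory; field names mirror the paper's symbols
# (plan W_P, subgoal W_S, progress W_G, notes W_N) but are not the authors' API.
@dataclass
class WorkingMemory:
    plan: str = ""        # W_P: overall plan
    subgoal: str = ""     # W_S: current subgoal
    progress: str = ""    # W_G: progress status
    notes: str = ""       # W_N: important notes
    actions: list = field(default_factory=list)  # W_A: action history
    errors: list = field(default_factory=list)   # W_E: error history

def run_task(task: str, screenshot: Any,
             manager: Callable, perceptor: Callable, operator: Callable,
             reflector: Callable, notetaker: Callable,
             execute_action: Callable,          # hypothetical device-control hook
             shortcuts: list, tips: list, max_steps: int = 30) -> WorkingMemory:
    """One task episode of the hierarchical loop (control flow only)."""
    mem = WorkingMemory()
    for _ in range(max_steps):
        perception = perceptor(screenshot)                       # W_V^t
        mem.plan, mem.subgoal = manager(task, screenshot, mem, shortcuts)
        action = operator(task, screenshot, perception, mem, shortcuts, tips)
        next_screenshot = execute_action(action)                 # s_{t+1}
        next_perception = perceptor(next_screenshot)             # W_V^{t+1}
        outcome, mem.progress = reflector(task, screenshot, perception,
                                          next_screenshot, next_perception,
                                          action, mem)
        (mem.actions if outcome == "success" else mem.errors).append(action)
        mem.notes = notetaker(task, next_screenshot, next_perception, mem)
        screenshot = next_screenshot
        if mem.progress == "done":
            break
    return mem
```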
The framework incorporates a self-evolution module that maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidelines derived from previous interactions, akin to lessons encoded in episodic memory. Shortcuts are executable sequences of atomic operations tailored to specific recurring subroutines, analogous to procedural knowledge. After each task, two Experience Reflectors, $\mathcal{A}_{ES}$ for Shortcuts and $\mathcal{A}_{ET}$ for Tips, update these components based on the interaction history. This allows the Manager and Operator to refine their planning and action decisions over time.
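Conceptually, the long-term memory is two typed stores: free-form Tips and structured Shortcuts, where each Shortcut names a reusable subroutine built from atomic operations with a precondition on when it may be invoked. The dataclasses below are a minimal sketch under that reading; the field names and the example entries are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Shortcut:
    """A reusable subroutine composed of atomic operations (procedural knowledge)."""
    name: str
    arguments: list[str]      # parameters the subroutine expects
    precondition: str         # when it is safe to invoke
    atomic_ops: list[dict]    # ordered atomic operations to execute

@dataclass
class LongTermMemory:
    tips: list[str] = field(default_factory=list)            # L_T: general guidelines
    shortcuts: list[Shortcut] = field(default_factory=list)  # L_S: subroutines

# Illustrative entries only (not from the paper):
memory = LongTermMemory(
    tips=["Scroll down before concluding that an item is missing."],
    shortcuts=[Shortcut(
        name="search_in_app",
        arguments=["query"],
        precondition="A search bar is visible on the current screen.",
        atomic_ops=[{"op": "tap", "target": "search_bar"},
                    {"op": "type", "text": "{query}"},
                    {"op": "enter"}],
    )],
)
```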
The paper introduces Mobile-Eval-E, a new benchmark designed to evaluate complex, real-world mobile tasks. It includes more than twice the number of expected operations per task compared to existing benchmarks. The Satisfaction Score (SS) metric is introduced to address the lack of binary success flags in real-world tasks, computed from human-written rubrics that account for milestone completion and exploratory behaviors. The Satisfaction Score vs Steps (SSS) curve is proposed to evaluate the efficiency of mobile agents.
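Because real-world tasks rarely expose a clean binary success flag, the Satisfaction Score can be read as the fraction of rubric items (milestones plus exploratory behaviors) the agent has satisfied, and evaluating it cumulatively per step yields the SSS curve. The helper below is a hedged sketch of that reading, not the benchmark's official scoring script; the rubric representation is an assumption.

```python
def satisfaction_score(rubric: list[str], satisfied: set[str]) -> float:
    """Fraction of rubric items satisfied (milestones + exploratory behaviors)."""
    return len(satisfied & set(rubric)) / len(rubric) if rubric else 0.0

def sss_curve(rubric: list[str], satisfied_per_step: list[set[str]]) -> list[float]:
    """Satisfaction Score vs. Steps: cumulative score after each step."""
    seen: set[str] = set()
    curve = []
    for step_items in satisfied_per_step:
        seen |= step_items
        curve.append(satisfaction_score(rubric, seen))
    return curve

# Hypothetical rubric and trajectory for illustration:
rubric = ["opened_maps", "searched_restaurant", "checked_reviews", "saved_note"]
trajectory = [{"opened_maps"}, {"searched_restaurant"}, set(), {"checked_reviews"}]
print(sss_curve(rubric, trajectory))   # [0.25, 0.5, 0.5, 0.75]
```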
The Manager's planning process is mathematically represented as:
$$W_P^t, W_S^t = \mathcal{A}_M(I, s_t, W_P^{t-1}, W_S^{t-1}, W_G^{t-1}, W_N^{t-1}, L_S) \quad \text{if } t \geq 0 \text{ and } W_{EF}^{t-1} = \text{False}$$

$$W_P^t, W_S^t = \mathcal{A}_M(I, s_t, W_P^{t-1}, W_S^{t-1}, W_G^{t-1}, W_N^{t-1}, L_S, W_E[-k:]) \quad \text{if } t \geq k \text{ and } W_{EF}^{t-1} = \text{True}$$
Where:
- $W_P^t$ is the overall plan at time $t$
- $W_S^t$ is the current subgoal at time $t$
- $\mathcal{A}_M$ is the Manager agent
- $I$ is the input task query
- $s_t$ is the phone state (screenshot) at time $t$
- $W_P^{t-1}$ is the previous overall plan
- $W_S^{t-1}$ is the previous subgoal
- $W_G^{t-1}$ is the progress status at time $t-1$
- $W_N^{t-1}$ are the important notes at time $t-1$
- $L_S$ are the Shortcuts
- $W_{EF}^{t-1}$ is the Error Escalation Flag at time $t-1$
- $W_E[-k:]$ represents the $k$ most recent errors
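In words: the Manager normally replans from the previous plan, subgoal, progress, notes, and the available Shortcuts, but once the Error Escalation Flag is raised it additionally receives the last $k$ errors so it can revise the plan rather than retry blindly. A minimal sketch of that branch, using a hypothetical `call_manager` LLM wrapper, might look like this:

```python
def manager_step(call_manager, task, screenshot, prev_plan, prev_subgoal,
                 prev_progress, prev_notes, shortcuts, errors,
                 error_escalated: bool, k: int = 3):
    """Return (plan, subgoal); include recent errors only when escalation is flagged."""
    context = {
        "task": task, "screenshot": screenshot,
        "prev_plan": prev_plan, "prev_subgoal": prev_subgoal,
        "prev_progress": prev_progress, "prev_notes": prev_notes,
        "shortcuts": shortcuts,
    }
    if error_escalated:
        # W_E[-k:] -- the k most recent errors are escalated to the Manager.
        context["recent_errors"] = errors[-k:]
    return call_manager(**context)   # hypothetical LLM call returning (W_P^t, W_S^t)
```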
The Operator's action decision is formalized as:
$$a_t = \mathcal{A}_O(I, s_t, W_V^t, W_P^t, W_S^t, W_G^t, W_N^t, W_A[-m:], W_E[-m:], L_S, L_T)$$
Where:
- $a_t$ is the action at time $t$
- $\mathcal{A}_O$ is the Operator agent
- $I$ is the input task query
- $s_t$ is the phone state (screenshot) at time $t$
- $W_V^t$ is the visual perception result at time $t$
- $W_P^t$ is the overall plan at time $t$
- $W_S^t$ is the current subgoal at time $t$
- $W_G^t$ is the progress status at time $t$
- $W_N^t$ are the important notes at time $t$
- $W_A[-m:]$ is the history of the latest $m$ actions
- $W_E[-m:]$ is the history of the latest $m$ errors
- $L_S$ are the Shortcuts
- $L_T$ are the Tips
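The Operator thus conditions on the current perception and plan plus bounded histories of the latest $m$ actions and errors, along with both long-term memory stores. The sketch below only shows how those inputs could be assembled around a hypothetical `call_operator` wrapper:

```python
def operator_step(call_operator, task, screenshot, perception, plan, subgoal,
                  progress, notes, actions, errors, shortcuts, tips, m: int = 5):
    """Return the next action a_t from the Operator (illustrative assembly only)."""
    return call_operator(
        task=task, screenshot=screenshot, perception=perception,  # I, s_t, W_V^t
        plan=plan, subgoal=subgoal, progress=progress, notes=notes,
        recent_actions=actions[-m:],   # W_A[-m:]
        recent_errors=errors[-m:],     # W_E[-m:]
        shortcuts=shortcuts, tips=tips,  # L_S, L_T
    )
```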
The Perceptor's visual perception process is:
$$W_V^t = \mathcal{A}_P(s_t)$$
Where:
- $W_V^t$ is the visual perception result at time $t$
- $\mathcal{A}_P$ is the Perceptor agent
- $s_t$ is the phone state (screenshot) at time $t$
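Concretely, the Perceptor fuses three vision components, OCR, icon grounding, and icon captioning, into a single list of screen elements with coordinates. The pipeline below is a sketch that takes the three models as callables; their signatures and the output schema are assumptions, not the paper's exact interfaces.

```python
from typing import Any, Callable

def perceive(screenshot: Any,
             run_ocr: Callable,        # assumed: screenshot -> [(text, (x, y)), ...]
             ground_icons: Callable,   # assumed: screenshot -> [(crop, (x, y)), ...]
             caption_icon: Callable) -> list[dict]:
    """Build W_V^t: texts and icons with their screen coordinates."""
    elements = []
    for text, coords in run_ocr(screenshot):
        elements.append({"type": "text", "content": text, "coords": coords})
    for crop, coords in ground_icons(screenshot):
        elements.append({"type": "icon", "content": caption_icon(crop),
                         "coords": coords})
    return elements
```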
The Action Reflector's outcome verification and update process is:
$$W_V^{t+1} = \mathcal{A}_P(s_{t+1})$$

$$W_A[t], W_E[t], W_G^t = \mathcal{A}_R(I, s_t, W_V^t, s_{t+1}, W_V^{t+1}, a_t, W_S^t, W_G^{t-1})$$
Where:
- $W_V^{t+1}$ is the visual perception result at time $t+1$
- $\mathcal{A}_P$ is the Perceptor agent
- $s_{t+1}$ is the phone state (screenshot) at time $t+1$
- $W_A[t]$ is the updated action history at time $t$
- $W_E[t]$ is the updated error history at time $t$
- $W_G^t$ is the progress status at time $t$
- $\mathcal{A}_R$ is the Action Reflector agent
- $I$ is the input task query
- $s_t$ is the phone state (screenshot) at time $t$
- $W_V^t$ is the visual perception result at time $t$
- $a_t$ is the action at time $t$
- $W_S^t$ is the current subgoal at time $t$
- $W_G^{t-1}$ is the previous progress status
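Operationally, the Action Reflector compares the screen before and after an action against the current subgoal, routes the action into either the action history (on success) or the error history (on failure), and refreshes the progress status. A hedged sketch with a hypothetical `call_reflector` judge:

```python
def reflect_step(call_reflector, task, before, before_perception,
                 after, after_perception, action, subgoal, prev_progress,
                 action_history: list, error_history: list):
    """Classify the outcome of `action` and update the two histories in place."""
    outcome, feedback, progress = call_reflector(
        task=task, before=before, before_perception=before_perception,
        after=after, after_perception=after_perception,
        action=action, subgoal=subgoal, prev_progress=prev_progress,
    )
    if outcome == "success":
        action_history.append({"action": action, "result": feedback})   # W_A[t]
    else:
        error_history.append({"action": action, "error": feedback})     # W_E[t]
    return progress   # W_G^t
```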
The Notetaker's information aggregation is:
$$W_N^t = \mathcal{A}_N(I, s_{t+1}, W_V^{t+1}, W_P^t, W_S^t, W_G^t, W_N^{t-1})$$
Where:
- $W_N^t$ are the important notes at time $t$
- $\mathcal{A}_N$ is the Notetaker agent
- $I$ is the input task query
- $s_{t+1}$ is the phone state (screenshot) at time $t+1$
- $W_V^{t+1}$ is the visual perception result at time $t+1$
- $W_P^t$ is the overall plan at time $t$
- $W_S^t$ is the current subgoal at time $t$
- $W_G^t$ is the progress status at time $t$
- $W_N^{t-1}$ are the important notes at time $t-1$
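The Notetaker's job is to carry forward task-relevant content spotted on the new screen (e.g., names, prices, confirmation numbers) so later steps need not revisit earlier pages. The sketch below assumes an append-style update and a hypothetical `call_notetaker` wrapper; both are illustrative choices, not the paper's prompt design.

```python
def note_step(call_notetaker, task, next_screenshot, next_perception,
              plan, subgoal, progress, prev_notes: str) -> str:
    """Return updated notes W_N^t by aggregating onto the previous notes."""
    new_info = call_notetaker(
        task=task, screenshot=next_screenshot, perception=next_perception,
        plan=plan, subgoal=subgoal, progress=progress, prev_notes=prev_notes,
    )
    # Append rather than overwrite so nothing recorded earlier is lost.
    return prev_notes + ("\n" + new_info if new_info else "")
```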
The Experience Reflector's update of Tips and Shortcuts is:
$$L_T = \mathcal{A}_{ET}(I, W_P^\tau, W_G^\tau, W_A, W_E, T_F, L_T)$$

$$L_S = \mathcal{A}_{ES}(I, W_P^\tau, W_G^\tau, W_A, W_E, T_F, L_S)$$
Where:
- $L_T$ are the Tips
- $\mathcal{A}_{ET}$ is the Experience Reflector for Tips
- $L_S$ are the Shortcuts
- $\mathcal{A}_{ES}$ is the Experience Reflector for Shortcuts
- $I$ is the input task query
- $W_P^\tau$ is the final overall plan
- $W_G^\tau$ is the final progress status
- $W_A$ is the full action history
- $W_E$ is the full error history
- $T_F$ is a list of future tasks
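After an episode ends, the two Experience Reflectors review the final plan and progress, the full action and error histories, and the list of upcoming tasks, then emit updated Tips and Shortcuts that overwrite the long-term memory. The sketch below wires this up with hypothetical `call_tip_reflector` and `call_shortcut_reflector` LLM calls:

```python
def evolve_memory(call_tip_reflector, call_shortcut_reflector,
                  task, final_plan, final_progress,
                  action_history, error_history, future_tasks,
                  tips: list, shortcuts: list) -> tuple[list, list]:
    """Return updated (L_T, L_S) after a completed task episode."""
    episode = {
        "task": task, "final_plan": final_plan, "final_progress": final_progress,
        "actions": action_history, "errors": error_history,
        "future_tasks": future_tasks,
    }
    new_tips = call_tip_reflector(**episode, tips=tips)                      # A_ET
    new_shortcuts = call_shortcut_reflector(**episode, shortcuts=shortcuts)  # A_ES
    return new_tips, new_shortcuts
```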
Empirical results demonstrate that Mobile-Agent-E achieves a 22.1% average absolute gain over previous state-of-the-art approaches across three foundation model backbones: GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro. Enabling self-evolution (Mobile-Agent-E + Evo) yields a further 6.5% absolute improvement over the non-evolving variant, along with reduced computational overhead due to the use of Shortcuts. The benefits of self-evolution grow progressively, with larger gains observed on later tasks, and the evolved Tips contribute distinct additional improvements.