MAPLE: Mobile GUI Task Reasoning Framework
- MAPLE is a multi-agent framework that uses persistent finite state machines to model mobile GUI navigation and task automation.
- The system decomposes tasks into planning, execution, error recovery, and knowledge retention phases to ensure robust performance across apps.
- Empirical evaluations show gains over the Mobile-Agent-E + Evo baseline of up to 12 percentage points in success rate, 6.59 in action accuracy, and 13.81 in recovery success.
MAPLE (Mobile Agent with Persistent Finite State Machines for Structured Task Reasoning) is a state-aware, multi-agent framework that enables autonomous completion of user-instructed tasks across mobile GUI environments. MAPLE abstracts app interactions as a dynamically constructed finite state machine (FSM), computationally modeling each UI screen as a distinct state and user actions as transitions. This facilitates structured representation of app navigation, robust error detection and recovery, and knowledge retention, supporting complex, cross-application mobile task automation (Guo et al., 29 May 2025).
1. System Design and Modular Architecture
MAPLE operates as a modular, multi-agent system layered over a physical or emulated Android device, controlled via the Android Debug Bridge (ADB). The core architecture is partitioned into four interdependent phases: Planning, Execution, Verification & Error Recovery, and Knowledge Retention. Each phase is managed by specialized agents that communicate through prompts to a Multimodal LLM (MLLM). The Actor Agent executes low-level, atomic GUI operations such as Tap, Type, and Swipe via the mobile API.
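As a rough illustration of how such atomic operations might reach the device, the sketch below builds `adb shell input` command lines for Tap, Swipe, and Type. The function names and the `run` helper are hypothetical and are not taken from MAPLE; only the `adb shell input tap|swipe|text` subcommands are standard Android tooling.

```python
import subprocess
from typing import List


def tap_cmd(x: int, y: int) -> List[str]:
    """Build the ADB command for a tap at pixel coordinates (x, y)."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]


def swipe_cmd(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> List[str]:
    """Build the ADB command for a swipe from (x1, y1) to (x2, y2)."""
    return ["adb", "shell", "input", "swipe",
            str(x1), str(y1), str(x2), str(y2), str(duration_ms)]


def type_cmd(text: str) -> List[str]:
    """Build the ADB command for typing text into the focused field.

    `input text` does not accept literal spaces, so they are encoded as %s.
    Other shell metacharacters would need extra escaping (not handled here).
    """
    return ["adb", "shell", "input", "text", text.replace(" ", "%s")]


def run(cmd: List[str]) -> None:
    """Execute a command against the attached device; raises if adb fails."""
    subprocess.run(cmd, check=True)
```

Building the argument list separately from executing it keeps the action mapping testable without a connected device.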
Phases and Responsible Agents:
- Planning: The Planner Agent receives a user instruction (and, optionally, reusable knowledge from earlier tasks) and produces a multi-step plan in which each step pairs a subgoal with its rationale. A two-stage workflow first generates candidate plans, then has the MLLM score them and select one.
- Execution: Three agents collaborate. The Screen Parser captures screenshots and runs OCR (DBNet + ConvNextViT-document), icon grounding (GroundingDINO), and icon captioning (Qwen-VL-Plus) to emit perception data. The State Agent builds and maintains the FSM, prompting the MLLM for state summaries, predicted next screens, pre-conditions, and post-conditions. The Actor Agent selects and executes UI actions.
- Verification & Error Recovery: The Reflection Agent compares actual vs. predicted states post-action, classifies outcomes, and triggers rollback or replanning logic upon failure.
- Knowledge Retention: The Mentor Agent analyzes action histories and FSMs after each task, distilling reusable guidance and action sequences into the persistent knowledge base.
Orchestration is handled by a lightweight controller that alternates among high-level planning, execution, state-tracking, verification, and recovery.
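One way to picture such a controller is the loop below: plan once, then act, verify, and recover per subgoal, and finally hand the history to a retention step. This is a minimal sketch under stated assumptions, not MAPLE's controller; the function `run_task`, its callable parameters, and the `max_retries` limit are all hypothetical stand-ins for the agents described above.

```python
from typing import Callable, List, Tuple


def run_task(
    instruction: str,
    plan: Callable[[str], List[str]],                  # stand-in for the Planner Agent
    act: Callable[[str], str],                         # stand-in for the Actor Agent
    verify: Callable[[str, str], bool],                # stand-in for the Reflection Agent
    recover: Callable[[str], None],                    # rollback or replanning hook
    retain: Callable[[List[Tuple[str, str]]], None],   # stand-in for the Mentor Agent
    max_retries: int = 2,                              # assumed retry budget per subgoal
) -> bool:
    """Return True if every subgoal was verified, False if one exhausted its retries."""
    history: List[Tuple[str, str]] = []
    success = True
    for subgoal in plan(instruction):
        for attempt in range(max_retries + 1):
            observed = act(subgoal)
            history.append((subgoal, observed))
            if verify(subgoal, observed):
                break
            if attempt < max_retries:
                recover(subgoal)
        else:
            # No attempt passed verification: stop the task early.
            success = False
            break
    retain(history)
    return success
```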
2. Finite State Machine Formalism
MAPLE's FSM provides a structured, annotated representation of mobile app navigation in real time:
$$\mathcal{M} = (S, A, T, s_0, G)$$
- $S$: Set of discovered UI states, with each $s_i \in S$ corresponding to a screen.
- $A$: Finite set of GUI actions (e.g., tap, type, swipe).
- $T \subseteq S \times A \times S$: Transition relation; $(s_i, a, s_j) \in T$ denotes that action $a$ in $s_i$ leads to $s_j$, annotated with pre/post-conditions.
- $s_0 \in S$: Initial state, such as the home screen.
- $G \subseteq S$: Set of goal states marking subtask completions.
Each transition record includes:
- Natural-language description of the current screen.
- Prediction of the expected next screen.
- Pre-condition.
- Post-condition.
The FSM is incrementally constructed: upon each screen transition, the MLLM generates descriptions and conditions, after which the State Agent updates the FSM.
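The formalism above maps naturally onto a small graph data structure. The following is a minimal sketch, assuming states are keyed by string ids and conditions are stored as free text; the class and field names (`AppFSM`, `Transition`, and so on) are illustrative and do not come from the MAPLE paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Transition:
    source: str          # id of the state the action starts from
    action: str          # e.g. "tap", "type", "swipe"
    target: str          # id of the state the action is expected to reach
    pre_condition: str   # what must hold before the action
    post_condition: str  # what should hold after the action


@dataclass
class AppFSM:
    initial: str                                            # id of the initial state
    states: Dict[str, str] = field(default_factory=dict)    # state id -> description
    transitions: List[Transition] = field(default_factory=list)
    goals: Set[str] = field(default_factory=set)            # ids of goal states

    def add_state(self, state_id: str, description: str) -> None:
        # Keep the first description recorded for a given state id.
        self.states.setdefault(state_id, description)

    def add_transition(self, t: Transition) -> None:
        if t.source not in self.states or t.target not in self.states:
            raise KeyError("both endpoints must be registered states")
        self.transitions.append(t)

    def successors(self, state_id: str) -> List[Transition]:
        """All recorded transitions leaving the given state."""
        return [t for t in self.transitions if t.source == state_id]


fsm = AppFSM(initial="home")
fsm.add_state("home", "Android home screen")
fsm.add_state("settings", "Settings main page")
fsm.add_transition(Transition("home", "tap", "settings",
                              "Settings icon is visible", "Settings title is shown"))
```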
3. Specialized Agent Functions and Workflows
The system relies on six agent types, each with dedicated responsibilities and dependencies on the others.
| Agent | Inputs / Outputs | Roles and Workflow Highlights |
|---|---|---|
| Planner Agent | User instruction, reusable knowledge → plan | Generates, scores, and selects multi-step task decompositions; uses MLLM as judge |
| Screen Parser | Screenshot | Performs OCR, icon detection, and segmentation to produce perception data |
| State Agent | Perception data, current subgoal | Updates FSM, prompts MLLM for state descriptors and pre/post-conditions |
| Actor Agent | Subgoal, perception data | Maps subgoals and perceptions to actions; executes actions over ADB |
| Reflection Agent | Observed post-action state, FSM | Compares predicted and observed states; manages rollback, recovery, and replanning |
| Mentor Agent | FSM, logs | Extracts reusable action/guidance sequences and stores them in persistent memory |
Workflow includes multi-candidate plan generation and ranking, dynamic FSM augmentation based on live perception and subgoal context, and robust error-handling via explicit state tracking and recovery logic.
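The multi-candidate ranking step can be sketched as a generate-then-score selection. In the sketch below, the `Step` type, the `select_plan` function, and the `score` callable are hypothetical; in MAPLE the scoring is done by the MLLM acting as judge, which is represented here only as an arbitrary function returning a number.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Step:
    subgoal: str    # what this step should achieve
    rationale: str  # why the step is part of the plan


Plan = List[Step]


def select_plan(candidates: Sequence[Plan], score: Callable[[Plan], float]) -> Plan:
    """Return the highest-scoring candidate (the first one wins ties)."""
    if not candidates:
        raise ValueError("at least one candidate plan is required")
    return max(candidates, key=score)
```

Keeping the scorer as a parameter makes it possible to swap a model-based judge for a deterministic heuristic during testing.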
4. Dynamic FSM Construction Algorithm
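At each task step, MAPLE runs an UpdateFSM procedure. The workflow:
- Initialize the initial state for the home screen on the first step.
- Prompt the MLLM with the current perception data and subgoal to obtain the state description, the predicted next screen, the pre-condition, and the post-condition.
- Create or retrieve the state nodes matching the description and the prediction.
- Add or augment the transition between those nodes in the transition relation.
- Actor Agent selects and executes the action.
- Return the updated FSM and action.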
This procedure enables real-time construction of navigation graphs, integration of state and action semantics, and structured context awareness during task execution.
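The steps of the update procedure can be sketched roughly as follows. This is an illustration under several assumptions rather than the paper's algorithm: the FSM is a plain dictionary, states are matched by exact description string (a real system would need fuzzier matching), the action is chosen before the edge is recorded so the edge can store it, and `annotate` and `choose_action` are hypothetical callables standing in for the MLLM prompt and the Actor Agent.

```python
from typing import Callable, Dict, Tuple


def new_fsm() -> Dict:
    """Empty FSM: no initial state, no states, no transitions."""
    return {"initial": None, "states": {}, "transitions": []}


def get_or_create_state(fsm: Dict, description: str) -> str:
    """Return the id of the state with this description, creating it if needed."""
    for state_id, known in fsm["states"].items():
        if known == description:
            return state_id
    state_id = f"s{len(fsm['states'])}"
    fsm["states"][state_id] = description
    if fsm["initial"] is None:
        fsm["initial"] = state_id  # the first state seen becomes the initial state
    return state_id


def update_fsm(
    fsm: Dict,
    perception: str,
    subgoal: str,
    annotate: Callable[[str, str], Dict[str, str]],
    choose_action: Callable[[str, str], str],
) -> Tuple[Dict, str]:
    """Run one update step and return the FSM together with the chosen action."""
    ann = annotate(perception, subgoal)
    src = get_or_create_state(fsm, ann["description"])
    dst = get_or_create_state(fsm, ann["predicted_next"])
    action = choose_action(perception, subgoal)
    for t in fsm["transitions"]:
        if (t["source"], t["action"], t["target"]) == (src, action, dst):
            # Augment the existing edge with the latest conditions.
            t["pre"], t["post"] = ann["pre"], ann["post"]
            break
    else:
        fsm["transitions"].append({"source": src, "action": action, "target": dst,
                                   "pre": ann["pre"], "post": ann["post"]})
    return fsm, action
```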
5. Evaluation Methodology and Empirical Performance
MAPLE was evaluated on two challenging benchmarks:
- Mobile-Eval-E: 25 tasks (19 cross-app), 15 apps, 364 reference actions.
- SPA-Bench: 20 English cross-app tasks, 25 apps, 262 reference actions.
Measured Metrics:
- Success Rate (SR): percentage of tasks fully completed.
- Satisfaction Score (SS): rubric item completion fraction.
- Action Accuracy (AA): alignment with human action trajectories.
- Termination Rate (TR): percentage of prematurely aborted tasks.
- Recovery Success (RS): fraction of failed subtasks successfully recovered.
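Three of these metrics are simple ratios and can be computed from per-task records, as in the sketch below. The record schema (`completed`, `aborted`, `failed_subtasks`, `recovered_subtasks`) is hypothetical, and pooling failed subtasks across all tasks for RS is an assumption; SS and AA are omitted because they depend on rubrics and reference trajectories not specified here.

```python
from typing import Dict, List


def percentage(part: int, whole: int) -> float:
    """Share of `part` in `whole` as a percentage; 0.0 when `whole` is zero."""
    return 100.0 * part / whole if whole else 0.0


def summarize(tasks: List[Dict]) -> Dict[str, float]:
    """Compute SR, TR, and RS (in percent) from per-task records."""
    total = len(tasks)
    completed = sum(1 for t in tasks if t["completed"])
    aborted = sum(1 for t in tasks if t["aborted"])
    failed = sum(t["failed_subtasks"] for t in tasks)
    recovered = sum(t["recovered_subtasks"] for t in tasks)
    return {
        "SR": percentage(completed, total),   # tasks fully completed
        "TR": percentage(aborted, total),     # tasks aborted prematurely
        "RS": percentage(recovered, failed),  # failed subtasks that were recovered
    }
```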
Results compared to Mobile-Agent-E + Evo baseline:
| Metric | Mobile-Eval-E | SPA-Bench |
|---|---|---|
| SS | 86.15% (+7.18 pp) | 88.64% (+8.33 pp) |
| AA | 83.24% (+6.59 pp) | 84.35% (+6.49 pp) |
| TR | 16.00% (–8.00 pp) | 20.00% (–5.00 pp) |
| SR | 84.00% (+12.00 pp) | 80.00% (+5.00 pp) |
| RS | 71.88% (+4.53 pp) | 66.67% (+13.81 pp) |
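Since each cell pairs MAPLE's score with its change in percentage points, the baseline score implied by the table follows by subtraction. The short sketch below performs that arithmetic; it assumes every delta is MAPLE minus baseline, and the variable names are illustrative.

```python
# (MAPLE score, delta vs. baseline), both in percent / percentage points.
reported = {
    "SS": {"Mobile-Eval-E": (86.15, 7.18), "SPA-Bench": (88.64, 8.33)},
    "AA": {"Mobile-Eval-E": (83.24, 6.59), "SPA-Bench": (84.35, 6.49)},
    "TR": {"Mobile-Eval-E": (16.00, -8.00), "SPA-Bench": (20.00, -5.00)},
    "SR": {"Mobile-Eval-E": (84.00, 12.00), "SPA-Bench": (80.00, 5.00)},
    "RS": {"Mobile-Eval-E": (71.88, 4.53), "SPA-Bench": (66.67, 13.81)},
}

# Implied baseline = MAPLE score minus delta, rounded to two decimals.
baseline = {
    metric: {bench: round(score - delta, 2) for bench, (score, delta) in row.items()}
    for metric, row in reported.items()
}

for metric, row in baseline.items():
    print(metric, row)
```

For TR, lower is better, so the negative deltas correspond to a higher implied baseline termination rate.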
Ablation studies showed that removing any key MAPLE component (Planner Agent, multi-plan selection, pre/post-conditions, or Mentor Agent) substantially degraded performance (e.g., SR dropping to 45–52% on SPA-Bench), indicating that the core modules are each necessary and work in concert. Among LLM backbones, GPT-4o yielded the strongest results, but MAPLE still outperformed the baselines with weaker models (Claude-3.5, Gemini-1.5-Pro) (Guo et al., 29 May 2025).
6. Role and Impact of Structured FSM Memory
MAPLE’s FSM memory delivers multiple functional advantages:
- Context Tracking: By recording visited and predicted states, the agent maintains navigation context over extended app flows, mitigating redundant actions and loops.
- Error Detection: Explicit pre- and post-condition annotations provide precise criteria for detecting and diagnosing execution failures, instead of relying solely on perception deltas.
- Robust Recovery: The FSM encodes reliable rollback points and recovery transitions; notably, Recovery Success on SPA-Bench rose by 13.81 percentage points over the baseline.
- Cross-task Knowledge Transfer: Persisting FSM graphs and distilled guidance cues in long-term memory accelerates subsequent planning and execution for similar app flows.
Together, these mechanisms indicate that a lightweight FSM memory can augment MLLM-based GUI agents with more structured planning, real-time verification, and more reliable error recovery. Because the FSM-centric design is model-agnostic, it can serve as a memory layer for future mobile GUI agent architectures (Guo et al., 29 May 2025).
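To make the error detection and recovery roles more concrete, the sketch below classifies the result of one action by comparing the observed screen with the predicted one and the post-condition check, and picks a rollback point. It is an assumption-laden illustration: the three outcome labels, the string-equality comparison of states, and the function names are hypothetical and are not MAPLE's actual taxonomy or API.

```python
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    SUCCESS = "success"        # reached the predicted screen, post-condition holds
    NO_EFFECT = "no_effect"    # screen did not change
    UNEXPECTED = "unexpected"  # landed somewhere other than predicted


def classify(before: str, observed: str, predicted: str,
             post_condition_holds: bool) -> Outcome:
    """Label the outcome of a single action from state ids and a condition check."""
    if observed == predicted and post_condition_holds:
        return Outcome.SUCCESS
    if observed == before:
        return Outcome.NO_EFFECT
    return Outcome.UNEXPECTED


def rollback_target(verified_path: List[str]) -> Optional[str]:
    """Most recent state reached through a verified transition, or None."""
    return verified_path[-1] if verified_path else None
```

A caller might retry the same action on `NO_EFFECT` and navigate back to `rollback_target(...)` on `UNEXPECTED`, though the appropriate policy depends on the app.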