MAPLE: Mobile GUI Task Reasoning Framework

Updated 28 April 2026
  • MAPLE is a multi-agent framework that uses persistent finite state machines to model mobile GUI navigation and task automation.
  • The system decomposes tasks into planning, execution, error recovery, and knowledge retention phases to ensure robust performance across apps.
  • Empirical evaluations show gains over the Mobile-Agent-E + Evo baseline of up to 12.00 percentage points in success rate, 6.59 in action accuracy, and 13.81 in recovery success.

MAPLE (Mobile Agent with Persistent Finite State Machines for Structured Task Reasoning) is a state-aware, multi-agent framework that enables autonomous completion of user-instructed tasks across mobile GUI environments. MAPLE abstracts app interactions as a dynamically constructed finite state machine (FSM), computationally modeling each UI screen as a distinct state and user actions as transitions. This facilitates structured representation of app navigation, robust error detection and recovery, and knowledge retention, supporting complex, cross-application mobile task automation (Guo et al., 29 May 2025).

1. System Design and Modular Architecture

MAPLE operates as a modular, multi-agent system layered over a physical or emulated Android device, controlled via the Android Debug Bridge (ADB). The core architecture is partitioned into four interdependent phases: Planning, Execution, Verification & Error Recovery, and Knowledge Retention. Each phase is managed by specialized agents that communicate through prompts to a Multimodal LLM (MLLM). The Actor Agent executes low-level, atomic GUI operations such as Tap, Type, and Swipe via the mobile API.
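As a rough illustration of what atomic GUI operations over ADB can look like, the sketch below builds argument lists for the standard `adb shell input` subcommands (`tap`, `swipe`, `text`). The helper names are hypothetical and are not taken from MAPLE; the source only states that actions are issued through ADB.

```python
from typing import List


def tap_cmd(x: int, y: int) -> List[str]:
    """Argument list for a tap at screen coordinates (x, y)."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]


def swipe_cmd(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> List[str]:
    """Argument list for a swipe from (x1, y1) to (x2, y2)."""
    return ["adb", "shell", "input", "swipe",
            str(x1), str(y1), str(x2), str(y2), str(duration_ms)]


def type_cmd(text: str) -> List[str]:
    """Argument list for typing text into the focused field.

    `input text` expects spaces encoded as %s; other shell-special
    characters would need extra escaping that this sketch omits.
    """
    return ["adb", "shell", "input", "text", text.replace(" ", "%s")]


# Example: build (but do not run) the commands for a short interaction.
commands = [tap_cmd(540, 1200), type_cmd("coffee near me"), swipe_cmd(540, 1600, 540, 400)]
```

Each list can be passed to `subprocess.run(cmd, check=True)` when a device or emulator is attached.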

Phases and Responsible Agents:

  • Planning: The Planner Agent receives a user instruction $u$ (and optionally reusable knowledge $K$) and produces a multi-step plan $\pi = (g_1, r_1), \ldots, (g_k, r_k)$, where each $g_i$ is a subgoal and $r_i$ its rationale. A two-stage workflow first generates $n$ candidate plans; the MLLM then scores them and selects one.
  • Execution: Three agents collaborate. The Screen Parser captures screenshots and performs OCR (DBNet + ConvNextViT-document), icon grounding (GroundingDINO), and icon captioning (Qwen-VL-Plus) to emit perception data $p_i$. The State Agent builds and maintains the FSM, prompting the MLLM for state summaries, predicted next screens, pre-conditions ($\mathrm{pre}^{\,i+1}$), and post-conditions ($\mathrm{post}^{i}$). The Actor Agent selects and executes UI actions $a_i$.
  • Verification & Error Recovery: The Reflection Agent compares actual vs. predicted states post-action, classifies outcomes, and triggers rollback or replanning logic upon failure.
  • Knowledge Retention: The Mentor Agent analyzes action histories and FSMs post-task, distilling reusable guidance and sequences into the persistent knowledge base $K$.

Orchestration is handled by a lightweight controller that alternates among high-level planning, execution, state-tracking, verification, and recovery.
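The generate-then-judge step of the Planning phase can be sketched as follows. This is a minimal illustration, not MAPLE's implementation: `Plan`, `select_plan`, and `toy_score` are hypothetical names, and the scoring function is a stand-in for the MLLM judge, which the source does not specify in detail.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Plan:
    # Each step pairs a subgoal with its rationale.
    steps: Tuple[Tuple[str, str], ...]


def select_plan(candidates: Sequence[Plan], score: Callable[[Plan], float]) -> Plan:
    """Return the highest-scoring candidate (first one wins on ties)."""
    if not candidates:
        raise ValueError("at least one candidate plan is required")
    return max(candidates, key=score)


def toy_score(plan: Plan) -> float:
    # Placeholder judge (assumption): prefer shorter plans.
    # In MAPLE the score would come from an MLLM prompt instead.
    return -float(len(plan.steps))


candidates = [
    Plan((("Open Maps", "need a route"),
          ("Search for the cafe", "find the destination"),
          ("Copy the address", "share it later"))),
    Plan((("Open Maps", "need a route"),
          ("Search for the cafe", "find the destination"))),
]
best = select_plan(candidates, toy_score)
```

Keeping the judge behind a plain callable makes it easy to swap the toy heuristic for a model call without touching the selection logic.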

2. Finite State Machine Formalism

MAPLE's FSM provides a structured, annotated representation of mobile app navigation in real time:

$\mathcal{M} = (S, A, T, s_0, G)$

  • $S$: Set of discovered UI states $s_i$, each corresponding to a screen.
  • $A$: Finite set of GUI actions (e.g., tap, type, swipe).
  • $T \subseteq S \times A \times S$: Transition relation; $(s_i, a_i, s_{i+1}) \in T$ denotes that action $a_i$ in $s_i$ leads to $s_{i+1}$, annotated with pre/post-conditions.
  • $s_0$: Initial state, such as the home screen.
  • $G \subseteq S$: Set of goal states marking subtask completions.

Each $s_i$ includes:

  • Natural-language description (current screen).
  • Prediction (expected next screen).
  • Pre-condition $\mathrm{pre}^{i}$.
  • Post-condition $\mathrm{post}^{i}$.

The FSM is incrementally constructed: upon each screen transition, the MLLM generates descriptions and conditions, after which the State Agent updates the FSM.
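One way to hold such an annotated FSM in memory is sketched below. The class and field names are illustrative assumptions, and matching states by exact description string is a simplification; the source describes state matching as driven by MLLM-generated descriptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class State:
    description: str      # natural-language summary of the current screen
    prediction: str = ""  # expected next screen
    pre: str = ""         # pre-condition
    post: str = ""        # post-condition


@dataclass
class FSM:
    states: List[State] = field(default_factory=list)
    actions: Set[str] = field(default_factory=set)
    # Transition relation: (source index, action, target index) -> annotations.
    transitions: Dict[Tuple[int, str, int], Dict[str, str]] = field(default_factory=dict)
    initial: int = 0                              # index of the initial state
    goals: Set[int] = field(default_factory=set)  # indices of goal states

    def state_id(self, description: str) -> int:
        """Return the index of a matching state, creating it if needed."""
        for idx, state in enumerate(self.states):
            if state.description == description:
                return idx
        self.states.append(State(description))
        return len(self.states) - 1

    def add_transition(self, src: int, action: str, dst: int,
                       pre: str = "", post: str = "") -> None:
        """Add a transition, or overwrite its annotations if it exists."""
        self.actions.add(action)
        self.transitions[(src, action, dst)] = {"pre": pre, "post": post}

    def successors(self, src: int) -> List[Tuple[str, int]]:
        """List (action, target index) pairs leaving a state."""
        return [(a, d) for (s, a, d) in self.transitions if s == src]


fsm = FSM()
home = fsm.state_id("home screen")
settings = fsm.state_id("settings main page")
fsm.add_transition(home, "tap(Settings icon)", settings,
                   pre="Settings icon visible", post="Settings list shown")
fsm.goals.add(settings)
```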

3. Specialized Agent Functions and Workflows

The system relies on six specialized components (five agents plus the Screen Parser), each with dedicated but interdependent responsibilities.

| Agent | Inputs / Outputs | Roles and Workflow Highlights |
| --- | --- | --- |
| Planner Agent | $u$, $K$ → plan $\pi$ | Generates, scores, and selects multi-step task decompositions; uses MLLM as judge |
| Screen Parser | Screenshot → $p_i$ | Performs OCR, icon detection, and segmentation to produce perception $p_i$ |
| State Agent | $p_i$, $g_i$ | Updates FSM, prompts MLLM for state descriptors and pre/post-conditions |
| Actor Agent | $g_i$, $p_i$ | Maps subgoals and perceptions to actions $a_i$; executes actions over ADB |
| Reflection Agent | Post-action observation, FSM | Compares predicted and observed states; manages rollback, recovery, and replanning |
| Mentor Agent | FSM, logs | Extracts reusable action/guidance sequences and stores them in memory $K$ |

Workflow includes multi-candidate plan generation and ranking, dynamic FSM augmentation based on live perception and subgoal context, and robust error-handling via explicit state tracking and recovery logic.

4. Dynamic FSM Construction Algorithm

At each task step $i$, MAPLE runs an UpdateFSM procedure. The workflow:

  1. Initialize $s_0$ for the home screen if $i = 0$.
  2. Prompt the MLLM with $p_i$, $g_i$ to obtain the current-state description, the predicted next screen, $\mathrm{pre}^{\,i+1}$, and $\mathrm{post}^{i}$.
  3. Create or retrieve state nodes $s_i$ and $s_{i+1}$ matching the description and the prediction.
  4. Add or augment transition $(s_i, a_i, s_{i+1})$ in $T$.
  5. Actor Agent selects/executes $a_i$.
  6. Return the updated FSM and action.

This procedure enables real-time construction of navigation graphs, integration of state and action semantics, and structured context awareness during task execution.
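The six steps above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: `annotate` and `choose_action` are hypothetical callables standing in for the MLLM prompt and the Actor Agent, the FSM is a plain dictionary, and the sketch picks the action before recording the transition so that the edge can be labeled with it.

```python
from typing import Callable, Dict, Tuple

Annotation = Dict[str, str]  # keys: description, prediction, pre_next, post


def new_fsm() -> dict:
    return {"states": [], "transitions": {}}


def _state_id(fsm: dict, description: str) -> int:
    """Create or retrieve a state node by its description (exact match)."""
    if description not in fsm["states"]:
        fsm["states"].append(description)
    return fsm["states"].index(description)


def update_fsm(fsm: dict, step: int, perception: str, subgoal: str,
               annotate: Callable[[str, str], Annotation],
               choose_action: Callable[[str, str], str]) -> Tuple[dict, str]:
    # Step 1: seed the FSM with the home screen on the first step.
    if step == 0 and not fsm["states"]:
        _state_id(fsm, "home screen")
    # Step 2: ask the (stubbed) MLLM for description, prediction, and conditions.
    ann = annotate(perception, subgoal)
    # Step 3: create or retrieve the current and predicted state nodes.
    src = _state_id(fsm, ann["description"])
    dst = _state_id(fsm, ann["prediction"])
    # Step 5 (done early here): the Actor Agent stand-in picks the action.
    action = choose_action(subgoal, perception)
    # Step 4: add the transition, or refresh its annotations if it exists.
    fsm["transitions"][(src, action, dst)] = {"pre": ann["pre_next"], "post": ann["post"]}
    # Step 6: return the updated FSM and the chosen action.
    return fsm, action


def stub_annotate(perception: str, subgoal: str) -> Annotation:
    return {"description": "home screen", "prediction": "settings main page",
            "pre_next": "Settings list visible", "post": "Settings icon tapped"}


def stub_actor(subgoal: str, perception: str) -> str:
    return "tap(Settings icon)"


fsm, action = update_fsm(new_fsm(), 0, "icons: Settings, Maps", "open Settings",
                         stub_annotate, stub_actor)
```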

5. Evaluation Methodology and Empirical Performance

MAPLE was evaluated on two challenging benchmarks:

  • Mobile-Eval-E: 25 tasks (19 cross-app), 15 apps, 364 reference actions.
  • SPA-Bench: 20 English cross-app tasks, 25 apps, 262 reference actions.

Measured Metrics:

  1. Success Rate (SR): percentage of tasks fully completed.
  2. Satisfaction Score (SS): rubric item completion fraction.
  3. Action Accuracy (AA): alignment with human action trajectories.
  4. Termination Rate (TR): percentage of prematurely aborted tasks.
  5. Recovery Success (RS): fraction of failed subtasks successfully recovered.
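To make the definitions concrete, the sketch below computes SR, TR, SS, and RS from per-task records. The record layout and the sample values are invented for illustration, and pooling rubric items and failures across tasks is an assumption; the source does not say whether these metrics are pooled or averaged per task. AA is omitted because the source does not specify how alignment with human trajectories is scored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    completed: bool     # task fully completed
    aborted: bool       # task terminated prematurely
    rubric_met: int     # rubric items satisfied
    rubric_total: int   # rubric items defined for the task
    failures: int       # failed subtasks encountered
    recovered: int      # failed subtasks successfully recovered


def pct(numerator: float, denominator: float) -> float:
    """Percentage helper that returns 0.0 for an empty denominator."""
    return 100.0 * numerator / denominator if denominator else 0.0


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    n = len(results)
    return {
        "SR": pct(sum(r.completed for r in results), n),
        "TR": pct(sum(r.aborted for r in results), n),
        "SS": pct(sum(r.rubric_met for r in results),
                  sum(r.rubric_total for r in results)),
        "RS": pct(sum(r.recovered for r in results),
                  sum(r.failures for r in results)),
    }


# Invented sample records, not data from the paper.
sample = [
    TaskResult(True, False, 5, 5, 1, 1),
    TaskResult(True, False, 4, 5, 0, 0),
    TaskResult(False, True, 2, 5, 2, 1),
    TaskResult(True, False, 5, 5, 0, 0),
]
metrics = summarize(sample)
```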

Results compared to Mobile-Agent-E + Evo baseline:

| Metric | Mobile-Eval-E | SPA-Bench |
| --- | --- | --- |
| SS | 86.15% (+7.18 pp) | 88.64% (+8.33 pp) |
| AA | 83.24% (+6.59 pp) | 84.35% (+6.49 pp) |
| TR | 16.00% (–8.00 pp) | 20.00% (–5.00 pp) |
| SR | 84.00% (+12.00 pp) | 80.00% (+5.00 pp) |
| RS | 71.88% (+4.53 pp) | 66.67% (+13.81 pp) |
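Because the changes are given in percentage points, the baseline scores implied by the table can be recovered by subtraction. The sketch below does this with exact decimal arithmetic; the derived values are implied by the reported figures, not quoted from the source, and the dictionary layout is an illustrative choice.

```python
from decimal import Decimal

# benchmark -> metric -> (MAPLE score in %, change vs. baseline in pp)
REPORTED = {
    "Mobile-Eval-E": {
        "SS": ("86.15", "7.18"), "AA": ("83.24", "6.59"), "TR": ("16.00", "-8.00"),
        "SR": ("84.00", "12.00"), "RS": ("71.88", "4.53"),
    },
    "SPA-Bench": {
        "SS": ("88.64", "8.33"), "AA": ("84.35", "6.49"), "TR": ("20.00", "-5.00"),
        "SR": ("80.00", "5.00"), "RS": ("66.67", "13.81"),
    },
}


def implied_baseline(score: str, delta_pp: str) -> Decimal:
    """Baseline = reported score minus the reported change in percentage points."""
    return Decimal(score) - Decimal(delta_pp)


baselines = {
    bench: {metric: implied_baseline(score, delta)
            for metric, (score, delta) in metrics.items()}
    for bench, metrics in REPORTED.items()
}

for bench, metrics in baselines.items():
    row = ", ".join(f"{metric} {value}%" for metric, value in metrics.items())
    print(f"{bench}: {row}")
```

For TR a lower value is better, so the negative change corresponds to a higher baseline termination rate.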

Ablation studies demonstrated that removing any key MAPLE component (Planner Agent, multi-plan selection, pre/post-conditions, Mentor Agent) led to substantial performance degradation (e.g., SR as low as 45–52% on SPA-Bench), confirming the necessity and synergy of all core modules. With respect to LLM backbones, GPT-4o yielded the strongest results, but MAPLE maintained superiority over baselines even with weaker models (Claude-3.5, Gemini-1.5-Pro) (Guo et al., 29 May 2025).

6. Role and Impact of Structured FSM Memory

MAPLE’s FSM memory delivers multiple functional advantages:

  • Context Tracking: By recording visited and predicted states, the agent maintains navigation context over extended app flows, mitigating redundant actions and loops.
  • Error Detection: Explicit pre- and post-condition annotations provide precise criteria for detecting and diagnosing execution failures, instead of relying solely on perception deltas.
  • Robust Recovery: The FSM encodes reliable rollback points and recovery transitions; with this design, Recovery Success on SPA-Bench rose by 13.81 percentage points over the baseline.
  • Cross-task Knowledge Transfer: Persisting FSM graphs and distilled guidance cues in long-term memory $K$ accelerates subsequent planning and execution for similar app flows.
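The error-detection and rollback ideas in the list above can be illustrated with a small check that runs after each action. Everything here is an assumption made for the sketch: the `reflect` function, the three outcome labels, and the keyword-based `judge`, which stands in for an MLLM comparison of natural-language conditions against the observed screen.

```python
from typing import Callable, List, Optional, Tuple

# (expectation, observed screen summary) -> does the expectation hold?
Judge = Callable[[str, str], bool]


def reflect(predicted: str, post: str, observed: str,
            judge: Judge, verified: List[str]) -> Tuple[str, Optional[str]]:
    """Classify an action outcome and pick a rollback target on failure.

    `verified` is the list of screens already confirmed; the most recent
    entry serves as the rollback point.
    """
    if judge(predicted, observed) and judge(post, observed):
        verified.append(observed)
        return "success", None
    if verified:
        return "rollback", verified[-1]
    return "replan", None


def keyword_judge(expectation: str, observed: str) -> bool:
    # Toy stand-in (assumption): case-insensitive substring match.
    return expectation.lower() in observed.lower()


verified = ["home screen"]
ok = reflect("wi-fi settings", "toggle",
             "Wi-Fi settings page with toggle", keyword_judge, verified)
bad = reflect("bluetooth settings", "device list",
              "Wi-Fi settings page with toggle", keyword_judge, verified)
```

Checking both the predicted state and the post-condition mirrors the point that failures are detected against explicit criteria rather than raw screen differences.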

These mechanisms collectively indicate that a lightweight FSM memory augments MLLM-based GUI agents, yielding more structured planning, real-time verification, and reliable error recovery. Because the FSM-centric design is model-agnostic, it can serve as a memory layer for future mobile GUI agent architectures (Guo et al., 29 May 2025).
