MAPLE: Mobile GUI Task Reasoning Framework

Updated 28 April 2026

MAPLE is a multi-agent framework that uses persistent finite state machines to model mobile GUI navigation and task automation.
The system decomposes tasks into planning, execution, error recovery, and knowledge retention phases to ensure robust performance across apps.
Empirical evaluations demonstrate MAPLE’s notable improvements in success rate, action accuracy, and recovery success compared to earlier baselines.

MAPLE (Mobile Agent with Persistent Finite State Machines for Structured Task Reasoning) is a state-aware, multi-agent framework that enables autonomous completion of user-instructed tasks across mobile GUI environments. MAPLE abstracts app interactions as a dynamically constructed finite state machine (FSM), computationally modeling each UI screen as a distinct state and user actions as transitions. This facilitates structured representation of app navigation, robust error detection and recovery, and knowledge retention, supporting complex, cross-application mobile task automation (Guo et al., 29 May 2025).

1. System Design and Modular Architecture

MAPLE operates as a modular, multi-agent system layered over a physical or emulated Android device, controlled via the Android Debug Bridge (ADB). The core architecture is partitioned into four interdependent phases: Planning, Execution, Verification & Error Recovery, and Knowledge Retention. Each phase is managed by specialized agents that communicate through prompts to a Multimodal LLM (MLLM). The Actor Agent executes low-level, atomic GUI operations such as Tap, Type, and Swipe via the mobile API.

Phases and Responsible Agents:

Planning: The Planner Agent receives a user instruction $u$ (and optionally reusable knowledge $K$), producing a multi-step plan $\pi = \langle (g_1, r_1), \ldots, (g_k, r_k) \rangle$, where each $g_i$ is a subgoal and $r_i$ its rationale. A two-stage workflow first generates $n$ candidate plans, then has the MLLM score them and select one.
Execution: Three agents collaborate: the Screen Parser captures screenshots and performs OCR (DBNet + ConvNextViT-document), icon grounding (GroundingDINO), and icon captioning (Qwen-VL-Plus) to emit perception data $p_i$; the State Agent builds and maintains the FSM, prompting the MLLM for state summaries, predicted next screens, pre-conditions ($\mathrm{pre}^{\,i+1}$), and post-conditions ($\mathrm{post}^i$); the Actor Agent selects and executes UI actions $a_i$.
Verification & Error Recovery: The Reflection Agent compares actual vs. predicted states post-action, classifies outcomes, and triggers rollback or replanning logic upon failure.
Knowledge Retention: The Mentor Agent analyzes action histories and FSMs post-task, distilling reusable guidance and sequences into the persistent knowledge base $K$.
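The multi-candidate planning step can be illustrated with a minimal Python sketch. It assumes the candidates already exist and that scoring is exposed as a plain function; `Plan`, `select_plan`, and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for the MLLM judge, whose prompt and scale are not described here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Subgoal = Tuple[str, str]  # (subgoal g_i, rationale r_i)


@dataclass
class Plan:
    steps: List[Subgoal]


def select_plan(candidates: List[Plan], score: Callable[[Plan], float]) -> Plan:
    """Return the highest-scoring candidate; ties go to the earliest one."""
    if not candidates:
        raise ValueError("need at least one candidate plan")
    return max(candidates, key=score)


def toy_score(plan: Plan) -> float:
    # Assumption: a toy judge that prefers shorter plans.
    # In MAPLE the score comes from the MLLM, not from a heuristic.
    return -float(len(plan.steps))


candidates = [
    Plan([("open Maps", "need a location"),
          ("search cafe", "find options"),
          ("share result", "task asks to share")]),
    Plan([("open Maps", "need a location"),
          ("search and share cafe", "combine steps")]),
]
best = select_plan(candidates, toy_score)
print(len(best.steps))
```

Passing the scorer as a callable keeps the selection logic independent of whichever model acts as judge.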

Orchestration is handled by a lightweight controller that alternates among high-level planning, execution, state-tracking, verification, and recovery.

2. Finite State Machine Formalism

MAPLE's FSM provides a structured, annotated representation of mobile app navigation in real time:

$\mathcal{M} = (S, A, T, s_0, S_g)$

$S$: Set of discovered UI states, $s_i \in S$, each corresponding to a screen.
$A$: Finite set of GUI actions (e.g., tap, type, swipe).
$T \subseteq S \times A \times S$: Transition relation; $(s_i, a_i, s_{i+1}) \in T$, denoting that action $a_i$ in $s_i$ leads to $s_{i+1}$, annotated with pre/post-conditions.
$s_0$: Initial state, such as the home screen.
$S_g \subseteq S$: Set of goal states marking subtask completions.

Each $s_i$ includes:

Natural-language description $d_i$ (current screen).
Prediction $\hat{d}_{i+1}$ (expected next screen).
Pre-condition $\mathrm{pre}^{\,i+1}$.
Post-condition $\mathrm{post}^i$.

The FSM is incrementally constructed: upon each screen transition, the MLLM generates descriptions and conditions, after which the State Agent updates the FSM.
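As an illustration of the structure described above, the following Python sketch stores states with their four annotations and transitions as (source, action, destination) triples. The class and field names are assumptions made for this example and are not taken from the MAPLE implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class State:
    description: str                   # natural-language summary of the screen
    prediction: Optional[str] = None   # expected next screen
    pre: Optional[str] = None          # pre-condition
    post: Optional[str] = None         # post-condition


@dataclass(frozen=True)
class Transition:
    src: int      # index of the source state
    action: str   # GUI action, e.g. "tap(Settings)"
    dst: int      # index of the destination state


@dataclass
class FSM:
    states: List[State] = field(default_factory=list)
    transitions: Set[Transition] = field(default_factory=set)
    initial: int = 0
    goals: Set[int] = field(default_factory=set)

    def add_state(self, state: State) -> int:
        """Append a state and return its index."""
        self.states.append(state)
        return len(self.states) - 1

    def add_transition(self, src: int, action: str, dst: int) -> None:
        self.transitions.add(Transition(src, action, dst))

    def successors(self, src: int) -> Dict[str, int]:
        """Map each action available in `src` to the state it leads to."""
        return {t.action: t.dst for t in self.transitions if t.src == src}


fsm = FSM()
home = fsm.add_state(State("Home screen", prediction="Settings page"))
settings = fsm.add_state(State("Settings page"))
fsm.add_transition(home, "tap(Settings)", settings)
fsm.goals.add(settings)
print(fsm.successors(home))
```

Storing transitions in a set makes re-adding an already known edge a no-op, which suits a graph that is extended step by step.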

3. Specialized Agent Functions and Workflows

The system relies on six specialized components, each with dedicated responsibilities and dependencies on the others.

Agent	Inputs → Outputs	Roles and Workflow Highlights
Planner Agent	$u$, $K$ → plan $\pi$	Generates, scores, and selects multi-step task decompositions; uses MLLM as judge
Screen Parser	Screenshot → perception $p_i$	Performs OCR, icon detection, and segmentation to produce perception $p_i$
State Agent	$p_i$, $g_i$	Updates FSM, prompts MLLM for state descriptors and pre/post-conditions
Actor Agent	$g_i$, $p_i$ → action $a_i$	Maps subgoals and perceptions to actions $a_i$; executes actions over ADB
Reflection Agent	Post-action observation, FSM	Compares predicted and observed states; manages rollback, recovery, and replanning
Mentor Agent	FSM, logs	Extracts reusable action/guidance sequences and stores them in memory $K$

Together, these components cover multi-candidate plan generation and ranking, dynamic FSM augmentation based on live perception and subgoal context, and error handling through explicit state tracking and recovery logic.

4. Dynamic FSM Construction Algorithm

At each task step $i$, MAPLE runs an UpdateFSM procedure. The workflow:

Initialize $s_0$ for the home screen if $i = 0$.
Prompt the MLLM with perception $p_i$ and subgoal $g_i$ to obtain the state description $d_i$, the predicted next-screen description $\hat{d}_{i+1}$, $\mathrm{pre}^{\,i+1}$, and $\mathrm{post}^i$.
Create or retrieve state nodes $s_i$ and $s_{i+1}$ matching $d_i$, $\hat{d}_{i+1}$.
Add or augment transition $(s_i, a_i, s_{i+1})$ in $T$.
Actor Agent selects/executes $a_i$.
Return the updated FSM and action.

This procedure enables real-time construction of navigation graphs, integration of state and action semantics, and structured context awareness during task execution.
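A compact Python sketch of one such update step follows. It is written under several assumptions: the MLLM call and the Actor Agent are passed in as plain functions (`summarize` and `choose_action`, both hypothetical), states are matched by exact string equality of their descriptions, and the action is chosen before the transition is recorded so that the edge can be labeled with it. None of these details are confirmed by the source.

```python
from typing import Callable, Dict, List, Tuple

Info = Dict[str, str]  # keys: description, prediction, pre, post


class FSM:
    def __init__(self) -> None:
        self.states: List[str] = []  # state index -> description
        self.transitions: Dict[Tuple[int, str, int], Dict[str, str]] = {}

    def get_or_create(self, description: str) -> int:
        """Return the index of a matching state, creating it if needed."""
        if description in self.states:
            return self.states.index(description)
        self.states.append(description)
        return len(self.states) - 1


def update_fsm(
    fsm: FSM,
    step: int,
    perception: str,
    subgoal: str,
    summarize: Callable[[str, str], Info],
    choose_action: Callable[[str, str], str],
) -> Tuple[FSM, str]:
    if step == 0 and not fsm.states:
        fsm.get_or_create("home screen")          # initial state s_0
    info = summarize(perception, subgoal)         # stand-in for the MLLM prompt
    src = fsm.get_or_create(info["description"])  # current state s_i
    dst = fsm.get_or_create(info["prediction"])   # predicted state s_{i+1}
    action = choose_action(subgoal, perception)   # stand-in for the Actor Agent
    # Add the edge, or overwrite its annotations if it already exists.
    fsm.transitions[(src, action, dst)] = {"pre": info["pre"], "post": info["post"]}
    return fsm, action


def fake_summarize(perception: str, subgoal: str) -> Info:
    return {"description": "home screen", "prediction": "Settings main page",
            "pre": "Settings icon visible", "post": "Settings page shown"}


def fake_choose(subgoal: str, perception: str) -> str:
    return "tap(Settings)"


fsm, action = update_fsm(FSM(), 0, "icons: Settings, Maps", "open Settings",
                         fake_summarize, fake_choose)
print(fsm.states, action)
```

Exact string matching is the weakest part of this sketch; a real system would need a more tolerant way to decide that two descriptions refer to the same screen.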

5. Evaluation Methodology and Empirical Performance

MAPLE was evaluated on two challenging benchmarks:

Mobile-Eval-E: 25 tasks (19 cross-app), 15 apps, 364 reference actions.
SPA-Bench: 20 English cross-app tasks, 25 apps, 262 reference actions.

Measured Metrics:

Success Rate (SR): percentage of tasks fully completed.
Satisfaction Score (SS): rubric item completion fraction.
Action Accuracy (AA): alignment with human action trajectories.
Termination Rate (TR): percentage of prematurely aborted tasks.
Recovery Success (RS): fraction of failed subtasks successfully recovered.
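To make the definitions concrete, the sketch below computes SR, TR, SS, and RS from per-task records. The record layout (`TaskResult`) and the pooled aggregation for SS and RS are assumptions of this example, and the four records are toy data, not results from the paper. AA is left out because the text does not specify how alignment with human trajectories is scored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    completed: bool     # task fully completed
    aborted: bool       # task terminated prematurely
    rubric_met: int     # rubric items satisfied
    rubric_total: int   # rubric items defined for the task
    failures: int       # failed subtasks encountered
    recovered: int      # failed subtasks that were recovered


def pct(num: float, den: float) -> float:
    """Percentage num/den, or 0.0 when the denominator is zero."""
    return 100.0 * num / den if den else 0.0


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    n = len(results)
    return {
        "SR": pct(sum(r.completed for r in results), n),
        "TR": pct(sum(r.aborted for r in results), n),
        # Assumption: SS and RS are pooled over all tasks rather than averaged per task.
        "SS": pct(sum(r.rubric_met for r in results),
                  sum(r.rubric_total for r in results)),
        "RS": pct(sum(r.recovered for r in results),
                  sum(r.failures for r in results)),
    }


# Toy records for illustration only.
results = [
    TaskResult(True, False, 5, 5, 1, 1),
    TaskResult(True, False, 4, 5, 0, 0),
    TaskResult(False, True, 2, 5, 2, 1),
    TaskResult(True, False, 5, 5, 1, 1),
]
metrics = summarize(results)
print(metrics)  # {'SR': 75.0, 'TR': 25.0, 'SS': 80.0, 'RS': 75.0}
```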

Results, with the change relative to the Mobile-Agent-E + Evo baseline in parentheses (pp = percentage points; lower is better for TR):

Metric	Mobile-Eval-E	SPA-Bench
SS	86.15% (+7.18 pp)	88.64% (+8.33 pp)
AA	83.24% (+6.59 pp)	84.35% (+6.49 pp)
TR	16.00% (–8.00 pp)	20.00% (–5.00 pp)
SR	84.00% (+12.00 pp)	80.00% (+5.00 pp)
RS	71.88% (+4.53 pp)	66.67% (+13.81 pp)
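The table reports MAPLE's scores together with deltas, so the baseline values are implied rather than listed. The short Python calculation below recovers them by subtraction and converts the SR percentages into task counts using the benchmark sizes given earlier (25 and 20 tasks). The derived numbers are arithmetic consequences of the table, not figures quoted from the paper.

```python
# (MAPLE score in %, change vs. baseline in pp), copied from the table above.
results = {
    "Mobile-Eval-E": {"SS": (86.15, 7.18), "AA": (83.24, 6.59),
                      "TR": (16.00, -8.00), "SR": (84.00, 12.00),
                      "RS": (71.88, 4.53)},
    "SPA-Bench": {"SS": (88.64, 8.33), "AA": (84.35, 6.49),
                  "TR": (20.00, -5.00), "SR": (80.00, 5.00),
                  "RS": (66.67, 13.81)},
}
tasks = {"Mobile-Eval-E": 25, "SPA-Bench": 20}

# Implied baseline = MAPLE score minus the reported delta.
baseline = {
    bench: {metric: round(score - delta, 2) for metric, (score, delta) in rows.items()}
    for bench, rows in results.items()
}

# Number of fully completed tasks implied by each success rate.
maple_successes = {b: round(results[b]["SR"][0] / 100 * tasks[b]) for b in tasks}
baseline_successes = {b: round(baseline[b]["SR"] / 100 * tasks[b]) for b in tasks}

for bench in tasks:
    print(bench, baseline[bench], maple_successes[bench], baseline_successes[bench])
```

On Mobile-Eval-E this gives 21 of 25 tasks for MAPLE against an implied 18 for the baseline, and on SPA-Bench 16 of 20 against an implied 15.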

Ablation studies demonstrated that removing any key MAPLE component (Planner Agent, multi-plan selection, pre/post-conditions, Mentor Agent) led to substantial performance degradation (e.g., SR as low as 45–52% on SPA-Bench), confirming the necessity and synergy of all core modules. With respect to LLM backbones, GPT-4o yielded the strongest results, but MAPLE maintained superiority over baselines even with weaker models (Claude-3.5, Gemini-1.5-Pro) (Guo et al., 29 May 2025).

6. Role and Impact of Structured FSM Memory

MAPLE’s FSM memory delivers multiple functional advantages:

Context Tracking: By recording visited and predicted states, the agent maintains navigation context over extended app flows, mitigating redundant actions and loops.
Error Detection: Explicit pre- and post-condition annotations provide precise criteria for detecting and diagnosing execution failures, instead of relying solely on perception deltas.
Robust Recovery: The FSM encodes reliable rollback points and recovery transitions; notably, Recovery Success on SPA-Bench rose by 13.81 percentage points over the Mobile-Agent-E + Evo baseline.
Cross-task Knowledge Transfer: Persisting FSM graphs and distilled guidance cues in long-term memory $K$ accelerates subsequent planning and execution for similar app flows.
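The error-detection and rollback ideas above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: a post-condition is checked with a case-insensitive substring test (`substring_holds`, a hypothetical stand-in for the MLLM's judgement), and the rollback point is taken to be the most recent visited state whose post-condition was verified. MAPLE's actual recovery policy may differ.

```python
from typing import List, Optional, Tuple


def substring_holds(post_condition: str, observed: str) -> bool:
    # Assumption: a toy check; MAPLE relies on the MLLM to compare
    # the expected and observed screens.
    return post_condition.lower() in observed.lower()


def rollback_target(visited: List[int], verified: List[bool]) -> Optional[int]:
    """Return the most recent visited state that was verified, if any."""
    for state, ok in zip(reversed(visited), reversed(verified)):
        if ok:
            return state
    return None


# Each entry: (state index, expected post-condition, observed screen summary).
trace: List[Tuple[int, str, str]] = [
    (0, "home screen", "Home screen with app icons"),
    (1, "settings page", "Settings page listing Wi-Fi"),
    (2, "wi-fi page", "Settings page listing Wi-Fi"),  # the tap had no effect
]

visited = [state for state, _, _ in trace]
verified = [substring_holds(post, obs) for _, post, obs in trace]
target = rollback_target(visited, verified)
print(verified, target)  # [True, True, False] 1
```

Here the third step fails its post-condition, so the agent would return to state 1 and retry or replan from there.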

These mechanisms collectively demonstrate that a lightweight FSM memory augments MLLM-based GUI agents, yielding improved structured planning, real-time verification, and reliable error recovery. Because the FSM-centric design is model-agnostic, it can serve as a memory layer for future mobile GUI agent architectures (Guo et al., 29 May 2025).


References

Guo et al. (29 May 2025). MAPLE: A Mobile Agent with Persistent Finite State Machines for Structured Task Reasoning.
