BEAP-Agent: GUI Automation Framework
- BEAP-Agent is a GUI task automation framework that formalizes task execution as recursive tree search over deterministic state-transition graphs.
- It employs a tri-module architecture—Planner, Executor, and Tracker—to dynamically plan tasks and enable robust multi-level backtracking.
- Empirical benchmarks demonstrate that BEAP-Agent achieves 28.2% accuracy on complex desktop tasks, with ablations attributing the gains to its recovery strategies.
BEAP-Agent is a framework for GUI agents that combines depth-first search (DFS)-based exploration with systematic, multi-level backtracking and dynamic task tracking, specifically designed to overcome the lack of robust recovery mechanisms in prior systems. BEAP-Agent formalizes GUI task execution as recursive tree search over a deterministic state-transition graph and implements an architecture comprising Planner, Executor, and Tracker modules. This enables principled handling of long-horizon tasks, rigorous adaptation after erroneous path exploration, and empirical performance gains on challenging desktop automation benchmarks (Lu et al., 29 Jan 2026).
1. Formal Modeling of GUI Task Execution
BEAP-Agent frames GUI task automation as a search process on the tuple $(\mathcal{S}, \mathcal{A}, T)$, where:
- $\mathcal{S}$ is the set of all GUI states (e.g., screenshots plus any internal representations),
- $\mathcal{A}(s)$ is the set of legal actions at state $s$,
- $T: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the deterministic environment transition function such that $s' = T(s, a)$.
Execution is structured as constructing a search tree rooted at the initial state, whose nodes are visited states and whose edges are executed actions.
Transition history is tracked within an explored marker set $E$. For each state $s$, the unexplored outgoing actions are $U(s) = \{a \in \mathcal{A}(s) : (s, a) \notin E\}$. Depth-first recursive traversal expands one action from $U(s)$ at a time, recursing into the child state $T(s, a)$ and returning to $s$ when a branch is exhausted.
Failed transitions $(s, a)$ are stored in the failure set $F$ to avoid future repetition.
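The search formulation above can be sketched in Python. The toy transition graph, state names, and action names below are illustrative assumptions, not part of the paper; only the structure (deterministic transitions, explored set, failure set, depth-first recursion) mirrors the formalization.

```python
# Sketch of the (S, A, T) search model with explored and failure sets.
# The toy state graph and action names are made up for illustration.

# Deterministic transition function T: (state, action) -> next state.
TRANSITIONS = {
    ("home", "open_menu"): "menu",
    ("menu", "click_settings"): "settings",
    ("menu", "click_broken"): "error",        # a transition that fails
    ("settings", "toggle_dark_mode"): "done",
}

def legal_actions(state):
    """A(s): legal actions available at a state."""
    return [a for (s, a) in TRANSITIONS if s == state]

def dfs(state, goal, explored, failures, path):
    """Depth-first traversal over the deterministic transition graph.

    `explored` marks (state, action) edges already expanded; `failures`
    prunes edges known to lead to bad states.
    """
    if state == goal:
        return path
    for action in legal_actions(state):
        edge = (state, action)
        if edge in explored or edge in failures:
            continue
        explored.add(edge)
        nxt = TRANSITIONS[edge]
        if nxt == "error":                    # failed transition: record and prune
            failures.add(edge)
            continue
        result = dfs(nxt, goal, explored, failures, path + [action])
        if result is not None:
            return result
    return None                               # branch exhausted: caller backtracks

explored, failures = set(), set()
plan = dfs("home", "done", explored, failures, [])
print(plan)  # ['open_menu', 'click_settings', 'toggle_dark_mode']
```

The explored set prevents re-expanding the same edge, and the failure set persists across branches, which is the mechanism the paper uses to avoid repeating failed transitions.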
2. System Architecture: Planner, Executor, Tracker
BEAP-Agent is structured as three interactive modules, each with clearly defined responsibilities:
- Planner: Receives the current state $s$, full task description $G$, and failure set $F$. It outputs a plan $P = (g_1, \dots, g_n)$ of subtask instructions. After backtracking, completed subtasks are preserved, new instructions avoid transitions in $F$, and the remaining plan is regenerated. The process is formalized via pseudocode:

```
function PlanGenerator(s, G, F):
    identify high-level subtasks g_1, ..., g_n for G
    for each subtask g_i:
        if g_i was completed:
            keep g_i unchanged
        else:
            regenerate instructions for g_i, avoiding transitions in F
    return P = (g_1, ..., g_n)
```
- Executor: Given the current state $s$ and active subtask instruction, selects and issues primitive GUI actions using PyAutoGUI. Each executed transition $(s, a, s')$ is appended to the history stack $H$. In backtrack mode, it executes inverse actions $a^{-1}$, restoring previous state snapshots.
- Tracker: Operates in normal and backtrack modes. In normal mode, it monitors progress:
  - Updates subtask statuses by inspecting the current state $s$.
  - Emits event SUCCESS when the task is satisfied, CONTINUE when further progress is feasible, BACKTRACK if the current branch has failed, and FAIL for an unrecoverable outcome.

In backtrack mode, the Tracker validates state restoration, outputting whether the restored state matches the recorded snapshot.
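A minimal sketch of how the three modules might interact in one step of the loop. The class shapes, the toy transition encoding, and the inverse-action table are hypothetical; only the event names (SUCCESS, CONTINUE, BACKTRACK) and the division of responsibilities follow the description above.

```python
# Hypothetical Planner/Executor/Tracker interaction. Event names follow
# the text; class interfaces and the inverse table are assumptions.

# Toy inverses for primitive GUI actions, used in backtrack mode.
INVERSE = {"open_menu": "close_menu", "check_box": "uncheck_box"}

class Tracker:
    def normal(self, state, plan):
        """Emit an event by inspecting the current state against the plan."""
        if not plan:
            return "SUCCESS"
        if state == "stuck":
            return "BACKTRACK"
        return "CONTINUE"

    def back(self, state, snapshot):
        """Backtrack mode: validate that restoration reached the snapshot."""
        return state == snapshot

class Executor:
    def act(self, state, action):
        """Issue a primitive action; the real system calls PyAutoGUI here."""
        return f"{state}/{action}"            # toy deterministic transition

    def invert(self, action):
        return INVERSE.get(action)

tracker, executor = Tracker(), Executor()
state, history = "home", []
for step in ["open_menu", "check_box"]:
    if tracker.normal(state, [step]) == "CONTINUE":
        new_state = executor.act(state, step)
        history.append((state, step, new_state))   # push onto history stack H
        state = new_state
print(state)  # home/open_menu/check_box
```

The history stack recorded by the Executor is what later makes multi-level backtracking possible: each popped entry supplies both the inverse action to replay and the snapshot the Tracker validates against.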
3. Backtracking Protocol: Detection, Pruning, Recovery
BEAP-Agent’s core novelty lies in its systematic, long-range backtracking pipeline, which extends beyond ad hoc or single-step undo. The formal process is:
```
function DFSExec(s, P, H, F):
    e ← TrackerNormal(s, P)
    if e = SUCCESS:
        return SUCCESS
    else if e = CONTINUE:
        a ← Executor(s, P)
        append (s, a, s') to H, mark (s, a) explored
        s ← s'
        return DFSExec(s, P, H, F)
    else if e = BACKTRACK:
        return BacktrackRoutine(s, P, H, F)
    else:
        return FAIL

function BacktrackRoutine(s, P, H, F):
    while H is not empty:
        pop last (s_prev, a, s) from H
        F ← F ∪ {(s_prev, a)}
        generate inverse action a⁻¹ via Executor
        replay a⁻¹ in environment
        ok ← TrackerBack(s_prev)
        if ok:
            P ← Planner(s_prev, G, F)
            return DFSExec(s_prev, P, H, F)
    return FAIL
```
At each failure, the offending transition $(s, a)$ is pruned from future search by appending it to $F$. Recovery triggers the Planner to generate an updated plan that preserves achieved subtasks.
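The protocol above can be rendered as runnable Python over a toy environment. The environment table, inverse-action map, and replanning stub are assumptions introduced for illustration; only the control flow (fail, pop history, prune into $F$, inverse replay, replan, resume DFS) mirrors the pipeline described here.

```python
# Illustrative rendering of DFSExec/BacktrackRoutine over a toy
# deterministic environment. Names and transitions are made up.

ENV = {  # "dead_end" marks a branch that must be abandoned
    ("start", "a1"): "dead_end",
    ("start", "a2"): "goal",
}
INVERSE = {"a1": "undo_a1", "a2": "undo_a2"}

def planner(state, failures):
    """Replanning stub: propose non-failed actions from this state."""
    return [a for (s, a) in ENV if s == state and (s, a) not in failures]

def dfs_exec(state, plan, history, failures):
    if state == "goal":
        return "SUCCESS"
    if plan:                                   # CONTINUE: take next planned action
        action = plan[0]
        nxt = ENV[(state, action)]
        history.append((state, action, nxt))
        return dfs_exec(nxt, plan[1:], history, failures)
    if history:                                # BACKTRACK
        return backtrack(history, failures)
    return "FAIL"

def backtrack(history, failures):
    while history:
        prev, action, _cur = history.pop()
        failures.add((prev, action))           # prune the failed transition
        inverse = INVERSE[action]              # inverse replay would run here
        # Tracker (backtrack mode) would validate restoration of `prev`;
        # this toy environment assumes restoration always succeeds.
        new_plan = planner(prev, failures)
        if new_plan:
            return dfs_exec(prev, new_plan, history, failures)
    return "FAIL"

failures = set()
result = dfs_exec("start", ["a1"], [], failures)
print(result, failures)  # SUCCESS {('start', 'a1')}
```

Note how the run first commits to the failing branch `a1`, records it in the failure set during backtracking, and then succeeds via the replanned action `a2`, which is exactly the prune-then-replan behavior the protocol specifies.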
4. Empirical Performance on the OSWorld Benchmark
BEAP-Agent is systematically benchmarked on OSWorld (Xie et al., NeurIPS 2024), which provides 369 desktop tasks sampled from OS operations, office, daily apps, professional software, and multi-app workflows. The standardized input consists of 256×256 screenshots, with up to 50 GUI actions permitted per agent.
Comparative Results Table
| Framework | Steps | Accuracy |
|---|---|---|
| Agent S2 (GPT-4o+UI-TARS-72B) | 50 | 26.6% |
| JEDI (GPT-4o+JEDI-7B) | 50 | 25.0% |
| UI-TARS (1.5-7B) | 50 | 24.0% |
| AGUVIS (GPT-4o+AGUVIS-72B) | — | 17.0% |
| Qwen2.5 (Qwen2.5-vl-72B) | — | 8.8% |
| OpenAI (GPT-4o alone) | — | 5.0% |
| BEAP-Agent | 50 | 28.2% |
| — w/o Backtrack | 50 | 26.3% |
| — w/o Tracker | 50 | 23.6% |
Backtracking-specific metrics for BEAP-Agent:
- Backtracking Task Rate: 35.8% of tasks required at least one backtrack.
- Backtrack Success Rate: 65.5% successful restoration of valid ancestors per attempt.
- Average Backtrack Steps: 2.72.
- Domain-wise results indicate pronounced gains on "chrome" and "workflow" tasks, where the backtrack success rate exceeds 80%.
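The three backtracking metrics above can be computed from per-task execution logs. The log schema below is a made-up illustration (the paper does not specify its logging format), but the metric definitions follow the text: task rate over tasks, success rate and average steps over backtrack attempts.

```python
# Hypothetical per-task logs: backtrack attempts, successful ancestor
# restorations, and total backtrack steps. Values are illustrative.
logs = [
    {"attempts": 0, "restored": 0, "steps": 0},
    {"attempts": 2, "restored": 1, "steps": 5},
    {"attempts": 1, "restored": 1, "steps": 3},
]

n_tasks = len(logs)
total_attempts = sum(t["attempts"] for t in logs)

# Backtracking Task Rate: fraction of tasks with at least one backtrack.
task_rate = sum(1 for t in logs if t["attempts"] > 0) / n_tasks
# Backtrack Success Rate: valid-ancestor restorations per attempt.
success_rate = sum(t["restored"] for t in logs) / total_attempts
# Average Backtrack Steps, averaged over attempts.
avg_steps = sum(t["steps"] for t in logs) / total_attempts

print(round(task_rate, 3), round(success_rate, 3), round(avg_steps, 3))
```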
5. Strengths, Limitations, and Prospects
BEAP-Agent’s recursive DFS formulation with integrated Planner–Executor–Tracker architecture confers several advantages relative to prior GUI agent frameworks:
- Systematic multi-level backtracking supports long-horizon recovery of complex tasks.
- The closed triadic loop maintains grounded execution and context-aware updating, facilitating adaptive task management.
- Quantitative gains are demonstrated in ablations, e.g., 28.2% accuracy for the full system versus 26.3% without backtracking and 23.6% without tracking.
Limitations pertain to perceptual challenges: even with GPT-4o, fine-grained UI element inference may fail, and smaller specialist models can supersede general-purpose agents in constrained subtasks. The backtracking protocol, relying on inverse replay and stack snapshotting, is vulnerable to repeated failures when state restoration is imperfect.
Future directions include model hybridization (e.g., vision-language models paired with specialized perception modules), integration of learned heuristics for branch prioritization, and reward-shaping interventions to mitigate sparse and delayed reward signals. This suggests further breadth in integrating state-patching techniques and potentially moving beyond inverse replay for restoration.
6. Contextual Significance
BEAP-Agent establishes a systematic, high-accuracy approach to GUI agent planning, execution, and recovery, filling a notable gap in the literature on automated desktop task exploration. By leveraging DFS with formalized multi-level backtracking, the framework achieves state-of-the-art OSWorld benchmark results and codifies a rigorous architectural template for future development in adaptive GUI automation agents (Lu et al., 29 Jan 2026). A plausible implication is the increasing relevance of principled search-based planning and robust context tracking in scaling interface agents towards generalized real-world applicability.