BEAP-Agent: GUI Automation Framework
- BEAP-Agent is a GUI task automation framework that formalizes task execution as recursive tree search over deterministic state-transition graphs.
- It employs a tri-module architecture—Planner, Executor, and Tracker—to dynamically plan tasks and enable robust multi-level backtracking.
- Empirical benchmarks demonstrate that BEAP-Agent achieves 28.2% accuracy on complex desktop tasks, with ablations attributing the gains to its recovery strategies.
BEAP-Agent is a framework for GUI agents that combines depth-first search (DFS)-based exploration with systematic, multi-level backtracking and dynamic task tracking, specifically designed to overcome the lack of robust recovery mechanisms in prior systems. BEAP-Agent formalizes GUI task execution as recursive tree search over a deterministic state-transition graph and implements an architecture comprising Planner, Executor, and Tracker modules. This enables principled handling of long-horizon tasks, rigorous adaptation after erroneous path exploration, and empirical performance gains on challenging desktop automation benchmarks (Lu et al., 29 Jan 2026).
1. Formal Modeling of GUI Task Execution
BEAP-Agent frames GUI task automation as a search process on the tuple $(\mathcal{S}, \mathcal{A}, T)$, where:
- $\mathcal{S}$ is the set of all GUI states (e.g., screenshots plus any internal representations),
- $\mathcal{A}(s)$ is the set of legal actions at state $s$,
- $T: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the deterministic environment transition function such that $s' = T(s, a)$.
Execution is structured as constructing a search tree rooted at the initial state, whose nodes are visited states and whose edges are executed actions.
Transition history is tracked within an explored marker set $E$. For each state $s$, the unexplored outgoing actions are $U(s) = \{a \in \mathcal{A}(s) : (s, a) \notin E\}$. Depth-first recursive traversal expands one action from $U(s)$ at a time, recursing into the child state $T(s, a)$ and returning to $s$ when a branch is exhausted.
Failed transitions $(s, a)$ are stored in the failure set $F$ to avoid future repetition.
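The search formulation above can be sketched in Python. The toy transition graph, state names, and action names below are illustrative assumptions, not part of the paper; only the structure (deterministic transitions, explored set, failure set, depth-first recursion) mirrors the formalization.

```python
# Sketch of the (S, A, T) search model with explored and failure sets.
# The toy state graph and action names are made up for illustration.

# Deterministic transition function T: (state, action) -> next state.
TRANSITIONS = {
    ("home", "open_menu"): "menu",
    ("menu", "click_settings"): "settings",
    ("menu", "click_broken"): "error",        # a transition that fails
    ("settings", "toggle_dark_mode"): "done",
}

def legal_actions(state):
    """A(s): legal actions available at a state."""
    return [a for (s, a) in TRANSITIONS if s == state]

def dfs(state, goal, explored, failures, path):
    """Depth-first traversal over the deterministic transition graph.

    `explored` marks (state, action) edges already expanded; `failures`
    prunes edges known to lead to bad states.
    """
    if state == goal:
        return path
    for action in legal_actions(state):
        edge = (state, action)
        if edge in explored or edge in failures:
            continue
        explored.add(edge)
        nxt = TRANSITIONS[edge]
        if nxt == "error":                    # failed transition: record and prune
            failures.add(edge)
            continue
        result = dfs(nxt, goal, explored, failures, path + [action])
        if result is not None:
            return result
    return None                               # branch exhausted: caller backtracks

explored, failures = set(), set()
plan = dfs("home", "done", explored, failures, [])
print(plan)  # ['open_menu', 'click_settings', 'toggle_dark_mode']
```

The explored set prevents re-expanding the same edge, and the failure set persists across branches, which is the mechanism the paper uses to avoid repeating failed transitions.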
2. System Architecture: Planner, Executor, Tracker
BEAP-Agent is structured as three interactive modules, each with clearly defined responsibilities:
- Planner: Receives the current state $s$, full task description $G$, and failure set $F$. It outputs a plan $P = (g_1, \dots, g_n)$ of subtask instructions. After backtracking, completed subtasks are preserved, new instructions avoid transitions in $F$, and the remaining plan is regenerated. The process is formalized via pseudocode:

```
function PlanGenerator(s, G, F):
    identify high-level subtasks g_1, ..., g_n for G
    for each subtask g_i:
        if g_i was completed:
            keep g_i unchanged
        else:
            regenerate instructions for g_i, avoiding transitions in F
    return P = (g_1, ..., g_n)
```
- Executor: Given the current state $s$ and active subtask instruction, selects and issues primitive GUI actions using PyAutoGUI. Each executed transition $(s, a, s')$ is appended to the history stack $H$. In backtrack mode, it executes inverse actions $a^{-1}$, restoring previous state snapshots.
- Tracker: Operates in normal and backtrack modes. In normal mode, it monitors progress:
  - Updates subtask statuses by inspecting the current state $s$.
  - Emits event SUCCESS when the task is satisfied, CONTINUE when further progress is feasible, BACKTRACK if the current branch has failed, and FAIL for an unrecoverable outcome.

In backtrack mode, the Tracker validates state restoration, outputting whether the restored state matches the recorded snapshot.
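A minimal sketch of how the three modules might interact in one step of the loop. The class shapes, the toy transition encoding, and the inverse-action table are hypothetical; only the event names (SUCCESS, CONTINUE, BACKTRACK) and the division of responsibilities follow the description above.

```python
# Hypothetical Planner/Executor/Tracker interaction. Event names follow
# the text; class interfaces and the inverse table are assumptions.

# Toy inverses for primitive GUI actions, used in backtrack mode.
INVERSE = {"open_menu": "close_menu", "check_box": "uncheck_box"}

class Tracker:
    def normal(self, state, plan):
        """Emit an event by inspecting the current state against the plan."""
        if not plan:
            return "SUCCESS"
        if state == "stuck":
            return "BACKTRACK"
        return "CONTINUE"

    def back(self, state, snapshot):
        """Backtrack mode: validate that restoration reached the snapshot."""
        return state == snapshot

class Executor:
    def act(self, state, action):
        """Issue a primitive action; the real system calls PyAutoGUI here."""
        return f"{state}/{action}"            # toy deterministic transition

    def invert(self, action):
        return INVERSE.get(action)

tracker, executor = Tracker(), Executor()
state, history = "home", []
for step in ["open_menu", "check_box"]:
    if tracker.normal(state, [step]) == "CONTINUE":
        new_state = executor.act(state, step)
        history.append((state, step, new_state))   # push onto history stack H
        state = new_state
print(state)  # home/open_menu/check_box
```

The history stack recorded by the Executor is what later makes multi-level backtracking possible: each popped entry supplies both the inverse action to replay and the snapshot the Tracker validates against.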
3. Backtracking Protocol: Detection, Pruning, Recovery
BEAP-Agent’s core novelty lies in its systematic, long-range backtracking pipeline, which extends beyond ad hoc or single-step undo. The formal process is:
```
function DFSExec(s, P, H, F):
    e ← TrackerNormal(s, P)
    if e = SUCCESS:
        return SUCCESS
    else if e = CONTINUE:
        a ← Executor(s, P)
        append (s, a, s') to H, mark (s, a) explored
        s ← s'
        return DFSExec(s, P, H, F)
    else if e = BACKTRACK:
        return BacktrackRoutine(s, P, H, F)
    else:
        return FAIL

function BacktrackRoutine(s, P, H, F):
    while H is not empty:
        pop last (s_prev, a, s) from H
        F ← F ∪ {(s_prev, a)}
        generate inverse action a⁻¹ via Executor
        replay a⁻¹ in environment
        ok ← TrackerBack(s_prev)
        if ok:
            P ← Planner(s_prev, G, F)
            return DFSExec(s_prev, P, H, F)
    return FAIL
```
At each failure, the offending transition $(s, a)$ is pruned from future search by appending it to $F$. Recovery triggers the Planner to generate an updated plan that preserves achieved subtasks.
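The protocol above can be rendered as runnable Python over a toy environment. The environment table, inverse-action map, and replanning stub are assumptions introduced for illustration; only the control flow (fail, pop history, prune into $F$, inverse replay, replan, resume DFS) mirrors the pipeline described here.

```python
# Illustrative rendering of DFSExec/BacktrackRoutine over a toy
# deterministic environment. Names and transitions are made up.

ENV = {  # "dead_end" marks a branch that must be abandoned
    ("start", "a1"): "dead_end",
    ("start", "a2"): "goal",
}
INVERSE = {"a1": "undo_a1", "a2": "undo_a2"}

def planner(state, failures):
    """Replanning stub: propose non-failed actions from this state."""
    return [a for (s, a) in ENV if s == state and (s, a) not in failures]

def dfs_exec(state, plan, history, failures):
    if state == "goal":
        return "SUCCESS"
    if plan:                                   # CONTINUE: take next planned action
        action = plan[0]
        nxt = ENV[(state, action)]
        history.append((state, action, nxt))
        return dfs_exec(nxt, plan[1:], history, failures)
    if history:                                # BACKTRACK
        return backtrack(history, failures)
    return "FAIL"

def backtrack(history, failures):
    while history:
        prev, action, _cur = history.pop()
        failures.add((prev, action))           # prune the failed transition
        inverse = INVERSE[action]              # inverse replay would run here
        # Tracker (backtrack mode) would validate restoration of `prev`;
        # this toy environment assumes restoration always succeeds.
        new_plan = planner(prev, failures)
        if new_plan:
            return dfs_exec(prev, new_plan, history, failures)
    return "FAIL"

failures = set()
result = dfs_exec("start", ["a1"], [], failures)
print(result, failures)  # SUCCESS {('start', 'a1')}
```

Note how the run first commits to the failing branch `a1`, records it in the failure set during backtracking, and then succeeds via the replanned action `a2`, which is exactly the prune-then-replan behavior the protocol specifies.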
4. Empirical Performance on the OSWorld Benchmark
BEAP-Agent is systematically benchmarked on OSWorld (Xie et al., NeurIPS 2024), which provides 369 desktop tasks sampled from OS operations, office, daily apps, professional software, and multi-app workflows. The standardized input consists of 256×256 screenshots, with up to 50 GUI actions permitted per agent.
Comparative Results Table
| Framework | Steps | Accuracy |
|---|---|---|
| Agent S2 (GPT-4o+UI-TARS-72B) | 50 | 26.6% |
| JEDI (GPT-4o+JEDI-7B) | 50 | 25.0% |
| UI-TARS (1.5-7B) | 50 | 24.0% |
| AGUVIS (GPT-4o+AGUVIS-72B) | — | 17.0% |
| Qwen2.5 (Qwen2.5-vl-72B) | — | 8.8% |
| OpenAI (GPT-4o alone) | — | 5.0% |
| BEAP-Agent | 50 | 28.2% |
| — w/o Backtrack | 50 | 26.3% |
| — w/o Tracker | 50 | 23.6% |
Backtracking-specific metrics for BEAP-Agent:
- Backtracking Task Rate: 35.8% of tasks required at least one backtrack.
- Backtrack Success Rate: 65.5% successful restoration of valid ancestors per attempt.
- Average Backtrack Steps: 2.72.
- Domain-wise results indicate pronounced gains on "chrome" and "workflow" tasks, where the backtrack success rate exceeds 80%.
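The three backtracking metrics above can be computed from per-task execution logs. The log schema below is a made-up illustration (the paper does not specify its logging format), but the metric definitions follow the text: task rate over tasks, success rate and average steps over backtrack attempts.

```python
# Hypothetical per-task logs: backtrack attempts, successful ancestor
# restorations, and total backtrack steps. Values are illustrative.
logs = [
    {"attempts": 0, "restored": 0, "steps": 0},
    {"attempts": 2, "restored": 1, "steps": 5},
    {"attempts": 1, "restored": 1, "steps": 3},
]

n_tasks = len(logs)
total_attempts = sum(t["attempts"] for t in logs)

# Backtracking Task Rate: fraction of tasks with at least one backtrack.
task_rate = sum(1 for t in logs if t["attempts"] > 0) / n_tasks
# Backtrack Success Rate: valid-ancestor restorations per attempt.
success_rate = sum(t["restored"] for t in logs) / total_attempts
# Average Backtrack Steps, averaged over attempts.
avg_steps = sum(t["steps"] for t in logs) / total_attempts

print(round(task_rate, 3), round(success_rate, 3), round(avg_steps, 3))
```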
5. Strengths, Limitations, and Prospects
BEAP-Agent’s recursive DFS formulation with integrated Planner–Executor–Tracker architecture confers several advantages relative to prior GUI agent frameworks:
- Systematic multi-level backtracking supports long-horizon recovery of complex tasks.
- The closed triadic loop maintains grounded execution and context-aware updating, facilitating adaptive task management.
- Quantitative gains are demonstrated in ablations, e.g., 28.2% accuracy for the full system versus 26.3% without backtracking and 23.6% without tracking.
Limitations pertain to perceptual challenges: even with GPT-4o, fine-grained UI element inference may fail, and smaller specialist models can supersede general-purpose agents in constrained subtasks. The backtracking protocol, relying on inverse replay and stack snapshotting, is vulnerable to repeated failures when state restoration is imperfect.
Future directions include model hybridization (e.g., vision-language models paired with specialized perception modules), integration of learned heuristics for branch prioritization, and reward-shaping interventions to mitigate sparse and delayed reward signals. This suggests further breadth in integrating state-patching techniques and potentially moving beyond inverse replay for restoration.
6. Contextual Significance
BEAP-Agent establishes a systematic, high-accuracy approach to GUI agent planning, execution, and recovery, filling a notable gap in the literature on automated desktop task exploration. By leveraging DFS with formalized multi-level backtracking, the framework achieves state-of-the-art OSWorld benchmark results and codifies a rigorous architectural template for future development in adaptive GUI automation agents (Lu et al., 29 Jan 2026). A plausible implication is the increasing relevance of principled search-based planning and robust context tracking in scaling interface agents towards generalized real-world applicability.