VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Published 23 Apr 2026 in cs.CL, cs.AI, and cs.SE | (2604.21375v2)

Abstract: Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.

Authors (14)

Summary

The paper introduces VLAA-GUI, a modular framework that combines a dual-layer Completeness Verifier, a three-tier Loop Breaker, and an on-demand Search Agent to ensure task success.
It achieves a 77.5% success rate on OSWorld-Verified and surpasses human performance in single-pass evaluations, even under constrained action budgets.
Ablation studies confirm that the framework significantly reduces false completions and repetitive steps, highlighting its efficient performance in complex GUI tasks.

Introduction to VLAA-GUI

The paper presents VLAA-GUI, a modular framework aimed at overcoming critical challenges faced by autonomous GUI agents, particularly premature task completion and repetitive action loops. These challenges have persisted despite advances in multimodal LLMs (MLLMs), which underpin the latest generation of GUI agents designed for desktop task execution. VLAA-GUI introduces three key components to address these issues: a Completeness Verifier ensuring UI-observable success criteria, a Loop Breaker that manages and recovers from repetitive loops, and a Search Agent that queries for procedural knowledge when unfamiliar workflows are encountered.

Figure 1: Overview of VLAA-GUI. The Manager Agent decides the overall plan, deploying mandatory tools such as the Loop Breaker and Completeness Verifier, alongside on-demand tools like the Search Agent.

Modular Architecture

At the heart of VLAA-GUI is the Manager Agent, which runs a perceive-reason-act loop supported by a suite of tools. The Completeness Verifier and Loop Breaker are mandatory. The Completeness Verifier uses a dual-layer approach: a Completion Gate built into the Manager's prompt requires UI-observable success criteria to be checked at every finish step, while an independent agent-level verifier cross-examines each completion claim and rejects those lacking direct visual evidence. The Loop Breaker applies three tiers of filtering: it switches interaction mode after repeated failures, forces a strategy change when the same screen state keeps recurring, and binds reflection signals to strategy shifts.
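The three Loop Breaker tiers can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class name `LoopBreaker`, the thresholds, the directive strings, and the idea of hashing screenshots into a `screen_hash` are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoopBreaker:
    """Hypothetical three-tier loop detector; thresholds are illustrative."""
    max_action_repeats: int = 3   # assumed tier-1 threshold
    max_state_repeats: int = 3    # assumed tier-2 threshold
    failed_actions: Counter = field(default_factory=Counter)
    seen_states: Counter = field(default_factory=Counter)

    def check(self, action: str, action_failed: bool, screen_hash: str,
              reflection_flags_no_progress: bool) -> Optional[str]:
        """Return a directive for the Manager, or None if no loop is detected."""
        if action_failed:
            self.failed_actions[action] += 1
        self.seen_states[screen_hash] += 1

        # Tier 3: a reflection signal is bound to a strategy shift.
        if reflection_flags_no_progress:
            return "change_strategy"
        # Tier 2: the same screen state keeps recurring.
        if self.seen_states[screen_hash] >= self.max_state_repeats:
            return "change_strategy"
        # Tier 1: the same action keeps failing, so try another modality.
        if self.failed_actions[action] >= self.max_action_repeats:
            return "switch_interaction_mode"
        return None


# Usage: the same click fails three times on three different screens.
breaker = LoopBreaker()
directives = [
    breaker.check("click(save)", True, f"screen-{i}", False) for i in range(3)
]
```

In this sketch the stronger directive wins when several tiers fire at once, which is one possible ordering; the source does not state how the tiers are prioritized.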

Figure 2: Advantages of VLAA-GUI. The framework achieves top results on OSWorld-Verified and surpasses human performance.

The Manager Agent also invokes on-demand tools when needed: the Search Agent, which retrieves procedural knowledge for unfamiliar workflows by querying a search-capable LLM and returns the result as plain text; the Coding Agent, for code-intensive actions; and the Grounding Agent, for precise action grounding. These tools help the system adapt to complex and unfamiliar tasks.
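How a mandatory finish check and an on-demand search call might fit around the Manager's loop is sketched below. The function `run_episode`, its callable parameters, the action tuple format, and the default step budget are hypothetical names and values chosen for illustration; the paper's actual interfaces are not given in this summary.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (kind, payload), e.g. ("click", "Save")


def run_episode(
    policy: Callable[[str, List[str]], Action],
    execute: Callable[[Action], str],
    verify_completion: Callable[[str], bool],
    search: Callable[[str], str],
    first_observation: str,
    max_steps: int = 15,  # illustrative step budget
) -> bool:
    """Hypothetical Manager loop: returns True only on a verified finish."""
    observation = first_observation
    notes: List[str] = []  # plain-text hints, e.g. search results
    for _ in range(max_steps):
        kind, payload = policy(observation, notes)
        if kind == "finish":
            # Mandatory gate: a completion claim needs visible evidence.
            if verify_completion(observation):
                return True
            notes.append("finish rejected: no visible evidence of success")
        elif kind == "search":
            # On-demand tool: the result comes back as plain text.
            notes.append(search(payload))
        else:
            observation = execute((kind, payload))
    return False


def demo_policy(observation: str, notes: List[str]) -> Action:
    """Toy policy: claims success too early, then searches, then acts."""
    if observation == "dialog closed, file saved":
        return ("finish", "")
    if not notes:
        return ("finish", "")  # premature claim, rejected by the gate
    if len(notes) == 1:
        return ("search", "how to save")
    return ("click", "Save")


result = run_episode(
    policy=demo_policy,
    execute=lambda action: "dialog closed, file saved",
    verify_completion=lambda obs: "file saved" in obs,
    search=lambda query: "Use File > Save",
    first_observation="editor open",
)
```

In the toy run, the first finish claim is rejected because the screen shows no evidence of a saved file; the policy then searches, performs the click, and only the second finish claim passes the gate.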

Experimental Validation

VLAA-GUI was evaluated on two benchmarks covering Linux and Windows tasks. It achieved a 77.5% success rate on OSWorld-Verified with the Opus 4.6 backbone, surpassing human performance (72.4%) in a single pass, a first for GUI automation frameworks. Earlier systems had lagged behind the human baseline.

On WindowsAgentArena, the system achieved an overall success rate of 61.0%, again outperforming leading systems by more than 4%. The paper also examines efficiency: even with a budget of only 15 steps, VLAA-GUI surpassed the best results previously reported with 50 steps.

Figure 3: The Completeness Verifier reduces false completion rates and the Loop Breaker diminishes loop incidence and wasted steps on OSWorld.

Component Analysis

Ablation studies isolate the contribution of each component. The Completeness Verifier lowered the false completion rate, reducing misjudged terminations across step budgets. The Loop Breaker reduced loop incidence and nearly halved wasted steps for loop-prone models. The Search Agent supplied procedural knowledge that helped in unfamiliar workflows. All three components consistently improved a strong backbone, while a weaker backbone benefited more when the step budget was sufficient.
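The two quantities discussed above can be made concrete with a small Python sketch. The definitions here are assumptions, since this summary does not give the paper's exact formulas: the false completion rate is taken as the share of finish claims on tasks the checker judged unsolved, and the wasted step ratio as the share of all steps flagged as part of a loop. The `Episode` record and the sample values are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    declared_done: bool  # agent issued a finish action
    succeeded: bool      # benchmark checker judged the task solved
    total_steps: int
    loop_steps: int      # steps flagged as part of a repetitive loop


def false_completion_rate(episodes: List[Episode]) -> float:
    """Assumed definition: failed tasks among episodes that claimed success."""
    declared = [e for e in episodes if e.declared_done]
    if not declared:
        return 0.0
    return sum(not e.succeeded for e in declared) / len(declared)


def wasted_step_ratio(episodes: List[Episode]) -> float:
    """Assumed definition: loop steps as a share of all steps taken."""
    total = sum(e.total_steps for e in episodes)
    if total == 0:
        return 0.0
    return sum(e.loop_steps for e in episodes) / total


# Invented sample data, for illustration only.
episodes = [
    Episode(True, True, 10, 0),
    Episode(True, False, 20, 10),
    Episode(False, False, 20, 5),
    Episode(True, True, 10, 0),
    Episode(True, False, 20, 5),
]
fcr = false_completion_rate(episodes)
wsr = wasted_step_ratio(episodes)
```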

Figure 4: Impact of the Completeness Verifier and Search Agent. Benefits vary between efficient models like Sonnet 4.6 and less effective ones like Gemini 3 Flash.

Case Study

To illustrate the practical effectiveness of VLAA-GUI, a detailed case study was performed on OSWorld, involving a task requiring color changes in LibreOffice Impress. The system encountered and overcame multiple failures, with the Completeness Verifier preventing premature task completion and the Search Agent offering procedural insights that adjusted the workflow effectively.

Figure 5: Case study highlights interaction between system components during task execution, overcoming initial failures.

Conclusion

VLAA-GUI addresses two key limitations of autonomous GUI agents, premature completion and repetitive loops, and its modular framework reaches human-level performance on complex desktop tasks. The Completeness Verifier, Loop Breaker, and Search Agent are the main contributions that overcome these challenges. The paper suggests that further work on memory mechanisms and planning strategies could refine autonomous agents by combining efficiency with robustness.
