- The paper introduces VLAA-GUI, a modular framework that combines a dual-layer Completeness Verifier, a three-tier Loop Breaker, and an on-demand Search Agent to ensure task success.
- It achieves a 77.5% success rate on OSWorld-Verified and surpasses human performance in single-pass evaluations, even under constrained action budgets.
- Ablation studies confirm that the framework significantly reduces false completions and repetitive steps, highlighting its efficient performance in complex GUI tasks.
VLAA-GUI: Knowing When to Stop, Recover, and Search: A Modular Framework for GUI Automation
Introduction to VLAA-GUI
The paper presents VLAA-GUI, a modular framework aimed at two failure modes that autonomous GUI agents still exhibit: declaring a task complete prematurely and getting stuck in repetitive action loops. Both persist despite advances in multimodal LLMs (MLLMs), which underpin the latest generation of GUI agents designed for desktop task execution. VLAA-GUI introduces three key components to address them: a Completeness Verifier that checks completion claims against UI-observable success criteria, a Loop Breaker that detects and recovers from repetitive loops, and a Search Agent that looks up procedural knowledge when the agent encounters an unfamiliar workflow.
Figure 1: Overview of VLAA-GUI. The Manager Agent decides the overall plan, deploying mandatory tools such as the Loop Breaker and Completeness Verifier, alongside on-demand tools like the Search Agent.
Modular Architecture
At the heart of VLAA-GUI is the Manager Agent, which functions in a perceive-reason-act loop, supported by a suite of tools. The Completeness Verifier and Loop Breaker are engaged after every action to ensure accurate task execution. The Completeness Verifier uses a dual-layer approach: a Completion Gate integrated into the Manager's prompt mandates self-verification at each step, while an independent verifier cross-checks completion claims. The Loop Breaker, on the other hand, employs a three-tiered system to detect and resolve repetitive action loops by switching interaction modalities and strategies.
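The escalation idea behind the Loop Breaker can be illustrated with a minimal sketch. The summary does not specify what each tier does or when it fires, so the tier names (`VARY_ACTION`, `SWITCH_MODALITY`, `CHANGE_STRATEGY`), the repeat thresholds, and the history window below are all hypothetical choices made for illustration, not the paper's implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Deque, Tuple


class Tier(IntEnum):
    """Escalation levels; names are hypothetical, not taken from the paper."""
    NONE = 0             # no loop detected
    VARY_ACTION = 1      # assumed tier 1: ask the agent to vary the action
    SWITCH_MODALITY = 2  # assumed tier 2: e.g. switch from mouse to keyboard
    CHANGE_STRATEGY = 3  # assumed tier 3: abandon the current plan


@dataclass
class LoopBreaker:
    # Assumed repeat counts at which each tier fires.
    thresholds: Tuple[int, int, int] = (2, 3, 4)
    # Assumed number of recent actions to remember.
    window: int = 8
    history: Deque[str] = field(default_factory=deque)

    def observe(self, action: str) -> Tier:
        """Record an action and return the escalation tier it triggers."""
        self.history.append(action)
        while len(self.history) > self.window:
            self.history.popleft()

        # Count how many times the latest action repeats consecutively.
        repeats = 0
        for past in reversed(self.history):
            if past != action:
                break
            repeats += 1

        # Pick the highest tier whose threshold has been reached.
        tier = Tier.NONE
        tiers = (Tier.VARY_ACTION, Tier.SWITCH_MODALITY, Tier.CHANGE_STRATEGY)
        for candidate, threshold in zip(tiers, self.thresholds):
            if repeats >= threshold:
                tier = candidate
        return tier
```

This sketch only catches consecutive identical actions; a real detector would likely also need to catch longer cycles (for example, two actions alternating) and near-duplicate actions such as clicks a few pixels apart.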
Figure 2: Advantages of VLAA-GUI. The framework achieves top results on OSWorld-Verified and surpasses human performance.
The Manager Agent also utilizes on-demand tools when necessary, including the Search Agent, which provides essential procedural knowledge via real-time queries to LLMs, and the Coding Agent for programmatic actions. These tools enhance the system's adaptability to complex and unfamiliar tasks.
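The split between mandatory checks and on-demand tools can be sketched as a single decision function. Everything here is an illustrative assumption: `decide_step`, `StepDecision`, and the callable parameters are hypothetical names, and the real system's interfaces are not described in this summary. The sketch shows only the control flow: a task counts as finished when the Manager's own claim and an independent check agree, and the search tool is called only when the Manager asks for help.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepDecision:
    finished: bool       # True only when both verification layers agree
    hint: Optional[str]  # procedural knowledge from the search tool, if requested


def decide_step(
    claims_done: bool,
    needs_help: bool,
    independent_verifier: Callable[[], bool],
    search_tool: Callable[[str], str],
    query: str = "",
) -> StepDecision:
    """Hypothetical per-step decision; not the paper's actual interface."""
    # Layer 1: the Manager's own completion claim (the Completion Gate).
    # Layer 2: an independent check, consulted only when layer 1 says "done".
    finished = claims_done and independent_verifier()

    # On-demand tool: invoked only when the Manager flags an unfamiliar
    # workflow and the task is not already finished.
    hint = search_tool(query) if (needs_help and not finished) else None
    return StepDecision(finished=finished, hint=hint)
```

Because of short-circuit evaluation, the independent verifier runs only on steps where the Manager claims completion, which keeps the second layer cheap in this sketch.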
Experimental Validation
VLAA-GUI was evaluated on multiple benchmarks. With the Opus 4.6 backbone, it achieved a 77.5% success rate on OSWorld-Verified, surpassing human performance in single-pass evaluations, reportedly a first for GUI automation frameworks. Earlier systems had lagged behind human performance on this benchmark.
The system was also tested on WindowsAgentArena, achieving an overall success rate of 61.0% and again outperforming leading systems by more than 4%. The paper further examines the efficiency of VLAA-GUI's design: even with a budget of only 15 action steps, it surpassed the best previously reported results obtained with 50-step budgets.
Figure 3: The Completeness Verifier reduces false completion rates and the Loop Breaker diminishes loop incidence and wasted steps on OSWorld.
Component Analysis
Ablation studies in the paper isolate the contribution of each component. The Completeness Verifier lowered the false completion rate, reducing misjudged terminations across step budgets. The Loop Breaker reduced both loop incidence and the number of wasted steps, proving particularly beneficial for models with a tendency to repeat actions. The Search Agent, in turn, supplied procedural knowledge that helped in challenging workflows.
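To make the ablation metrics concrete, the sketch below computes them from episode logs. The definitions are assumptions, since this summary does not give the paper's exact formulas: false completion rate is taken as the share of "done" claims that the evaluator rejects, a loop is taken as a run of at least `min_run` identical consecutive actions, and every action in such a run after the first counts as wasted. The `Episode` record and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    declared_done: bool  # the agent emitted a completion claim
    succeeded: bool      # the benchmark evaluator marked the task solved
    actions: List[str]   # executed actions, in order


def false_completion_rate(episodes: List[Episode]) -> float:
    """Share of completion claims that the evaluator rejected (assumed definition)."""
    claimed = [e for e in episodes if e.declared_done]
    if not claimed:
        return 0.0
    return sum(not e.succeeded for e in claimed) / len(claimed)


def wasted_steps(actions: List[str], min_run: int = 3) -> int:
    """Count repeats inside runs of >= min_run identical consecutive actions."""
    wasted = 0
    run = 1
    for prev, cur in zip(actions, actions[1:]):
        if cur == prev:
            run += 1
        else:
            if run >= min_run:
                wasted += run - 1
            run = 1
    # Account for a run that extends to the end of the episode.
    if run >= min_run:
        wasted += run - 1
    return wasted


def loop_incidence(episodes: List[Episode], min_run: int = 3) -> float:
    """Share of episodes containing at least one loop (assumed definition)."""
    if not episodes:
        return 0.0
    looped = sum(wasted_steps(e.actions, min_run) > 0 for e in episodes)
    return looped / len(episodes)
```

Comparing these numbers with and without a component enabled, at each step budget, is the kind of comparison an ablation like the one described above would report.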
Figure 4: Impact of the Completeness Verifier and Search Agent. Benefits vary between stronger models like Sonnet 4.6 and weaker ones like Gemini 3 Flash.
Case Study
To illustrate the practical effectiveness of VLAA-GUI, a detailed case study was performed on OSWorld, involving a task requiring color changes in LibreOffice Impress. The system encountered and overcame multiple failures, with the Completeness Verifier preventing premature task completion and the Search Agent offering procedural insights that adjusted the workflow effectively.
Figure 5: Case study highlights interaction between system components during task execution, overcoming initial failures.
Conclusion
VLAA-GUI represents a significant step forward in addressing key limitations of autonomous GUI agents, using its modular framework to achieve human-level performance in complex desktop tasks. The Completeness Verifier, Loop Breaker, and Search Agent each contribute to overcoming premature completion and repetitive loops, paving the way for future advances in GUI automation. The paper suggests that further work on memory mechanisms and planning strategies could refine autonomous agent capabilities, combining efficiency with robustness.