Benchmarking and Improving GUI Agents in High-Dynamic Environments

Published 28 Apr 2026 in cs.CV | (2604.25380v1)

Abstract: Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process, where the key GUI state including important information for actions is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by important interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the interaction process as input and consists of three components: a dynamic perceiver, a refinement strategy, and a reflection. Specifically, the dynamic perceiver clusters frames of the GUI video, generates captions for the centroids, and iteratively selects the most informative frames as the salient dynamic context. Considering that there may be inconsistencies and noise between the selected frames and the textual context of the agent, the refinement strategy employs an action-conditioned filtering to refine thoughts to mitigate thought-action inconsistency and redundancy. Based on the refined agent trajectories, the reflection module provides effective and accurate guidance for further actions. Experiments on DynamicGUIBench demonstrate that DynamicUI significantly improves the performance in dynamic GUI environments, while maintaining competitive performance on other public benchmarks.


Authors (8)

Summary

The paper introduces DynamicUI, a novel VLM agent architecture that uses video-based observation and adaptive frame selection to address partial observability in dynamic GUI tasks.
It outlines DynamicGUIBench, a POMDP-focused benchmark that captures challenges from transient UI states and dynamic content.
Empirical evaluations show that DynamicUI significantly boosts task accuracy, demonstrating the advantages of temporal context and reflection in GUI automation.


Motivation and Problem Setting

This paper addresses the performance degradation of vision-language model (VLM)-based GUI agents in high-dynamic environments, where relying solely on single-step screenshots turns the agent’s decision problem into a Partially Observable Markov Decision Process (POMDP). In such settings, salient state changes such as ephemeral pop-ups, transient notifications, or dynamic content updates frequently occur between agent actions and cannot always be reconstructed from static observations. The paper argues that the usual assumption of full observability, even in benchmarks purportedly designed to stress robustness, fails to reflect the operational complexity and partial observability intrinsic to real-world GUI automation scenarios.
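The observability gap described above can be illustrated with a toy timeline. This is only a sketch: the `Frame` type, the element names, and the timestamps are hypothetical and do not come from the paper. It compares what a post-action screenshot sees with what a recording of the whole interval sees.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    t_ms: int            # time since the last action, in milliseconds
    visible: frozenset   # names of UI elements on screen at that time

def record_interval():
    """Hypothetical frames captured between two consecutive agent actions."""
    base = {"inbox", "toolbar"}
    toast = "toast: verification code 4821"
    return [
        Frame(0, frozenset(base)),
        Frame(200, frozenset(base | {toast})),
        Frame(400, frozenset(base | {toast})),
        Frame(600, frozenset(base)),  # the toast has already disappeared
    ]

def screenshot_observation(frames):
    """Conventional agent: observes only the final frame after the action."""
    return set(frames[-1].visible)

def video_observation(frames):
    """Video-based agent: observes everything shown during the interval."""
    seen = set()
    for frame in frames:
        seen |= frame.visible
    return seen

frames = record_interval()
transient = video_observation(frames) - screenshot_observation(frames)
print(transient)  # {'toast: verification code 4821'}
```

In this toy case the verification code exists only in the interstitial frames, so an agent limited to the final screenshot has no way to recover it.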

DynamicGUIBench: A POMDP-Focused Benchmark

The authors present DynamicGUIBench, an online benchmark for evaluating GUI agents across 149 tasks spanning ten application domains, with explicit emphasis on partially observable and dynamic contexts. Tasks are constructed so that critical state information is available only through interstitial states, that is, states that appear between two discrete observations and can therefore be invisible to agents that rely on conventional post-action screenshots. The benchmark taxonomy covers four principal categories of partial observability:

Interruptive UI States: Event-driven interruptions (e.g., system prompts or security dialogs).
Ephemeral Reference: Tasks requiring reference to or manipulation based on fleeting UI states that disappear before the next step.
Dynamic List Selection: Selection among candidate items that dynamically re-order or update (e.g., email inboxes, search results).
Content-triggered Interaction: Tasks where action triggers are short-lived or context-specific semantic signals.

DynamicGUIBench offers detailed annotations, offline reproducible interaction traces, and interactive scenarios, which are calibrated to capture temporal dependencies unaddressed in existing static or anomaly-focused datasets.

DynamicUI: Architecture for High-Dynamic GUI Environments

To address the limitations of screenshot-based observation, the authors introduce DynamicUI, a VLM-based agent that ingests video-based interaction histories rather than individual screenshots. The framework comprises three cooperative modules:

Dynamic Perceiver

The dynamic perceiver encodes the visual trajectory using a pretrained visual encoder. The sequence of frames is clustered to extract representative centroids that encapsulate distinct task phases, minimizing redundancy and focusing attention on the most informative moments. Captions are auto-generated for these centroids, and a second VLM assesses their relevance and confidence relative to the task instruction. Only clusters surpassing strict relevance and confidence thresholds are exposed to the agent.

This mechanism is shown to significantly outperform static and uniformly-sampled frame histories, especially in tasks dependent on fleeting context, by adaptively selecting frames likely to carry decisive information.
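The cluster-then-filter idea can be sketched in a few lines. Several things here are assumptions rather than details from the paper: frame embeddings are taken as given 2-D tuples, k-means with deterministic farthest-point seeding stands in for the unspecified clustering method, `score_fn` is a hypothetical stand-in for the second VLM, and the 0.7 thresholds are arbitrary.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding: start from frame 0, then repeatedly add the
    frame farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def assign(points, centers):
    """Index of the nearest center for every frame embedding."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j])) for p in points]

def kmeans(points, k, iters=10):
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign(points, centers), centers

def representative_frames(points, k):
    """For each cluster, pick the index of the frame closest to its centroid."""
    labels, centers = kmeans(points, k)
    reps = []
    for j in range(k):
        members = [i for i, l in enumerate(labels) if l == j]
        if members:
            reps.append(min(members, key=lambda i: math.dist(points[i], centers[j])))
    return sorted(reps)

def select_keyframes(reps, score_fn, min_relevance=0.7, min_confidence=0.7):
    """Keep representatives whose (relevance, confidence) clear both thresholds."""
    return [i for i in reps
            if score_fn(i)[0] >= min_relevance and score_fn(i)[1] >= min_confidence]

# Hypothetical embeddings: idle screen (0-2), one-frame pop-up (3), result page (4-6).
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0),
              (10.0, 0.0), (10.0, 0.1), (10.1, 0.0)]
# Hypothetical (relevance, confidence) scores a judging model might return.
fake_scores = {0: (0.2, 0.9), 3: (0.95, 0.9), 4: (0.8, 0.85)}

reps = representative_frames(embeddings, k=3)
print(reps)                                      # [0, 3, 4]
print(select_keyframes(reps, fake_scores.get))   # [3, 4]
```

Because the one-frame pop-up forms its own cluster, it survives as a representative even though it accounts for a small fraction of the recording; the idle-screen representative is then dropped by the relevance threshold.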

Refinement Strategy

To mitigate thought-action inconsistencies and the accumulation of error-prone interaction histories, which are particularly acute in long-horizon POMDP settings, a VLM revises both action plans and textual rationales based on observed action outcomes and neighboring visual context. This step targets cases where semantic alignment between the agent’s internal deliberation and its executed actions degrades, a common failure mode in existing agentic pipelines.
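A rough way to picture action-conditioned filtering is shown below. This is not the authors' method: the paper uses a VLM for this step, whereas the sketch substitutes a simple keyword-overlap rule, and the `Step` type and the `verb(target)` action format are hypothetical. It keeps only thought sentences that mention the executed action's target and drops sentences already seen earlier in the trajectory.

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str   # free-text rationale produced by the agent
    action: str    # executed action in a hypothetical "verb(target)" format

def action_target(action):
    """Extract the argument of a `verb(target)` action string, lowercased."""
    return action[action.index("(") + 1 : action.rindex(")")].lower()

def refine(steps):
    """Keyword-overlap stand-in for the VLM-based refinement step."""
    refined, seen = [], set()
    for step in steps:
        target = action_target(step.action)
        kept = []
        for sentence in (s.strip() for s in step.thought.split(".")):
            key = sentence.lower()
            # Keep a sentence only if it concerns the executed action
            # and has not appeared earlier in the trajectory.
            if sentence and target in key and key not in seen:
                kept.append(sentence)
                seen.add(key)
        refined.append(Step(". ".join(kept), step.action))
    return refined

trajectory = [
    Step("I should open the Drafts folder. The Send button is visible", "click(Send)"),
    Step("The Send button is visible. A dialog asks to confirm", "click(Confirm)"),
    Step("A dialog asks to confirm. Confirm is highlighted", "click(Confirm)"),
]
for step in refine(trajectory):
    print(step.action, "<-", step.thought)
```

The first step loses a sentence that contradicts the executed action, and the later steps lose sentences that merely repeat earlier history, which are the two failure modes the refinement strategy is meant to address.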

Reflection Module

The reflection module introduces auxiliary feedback to prevent drift in the agent’s decision process. A separate VLM, distinct from that used for stepwise execution, leverages the selected keyframes, action histories, and textual rationales to generate corrective guidance. This module reassesses the agent’s progress, identifies possible errors, and produces actionable next-step recommendations, which demonstrably improve long-horizon task completion and error recovery.
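The inputs to such a reflection step can be sketched as a prompt-assembly function. Everything here is illustrative: the prompt wording, the function names, and the `reflector` callable (a stand-in for the separate VLM) are assumptions, not the paper's interface.

```python
def build_reflection_prompt(task, keyframe_captions, actions, rationales):
    """Combine selected keyframes, action history, and rationales into one prompt."""
    lines = [f"Task: {task}", "Keyframes:"]
    lines += [f"  [{i}] {caption}" for i, caption in enumerate(keyframe_captions)]
    lines.append("History:")
    for i, (action, rationale) in enumerate(zip(actions, rationales), start=1):
        lines.append(f"  step {i}: {action} | rationale: {rationale}")
    lines.append("Assess progress, list likely errors, and suggest the next step.")
    return "\n".join(lines)

def reflect(reflector, **context):
    """Call a reflection model that is separate from the acting model."""
    return reflector(build_reflection_prompt(**context))

def stub_reflector(prompt):
    """Placeholder for a real VLM call; returns canned guidance."""
    return "The confirmation dialog was dismissed too early; reopen it."

guidance = reflect(
    stub_reflector,
    task="Send the drafted email",
    keyframe_captions=["Confirmation dialog appears", "Inbox after dialog closes"],
    actions=["click(Send)", "click(Cancel)"],
    rationales=["Send the draft", "Close the pop-up"],
)
print(guidance)
```

Passing the reflector in as a callable keeps the acting model and the reflecting model decoupled, which mirrors the separation described above.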

Empirical Evaluation

DynamicUI is evaluated on DynamicGUIBench and OSWorld. Across both proprietary and open-weight base models (e.g., Qwen3-VL-8B, UI-TARS-1.5-7B), DynamicUI yields marked improvements, raising average task accuracy from 15.1% to 22.1% on DynamicGUIBench and exhibiting domain-agnostic gains on OSWorld. Notably, DynamicUI is the only agent with non-zero performance on the Thunderbird domain (44.4% accuracy vs. 0% for all baselines), which supports its robustness to temporal state perturbations.
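To put the headline numbers in perspective, the absolute and relative gains follow directly from the two reported averages. The short calculation below uses only those two figures.

```python
baseline, improved = 15.1, 22.1  # average task accuracy (%) on DynamicGUIBench

absolute_gain = improved - baseline            # percentage points
relative_gain = absolute_gain / baseline * 100 # percent of the baseline

print(f"{absolute_gain:.1f} points absolute, {relative_gain:.1f}% relative")
# 7.0 points absolute, 46.4% relative
```

The relative gain is large because the baseline is low; in absolute terms, most tasks remain unsolved at 22.1%.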

Ablation studies show that the dynamic perceiver and reflection module are primary contributors to performance increases, with the refinement strategy providing incremental improvement. Comparative analysis also reveals that naive uniform frame sampling, even with increased frame counts, is ineffective relative to the intelligent frame selection implemented in the dynamic perceiver.
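One intuition for why uniform sampling can fail is that a short-lived state may fall between sample points even after the frame budget is doubled. The toy check below uses made-up numbers (a 100-frame recording with a pop-up on frames 44 to 48) and is not drawn from the paper's experiments.

```python
def uniform_indices(total_frames, n):
    """Indices of n evenly spaced frames from a recording of total_frames."""
    return [(i * total_frames) // n for i in range(n)]

def captures(indices, event_frames):
    """True if at least one sampled frame falls inside the event window."""
    return any(i in event_frames for i in indices)

total = 100
popup = range(44, 49)  # hypothetical pop-up visible on frames 44..48

for n in (8, 16):
    print(n, captures(uniform_indices(total, n), popup))
# 8 False
# 16 False
```

Whether a uniform sampler hits the event depends on how the window happens to align with the sampling grid, whereas content-based selection does not depend on that alignment.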

Category-level breakdowns indicate that DynamicUI is particularly effective for tasks involving explicit state interruptions and list dynamics. Ephemeral references and content-triggered interactions also improve, but they remain open research problems because reconstructing high-frequency context is intrinsically difficult.

Implications and Future Directions

The introduction of DynamicGUIBench exposes the limitations of extant GUI agents and benchmarks, advocating for an explicit shift toward evaluation paradigms incorporating POMDP-style interaction, where full state observability cannot be assumed. The DynamicUI agent demonstrates that video-based context and adaptive history selection can significantly enhance robustness in high-dynamic interfaces, making the case for more temporally-aware and reflection-capable vision-language pipelines.

In practical terms, the findings are directly relevant for any automation pipeline where non-deterministic UI perturbations are prevalent, such as enterprise RPA, accessibility tools, or consumer desktop assistants. Theoretically, the framework opens research directions in task-oriented multimodal temporal reasoning and agent design under acute partial observability, with further room for improvement in memory mechanisms, meta-cognition, and goal-aligned long-horizon planning.

Conclusion

This paper quantifies the deficit of screenshot-based agents in dynamic environments and makes a case for temporally structured, video-based perception in GUI automation. DynamicUI’s architecture, which combines dynamic context selection, thought-action consistency checking, and reflection-based correction, improves agent robustness in POMDP regimes and motivates future research into integrating rich temporal evidence and meta-reasoning into multimodal agentic systems.
