What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

Published 8 Apr 2026 in cs.AI | (2604.06995v1)

Abstract: Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To enhance the understanding and interaction with UIs, we propose an innovative GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal LLMs (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods' mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.

Authors (8)

Summary

The paper introduces the UI-in-the-Loop paradigm that decouples UI element analysis from action prediction to enhance model interpretability.
It presents a modular framework with augmented datasets and Group Relative Policy Optimization to improve UI element localization and semantic alignment.
Empirical results show UILoop-trained models outperform traditional screen-to-action approaches, achieving significant gains in UI comprehension and overall GUI performance.

UI-in-the-Loop: A Paradigm Shift for Multimodal GUI Reasoning

Motivation and Limitations of Screen-to-Action Paradigm

The paper critically examines the prevailing "Screen-to-Action" paradigm in multimodal GUI reasoning and identifies substantial deficiencies in model interpretability and UI element comprehension. Empirical evaluation shows that state-of-the-art MLLMs and GUI agents perform poorly at explicit localization, semantic description, and practical usage of UI elements, with average scores consistently below 0.1 on all three facets. This deficit directly impairs GUI task completion, indicating that correct UI element understanding is necessary for robust interactive agents, while misleading UI information raises failure rates.

Figure 1: Comparative evaluation of UI element localization, semantic function description, practical usage, and relative performance gains with correct/misleading UI info; UILoop outperforms existing Screen-to-Action methods on Android Control-High.

Conceptualization: UI-in-the-Loop (UILoop) Paradigm

To address these limitations, the paper proposes the UI-in-the-Loop (UILoop) paradigm, which reframes GUI reasoning as a cyclic "Screen–UI Elements–Action" process in contrast to the opaque "Screen-to-Action" pipeline. UILoop makes explicit mastery of UI elements a prerequisite for deciding on an action. Decoupling UI element analysis from action prediction improves interpretability: the intermediate element representations make the reasoning traceable and encourage the model to learn interface layout, semantics, and actionable affordances.
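The cyclic process can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_uiloop`, `Step`, and the callables `observe`, `analyze`, `decide`, and `execute` are hypothetical names, and splitting element analysis and action choice into two calls is done here only for readability (a single model response could carry both).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One pass through the loop: the intermediate key elements and the chosen action."""
    key_elements: List[str]
    action: str


def run_uiloop(
    instruction: str,
    observe: Callable[[], str],
    analyze: Callable[[str, str], List[str]],
    decide: Callable[[str, str, List[str]], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Hypothetical Screen -> UI Elements -> Action cycle; returns an inspectable trace."""
    trace: List[Step] = []
    for _ in range(max_steps):
        screen = observe()                                  # Screen
        key_elements = analyze(instruction, screen)         # UI Elements
        action = decide(instruction, screen, key_elements)  # Action
        trace.append(Step(key_elements, action))
        if action == "done":
            break
        execute(action)  # the next observation reflects the new screen
    return trace


# Toy stand-ins for a model and an environment (purely illustrative).
screens = ["home", "settings"]
state = {"index": 0}


def observe() -> str:
    return screens[state["index"]]


def analyze(instruction: str, screen: str) -> List[str]:
    return ["settings icon"] if screen == "home" else []


def decide(instruction: str, screen: str, key_elements: List[str]) -> str:
    return f"click({key_elements[0]})" if key_elements else "done"


def execute(action: str) -> None:
    state["index"] = min(state["index"] + 1, len(screens) - 1)


trace = run_uiloop("open settings", observe, analyze, decide, execute)
for step in trace:
    print(step.key_elements, "->", step.action)
```

The point of the sketch is that every action in `trace` is stored next to the key elements that justified it, which is the traceability a direct screen-to-action call does not provide.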

Figure 2: Schematic contrast between "Screen-to-Action" and the proposed "Screen-UI Elements-Action" paradigm.

Framework Design: Benchmark and Reinforcement Fine-Tuning

UILoop is realized via a modular framework encompassing data augmentation, benchmark construction, and reinforcement fine-tuning.

UI Comprehension-Bench Construction

Existing datasets under-annotate fine-grained UI information. UI Comprehension-Bench is built by augmentation in three stages: OmniParser V2 annotates each screen with UI element coordinates, GPT-4o selects the functionally relevant elements, and human annotators curate the result. The dataset comprises 26,207 samples, each carrying (instruction, screen, key UI elements, action) metadata, and supports three dedicated evaluation metrics: Locate (coordinate accuracy), Lingualize (semantic alignment), and Leverage (utilization correctness).
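One way to picture a single benchmark record is as a small typed structure. The sketch below is an assumption for illustration only: the class names, field names, the `(x1, y1, x2, y2)` box convention, and the `validate` helper are hypothetical and do not reflect the released data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class KeyUIElement:
    # Assumed box convention: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    bbox: Tuple[float, float, float, float]
    description: str  # semantic function of the element (what Lingualize would score)
    usage: str        # how the element is used for the task (what Leverage would score)


@dataclass(frozen=True)
class BenchSample:
    instruction: str
    screen: str                      # path or identifier of the screenshot
    key_elements: List[KeyUIElement]
    action: str                      # e.g. an open_app, type, or click action


def validate(sample: BenchSample) -> None:
    """Reject samples whose key-element boxes are degenerate."""
    for element in sample.key_elements:
        x1, y1, x2, y2 = element.bbox
        if x1 >= x2 or y1 >= y2:
            raise ValueError(f"degenerate bbox: {element.bbox}")


sample = BenchSample(
    instruction="Open the settings page",
    screen="screens/000001.png",
    key_elements=[
        KeyUIElement(
            bbox=(40.0, 900.0, 120.0, 980.0),
            description="Gear icon that opens system settings",
            usage="Tap it to reach the settings page",
        )
    ],
    action="click(80, 940)",
)
validate(sample)
```

Keeping the location, description, and usage of each key element as separate fields mirrors the three metrics, so each can be scored on its own.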

Figure 3: Overview of the UI-in-the-Loop (UILoop) framework, illustrating both data synthesis and RL-based fine-tuning.

Data Analysis

The benchmark includes 1,576,068 UI elements, of which fewer than 4% are marked as ground-truth (GT) key elements. Identifying them is challenging: in over a quarter of samples, a single GT element must be found among distractors. Coverage rates for UI reasoning exceed 90% for most actions, which supports the logical coherence of the key elements.
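A quick back-of-the-envelope calculation puts these figures in perspective. It uses only the totals quoted in this summary and treats the 4% figure as an upper bound; the per-sample averages are derived here and are not reported values.

```python
NUM_SAMPLES = 26_207        # samples in UI Comprehension-Bench
NUM_ELEMENTS = 1_576_068    # annotated UI elements across all samples
MAX_KEY_FRACTION = 0.04     # "fewer than 4%" are GT key elements (upper bound)

# Average number of candidate elements a model must sift through per sample.
avg_elements_per_sample = NUM_ELEMENTS / NUM_SAMPLES

# Upper bounds implied by the 4% figure.
max_key_elements = NUM_ELEMENTS * MAX_KEY_FRACTION
max_key_per_sample = max_key_elements / NUM_SAMPLES

print(f"average elements per sample: {avg_elements_per_sample:.1f}")
print(f"upper bound on GT key elements: {max_key_elements:,.0f}")
print(f"upper bound on GT key elements per sample: {max_key_per_sample:.2f}")
```

Each sample therefore holds about 60 candidate elements on average, with fewer than roughly 2.4 of them being key elements, which is why locating them among distractors is hard.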

Figure 4: UI Comprehension-Bench statistics: GT UI element distribution and semantic description token lengths; effective usage of GT elements in action inference.

UI Element-Driven Reinforcement Fine-Tuning

UILoop employs Group Relative Policy Optimization (GRPO) RL, integrating specialized rewards:

Format Reward: Enforces structured output for transparency.
Location Reward: Quantifies spatial accuracy vs. GT coordinates.
Lingualization Reward: Measures semantic similarity between predicted and GT descriptions.
Leverage Reward: Assesses correctness of actionable usage.

These components are hierarchically composed to prioritize locating and semantically understanding key UI elements before learning their functional actuation.
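The sketch below shows one possible reading of this hierarchical composition, followed by the group-relative normalization that GRPO applies to the rewards of responses sampled for the same prompt. The gating threshold, the equal weighting, and the function names are assumptions made for illustration; the summary does not give the paper's exact reward formulas.

```python
from statistics import mean, pstdev
from typing import List


def composite_reward(
    format_ok: bool,
    locate: float,
    lingualize: float,
    leverage: float,
    gate: float = 0.5,  # hypothetical threshold, not taken from the paper
) -> float:
    """Combine the four rewards so that Leverage counts only after the basics are met."""
    if not format_ok:
        return 0.0                    # unstructured output earns nothing
    total = 1.0 + locate + lingualize  # format bonus plus element understanding
    if locate >= gate and lingualize >= gate:
        total += leverage             # usage is rewarded only once elements are understood
    return total


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against its own sampling group (the core idea of GRPO)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for the same (instruction, screen) pair.
group = [
    composite_reward(True, 1.0, 1.0, 1.0),   # everything correct
    composite_reward(True, 0.2, 1.0, 1.0),   # poor localization, Leverage withheld
    composite_reward(True, 0.8, 0.6, 0.0),   # elements understood, wrong usage
    composite_reward(False, 1.0, 1.0, 1.0),  # malformed output
]
advantages = group_relative_advantages(group)
print(group)
print([round(a, 3) for a in advantages])
```

Because advantages are computed relative to the group mean, no separate value network is needed: a response is reinforced only if it scores better than its siblings for the same prompt.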

Empirical Results: UI Comprehension and GUI Reasoning

Experiments on the ScreenSpot-Pro and Android Control-High benchmarks show UILoop-trained models outperforming the baselines, including zero-shot and SFT/RFT "Screen-to-Action" MLLMs, by substantial margins. Notably:

UILoop-3B and UILoop-7B outperform Qwen2.5-VL and GUI-R1, respectively, by up to 13.3% and 3.2% in overall score, and exceed OS-Atlas-7B by 46.5% on SR.
On UI Comprehension-Bench, traditional models score below 10% on Locate, Lingualize, and Leverage, while UILoop-7B attains 26.1 overall, 18.4 higher than GUI-Owl-7B.

Figure 5: Ablation study quantifying Locate, Lingualize, and Leverage rewards' individual contributions to UI comprehension and reasoning performance.

Ablation and Case Analysis

The ablation indicates that all reward components are needed: the Leverage Reward alone improves the metrics, the Locate Reward improves GR, and the Lingualize Reward improves SR. Case studies show UILoop's explicit reasoning trace: for instructions involving visually ambiguous elements, UILoop correctly localizes and semantically interprets the intended action target, avoiding typical pitfalls of direct screen-based approaches.

Figure 6: Case comparison between Screen-to-Action and UILoop approaches, emphasizing interpretability through explicit UI element semantic reasoning.

Dataset Demonstrations and Action Diversity

Representative examples from UI Comprehension-Bench show the richness and traceability of its annotations for open_app, type, and click actions. This coverage supports generalization across action types.

Figure 7: Example case with open_app actions and key UI elements from UI Comprehension-Bench.

Figure 8: Example case with type actions and associated reasoning chain.

Figure 9: Example case with click actions depicting semantic-functional annotation.

Error Analysis

UILoop makes markedly fewer Locate, Lingualize, and Leverage errors than leading Screen-to-Action models, indicating stronger handling of UI elements.

Figure 10: Cross-model analysis of error types, showing UILoop's advantage in minimizing common UI-related reasoning failures.

Implications and Future Directions

UI-in-the-Loop shifts the objective for multimodal GUI agents away from black-box action prediction and toward transparent, intrinsic comprehension of UI elements. In practice, this can make automation more reliable and interpretable across heterogeneous GUI environments, in application domains such as office productivity, web browsing, and device control. In theory, the paradigm may also improve sample complexity for downstream RL fine-tuning, zero-shot reliability, and transfer across UI domains.

The noted limitations, such as the lack of coarse-grained UI layout modeling and the primary focus on Qwen2.5-VL, motivate future work on hierarchical UI representations and on generalization to other MLLMs. The paradigm may also inform vision-language-action architectures and interoperable agent reasoning.

Conclusion

The paper identifies a critical missing link in existing "Screen-to-Action" approaches for GUI agents and offers a systematic UI-in-the-Loop paradigm with explicit UI element-driven RL fine-tuning. Empirical results across the benchmarks support its claims of gains in both interpretability and task performance. UI Comprehension-Bench provides a foundation for advancing GUI agent capabilities, and UILoop represents a notable step in multimodal agent reasoning methodology (2604.06995).
