GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents (2506.03143v1)

Published 3 Jun 2025 in cs.CL, cs.AI, and cs.CV

Abstract: One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.

Summary

  • The paper introduces GUI-Actor, a framework that replaces text-based coordinate generation with an attention-based action head and a dedicated <ACTOR> token for direct GUI element identification.
  • It applies spatial-aware multi-patch supervision and a grounding verifier to enhance robustness and sample efficiency in detecting GUI elements.
  • Experimental results demonstrate state-of-the-art performance across multiple benchmarks, with improved generalization over varying screen layouts and resolutions.

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Introduction

The paper "GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents" proposes a visual grounding framework that addresses the limitations of existing coordinate-based approaches for graphical user interface (GUI) agents. This work introduces GUI-Actor, a vision-language model (VLM) method that leverages an attention-based action head to achieve coordinate-free grounding. Unlike traditional methods that rely on generating numeric coordinates to interact with GUI elements, GUI-Actor directly identifies target regions using a dedicated <ACTOR> token, enhancing spatial-semantic alignment and generalization across varying screen layouts and resolutions.

Methodology

Overview of GUI-Actor

The GUI-Actor framework is built upon a VLM equipped with an attention-based action head, enabling the direct mapping of instructions to GUI elements without coordinate generation (Figure 1).

Figure 1: Overview of GUI-Actor. (a) Illustration of how the action head works with a VLM for coordinate-free GUI grounding. (b) Illustration of the spatial-aware multi-patch supervision for model training, labeling image patches as positive or negative based on ground-truth bounding boxes.
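
As a rough sketch of the patch-labeling step in Figure 1(b), the snippet below marks every visual patch that overlaps the ground-truth bounding box as positive; the patch size and the simple rectangle-overlap rule are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def label_patches(bbox, img_w, img_h, patch=28):
    """Label each image patch positive if it overlaps the ground-truth box.

    bbox: (x0, y0, x1, y1) in pixels. `patch` is an assumed patch size.
    Returns a (rows, cols) tensor of {0, 1} labels, one per visual patch.
    """
    rows, cols = img_h // patch, img_w // patch
    labels = torch.zeros(rows, cols)
    x0, y0, x1, y1 = bbox
    for r in range(rows):
        for c in range(cols):
            px0, py0 = c * patch, r * patch         # patch top-left corner
            px1, py1 = px0 + patch, py0 + patch     # patch bottom-right corner
            # positive if the patch rectangle intersects the bounding box
            if px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0:
                labels[r, c] = 1.0
    return labels
```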

<ACTOR> Token as a Contextual Anchor

Instead of generating precise screen coordinates as text, GUI-Actor introduces an <ACTOR> token that attends jointly to visual and textual inputs. This token serves as a contextual anchor within the model, aligning the visual patch features with the target element.

Attention-Based Action Head

The action head computes attention scores over visual patch tokens derived from the screenshot, establishing a spatial activation map to highlight the most relevant regions for interaction. Through multi-patch supervision, the model accommodates the natural ambiguity present in GUI interactions, allowing for more robust target identification.
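
A minimal sketch of such an action head is shown below, assuming the query comes from the model's last-layer hidden state at the <ACTOR> position; the layer names, dimensions, and loss formulation here are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionHead(nn.Module):
    """Scores every visual patch token against the <ACTOR> hidden state.

    A sketch with hypothetical projection sizes, not the released head.
    """
    def __init__(self, d_model=3584, d_attn=512):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_attn)  # projects the <ACTOR> state
        self.k_proj = nn.Linear(d_model, d_attn)  # projects patch tokens

    def forward(self, actor_state, patch_tokens):
        # actor_state: (B, d_model); patch_tokens: (B, N, d_model)
        q = self.q_proj(actor_state).unsqueeze(1)        # (B, 1, d_attn)
        k = self.k_proj(patch_tokens)                    # (B, N, d_attn)
        scores = (q @ k.transpose(-1, -2)).squeeze(1)    # (B, N)
        return scores / k.shape[-1] ** 0.5               # scaled dot-product

def multi_patch_loss(scores, patch_labels):
    """Cross-entropy between the attention map and the normalized patch
    labels, so every positive patch (not a single point) supervises the head."""
    target = patch_labels / patch_labels.sum(-1, keepdim=True).clamp(min=1e-6)
    return -(target * F.log_softmax(scores, dim=-1)).sum(-1).mean()
```

At inference, the softmax over `scores` yields the spatial activation map, and its highest-scoring patches become the candidate action regions.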

Grounding Verifier

A grounding verifier refines action decisions by evaluating multiple candidate regions. By scoring these candidates, the verifier enhances the model's grounding accuracy, ensuring the selected region aligns with the user's intent.
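
A hedged sketch of the selection loop is below; `verifier_score` stands in for the paper's verifier model, and its interface here is a hypothetical placeholder that returns a plausibility score for a candidate region given the screenshot and instruction.

```python
def select_action_region(candidates, screenshot, instruction, verifier_score):
    """Return the candidate region the verifier deems most plausible.

    candidates: region proposals from the action head's attention map.
    verifier_score: hypothetical callable (screenshot, instruction, region)
        -> float; higher means the region better matches the user's intent.
    """
    best_region, best_score = None, float("-inf")
    for region in candidates:
        score = verifier_score(screenshot, instruction, region)
        if score > best_score:
            best_region, best_score = region, score
    return best_region
```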

Experimental Results

Benchmarks and Performance

GUI-Actor demonstrates state-of-the-art performance across multiple GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, which features high-resolution interfaces and significant domain shifts (Figure 2).

Figure 2: Left: Model performance vs. training data scale on the ScreenSpot-Pro benchmark. Right: Illustration of action attention. GUI-Actor grounds target elements by attending to the most relevant visual regions.

GUI-Actor outperforms previously established methods, especially under conditions that demand strong generalization. The coordinate-free approach lets GUI-Actor maintain competitive accuracy even with significantly fewer parameters and less training data than competing models.

Robustness and Efficiency

Figure 3: Accuracy progression over training steps.

GUI-Actor achieves better sample efficiency, requiring less training data than coordinate-based approaches to reach similar or superior accuracy (Figure 3). This efficiency is attributed to the explicit spatial-semantic alignment provided by the attention-based action head.

Discussion on Practical Implications

GUI-Actor exemplifies how VLMs can be tailored for GUI interaction without compromising their general-purpose capabilities. By fine-tuning only the newly introduced action head, the model equips the underlying VLM with effective visual grounding while preserving its broad utility. The integration of a verifier further demonstrates the model's ability to refine action decisions efficiently, highlighting its potential for real-world applications across diverse user interfaces.

Conclusion

The GUI-Actor framework represents a significant advancement in the development of GUI agents, effectively circumventing the limitations of coordinate-based visual grounding methods. Its use of attention-based mechanisms for direct region identification, coupled with a lightweight verifier, equips GUI agents with robust grounding capabilities adaptable to various screen configurations. The experimental results underscore its potential to set new standards for GUI interaction, paving the way for future research on VLM integration for user interface tasks.
