GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

Published 27 Apr 2026 in cs.CV | (2604.23941v1)

Abstract: Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices like mobile phones is increasingly critical for GUI agents requiring low latency. However, this goal faces a significant challenge, as current visual grounding methods typically employ large vision-LLM (VLM) (more than 2.5B parameters), making them impractical for on-device execution due to memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves excellent visual grounding accuracy, even on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we select an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales for GUI grounding tasks. Additionally, the limited capacity of small VLMs encourages us to develop a Progressive Data Refinement pipeline that utilizes task type filtering and data ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M raw dataset. Training GoClick using this core set brings notable grounding accuracy gains. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small size and high inference speed. GoClick also enhances GUI agent performance when integrated into a device-cloud collaboration framework, where GoClick helps cloud-based task planners perform precise element localization and achieve higher success rates. We hope our method serves as a meaningful exploration within the GUI agent community.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces GoClick, a 230M-parameter model that robustly grounds GUI elements using Florence-2's encoder-decoder architecture.
It leverages a Progressive Data Refinement pipeline to curate a high-quality 3.8M sample dataset, enhancing both data efficiency and grounding accuracy.
Experiments on seven benchmarks demonstrate high accuracy and reduced latency compared to larger models, enabling effective on-device autonomous GUI interaction.

GoClick: Compact GUI Element Grounding for Autonomous Interaction

Motivation and Context

Graphical User Interface (GUI) element grounding—a procedure for precisely localizing interface objects given natural language prompts—is pivotal for multimodal autonomous agents, especially when deployed on resource-constrained platforms such as mobile devices. While recent Vision-LLMs (VLMs) demonstrate high performance on visual grounding tasks, these typically feature large-scale architectures (≥ 2.5B parameters), rendering them impractical for edge device deployment due to latency and memory bottlenecks. Addressing this gap, the paper presents GoClick, a lightweight GUI grounding model that achieves high accuracy with only 230M parameters, thus advancing on-device viability without compromising grounding efficacy.

Figure 1: Illustration of GUI interaction by an agent automating GUI operations via robust element grounding.

Model Architecture and Design Choices

GoClick adopts the Florence-2 encoder-decoder architecture over popular decoder-only designs, capitalizing on the proven advantage of encoder-decoder models for narrow tasks at reduced parameter scales. The model pipeline incorporates a visual encoder (e.g., ViT) and concatenates visual embeddings with textual representations, enabling focused spatial localization rather than generic, expansive language generation.

The architectural decision stands in contrast to downsizing large decoder-only VLMs, which the paper shows results in suboptimal grounding under parameter budget constraints. Extensive experiments confirm the superiority of Florence-2’s structure for GUI grounding at scales down to 230M parameters, outperforming several competitive decoder-only variants with larger parameter counts.

Figure 2: GoClick leverages Florence-2’s encoder-decoder architecture for efficient multi-modal spatial grounding.

GoClick’s efficacy stems from not only its architecture but also rigorous training data refinement. The authors curate a massive 10.8M-sample dataset, spanning diverse RE types: Alt-texts, Brief Descriptions, Action Intents, and Functionality Descriptions. Recognizing the limits of small models and the heterogeneity of GUI tasks, the Progressive Data Refinement (PDR) pipeline is used to enhance core sample quality through:

Coarse-grained filtering (removing outdated GUI patterns and non-grounding tasks)
Fine-grained adjustment (tuning inclusion ratios for task types across data sources)

This process yields a high-quality 3.8M core set, verified empirically to result in substantial accuracy improvements and training efficiency.

Figure 3: Data generation and refinement: diverse GUI grounding tasks are filtered and balanced to create the core training set.

Experimental Results

Speed-Accuracy Tradeoff

GoClick’s performance is evaluated on seven GUI grounding benchmarks (FuncPred, ScreenSpot, ScreenSpot-v2, MOTIF, RefExp, VisualWebBench EG and AG), using Grounding Accuracy, Time-To-First-Token (TTFT), and Time-Per-Output-Token (TPOT) as metrics. GoClick not only consistently surpasses competing lightweight models—often achieving parity with or outperforming 7B+ models—but also offers a marked reduction in inference latency and model size. For example, GoClick achieves 78.5% ScreenSpot accuracy with only 0.8B parameters, outperforming both Qwen2VL-2B and SLiME-Gemma-2B and matching Qwen2VL-7B.

Figure 4: GoClick attains superior grounding accuracy at reduced model size and increased inference speed compared to competitive VLMs.

Architectural Ablation

Comparative studies between encoder-decoder and decoder-only model variants, fine-tuned on the same dataset, further validate GoClick’s architecture. Florence-2-based GoClick surpasses decoder-only alternatives even at higher parameter scales, demonstrating robustness across all GUI domains and task types. The model also exhibits lower performance differentials between text (easy) and icon (hard) grounding tasks, underscoring cross-domain generalizability.

Data Efficiency

Rigorous ablation in the PDR pipeline reveals decisive trends: inclusion of outdated GUI samples or dual REG tasks (expression generation with same targets) degrades performance, and naive aggregation from some sources can induce catastrophic overfitting. By finely tuning task ratios per source, GoClick’s final accuracy outstrips models trained on larger, unrefined datasets, demonstrating parameter and data efficiency.

Figure 5: Empirical analysis of task inclusion ratios verifies that not all data are beneficial and that careful curation drives performance gains.

Applications: Device-Cloud Collaboration Agents

GoClick's utility is demonstrated in collaborative frameworks where planning (reasoning and task decomposition) is performed by proprietary large-scale cloud models, while element grounding is handled by GoClick locally. This two-stage approach, assessed on several agent benchmarks (AITW, AndroidControl, GUIAct), yields dramatic improvements in step success rate and click accuracy over both standalone proprietary models and systems using SoM prompting. GoClick's grounding expertise enables agents to resolve fine-grained visual coordinates, a longstanding bottleneck for proprietary planners.

Figure 6: Device-cloud agent architecture: reasoning and planning by cloud VLMs, spatial grounding by lightweight GoClick deployed on device.

Practical and Theoretical Implications

GoClick establishes that:

Architectural selection tailored for narrow spatial localization tasks results in significant accuracy improvements under severe resource constraints, supporting the premise that encoder-decoder models are preferable at low parameter scales for grounding-like tasks.
Systematic data refinement—removing potentially harmful and redundant training signals—can substantially enhance grounding performance and efficiency, challenging the long-held notion that more data invariably leads to better models, especially under strict deployment requirements.
Division of labor between planning (cloud) and grounding (device) enables a practical pathway toward robust autonomous GUI agents on mobile and embedded systems, expanding the operational frontier of multimodal AI.

The results imply that future developments should further investigate meta-learning or game-theoretic approaches for sample value estimation and explore expanding encoder-decoder architectures to cover more complex GUI interaction tasks, including reasoning and sequential planning.

Conclusion

GoClick exemplifies a strategic combination of lightweight architecture and refined training data for expert GUI element grounding. Empirical evaluation across benchmarks and agent tasks demonstrates strong speed-accuracy tradeoffs and deployment feasibility. Integration into device-cloud multi-stage agents validates GoClick’s practical utility, surpassing both standalone proprietary and competitive SoM-augmented solutions. Challenges remain in generalizing architectural insights to broader GUI tasks and automating the sample selection process, but the foundations laid by GoClick suggest promising avenues for high-performance, resource-constrained multimodal agents.

Markdown Report Issue