SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Abstract: Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent -- SeeClick -- which relies only on screenshots for task automation. In a preliminary study, we identified a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elements based on instructions. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. Alongside these efforts, we also created ScreenSpot, the first realistic GUI grounding benchmark spanning mobile, desktop, and web environments. After pre-training, SeeClick demonstrates significant improvement over various baselines on ScreenSpot. Moreover, comprehensive evaluations on three widely used benchmarks consistently support our finding that advancements in GUI grounding directly correlate with improved performance on downstream GUI agent tasks. The model, data, and code are available at https://github.com/njucckevin/SeeClick.
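The GUI grounding task described above can be made concrete with a minimal sketch: given an instruction and a screenshot, a grounding model predicts a click point, and a prediction counts as correct when that point falls inside the target element's bounding box (the style of accuracy check used by ScreenSpot-like benchmarks). The function names and normalized-coordinate convention below are illustrative assumptions, not taken from the paper's released code.

```python
def point_in_bbox(pred, bbox):
    """Check whether a predicted click point lands inside a target element's box.

    pred: (x, y) in normalized [0, 1] screen coordinates.
    bbox: (left, top, right, bottom), also normalized.
    """
    x, y = pred
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions, bboxes):
    """Fraction of predicted click points that fall inside their target boxes."""
    hits = sum(point_in_bbox(p, b) for p, b in zip(predictions, bboxes))
    return hits / len(predictions)


# Toy example: two predicted clicks evaluated against two target elements.
preds = [(0.52, 0.31), (0.10, 0.90)]
boxes = [(0.45, 0.25, 0.60, 0.40), (0.50, 0.50, 0.80, 0.70)]
print(grounding_accuracy(preds, boxes))  # first click hits, second misses -> 0.5
```

Evaluating at the click-point level (rather than requiring the model to output a full bounding box) matches how a screenshot-only agent would actually act: it only needs somewhere inside the element to click.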