SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (2401.10935v2)

Published 17 Jan 2024 in cs.HC and cs.AI

Abstract: Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent -- SeeClick, which only relies on screenshots for task automation. In our preliminary study, we have discovered a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elements based on instructions. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. Along with the efforts above, we have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. After pre-training, SeeClick demonstrates significant improvement in ScreenSpot over various baselines. Moreover, comprehensive evaluations on three widely used benchmarks consistently support our finding that advancements in GUI grounding directly correlate with enhanced performance in downstream GUI agent tasks. The model, data and code are available at https://github.com/njucckevin/SeeClick.

Overview of "SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents"

Graphical user interface (GUI) agents that automate complex tasks on devices such as desktops and smartphones are an active research direction in artificial intelligence. The paper presents SeeClick, a visual GUI agent that automates tasks from interface screenshots alone. This visual-only methodology departs from traditional GUI agents and addresses key limitations of structured, text-based interaction methods, such as lengthy HTML that is sometimes unavailable on desktops.

Core Contributions

  1. Introduction of SeeClick: SeeClick builds on large vision-language models (LVLMs) to perform fundamental GUI operations directly from screenshots, foregoing structured text, which is often inaccessible and cumbersome to process. Inspired by how humans interact with GUIs, SeeClick adapts to diverse GUI platforms and offers a unified approach to automating GUI tasks (a minimal sketch of the screenshot-to-click interaction appears after this list).
  2. GUI Grounding Pre-training: A central challenge identified in the paper is GUI grounding—the ability to accurately localize screen elements based on instructions. To tackle this, the authors enhance LVLMs with GUI grounding pre-training. They propose a method for automatically curating grounding data from web and mobile environments, thereby enabling the accurate localization of various GUI elements like text, widgets, and icons.
  3. ScreenSpot Benchmark: The paper introduces ScreenSpot, the first realistic GUI grounding benchmark, encompassing mobile, desktop, and web environments. This benchmark enables systematic evaluation of visual GUI agents such as SeeClick.
  4. Empirical Evaluation: SeeClick's performance is evaluated on the ScreenSpot benchmark and across multiple GUI agent tasks, such as MiniWob, AITW, and Mind2Web. The agent outperforms existing baselines, demonstrating that improvements in GUI grounding lead to enhanced downstream task performance.
  5. Addressing Diverse GUI Platforms: The research includes an expansive collection of GUI data from web pages, mobile interfaces, and general vision-language instruction-following datasets. This ensures that SeeClick is trained on a comprehensive data set, facilitating robustness across various application scenarios.
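
To make the screenshot-driven interaction in contributions 1 and 2 concrete, the sketch below shows one plausible way an agent's textual prediction could be turned into an executable click. It assumes the model emits a normalized "(x, y)" point as plain text; the exact prompt and output conventions of SeeClick may differ, and the helper names (parse_click_point, to_pixels) are illustrative rather than part of the released code.

```python
import re
from typing import Optional, Tuple

# Illustrative sketch of the grounding-style interaction described above.
# Assumption: the model returns a click point as normalized "(x, y)" text,
# e.g. "(0.42, 0.17)"; SeeClick's actual prompt/output format may differ.

POINT_PATTERN = re.compile(r"\(\s*([01]?\.\d+)\s*,\s*([01]?\.\d+)\s*\)")

def parse_click_point(model_output: str) -> Optional[Tuple[float, float]]:
    """Extract a normalized (x, y) click point from the model's text output."""
    match = POINT_PATTERN.search(model_output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

def to_pixels(point: Tuple[float, float], width: int, height: int) -> Tuple[int, int]:
    """Map a normalized point onto a concrete screenshot resolution."""
    x, y = point
    return round(x * width), round(y * height)

if __name__ == "__main__":
    # Hypothetical model response for "click the search icon" on a 1920x1080 screenshot.
    response = "(0.42, 0.17)"
    point = parse_click_point(response)
    if point is not None:
        print(to_pixels(point, 1920, 1080))  # -> (806, 184)
```

In practice, the normalized point would be scaled to the live screenshot's resolution and dispatched to an OS- or browser-level click API, which is what lets a screenshot-only agent act without access to structured text.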

Numerical Results and Claims

  • ScreenSpot Evaluation: SeeClick substantially surpasses existing models on GUI grounding across all three platforms. Notably, despite its smaller model size, SeeClick outperforms alternatives such as CogAgent, underscoring the effectiveness of its grounding pre-training (the click-accuracy scoring used for grounding is sketched after this list).
  • Downstream Task Performance: Comprehensive evaluations demonstrate SeeClick's superiority in agent task performance. For instance, in the MiniWob benchmark, SeeClick achieves a markedly higher success rate than visual baseline models using a fraction of the training data. This performance strongly correlates with its advanced GUI grounding capability.
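
For context on how such grounding numbers are typically computed: a prediction is usually counted as correct when the predicted click point falls inside the target element's bounding box. The sketch below assumes normalized coordinates and a simple list-of-dicts sample format; both are illustrative choices rather than the benchmark's actual data schema.

```python
from typing import Iterable, Mapping, Tuple

# Hedged sketch of a ScreenSpot-style click-accuracy metric.
# Assumption: each sample stores a predicted point and a ground-truth bounding
# box in normalized [0, 1] coordinates; the benchmark's real schema may differ.

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)

def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the predicted point lands inside the target element."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(samples: Iterable[Mapping]) -> float:
    """Fraction of samples whose predicted click point hits the target box."""
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(point_in_box(s["pred_point"], s["gt_box"]) for s in samples)
    return hits / len(samples)

if __name__ == "__main__":
    demo = [
        {"pred_point": (0.42, 0.17), "gt_box": (0.40, 0.15, 0.48, 0.20)},  # hit
        {"pred_point": (0.90, 0.90), "gt_box": (0.10, 0.10, 0.20, 0.20)},  # miss
    ]
    print(grounding_accuracy(demo))  # -> 0.5
```

Accuracy is then simply the fraction of hits, which is the kind of per-platform score reported on ScreenSpot.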

Implications and Future Directions

The implications of this research are twofold. Practically, SeeClick paves the way for GUI automation tools that require minimal human intervention and adapt seamlessly across platforms. Theoretically, the findings underscore GUI grounding as an underexplored yet vital component for enhancing the interaction capabilities of visual GUI agents.

Future work could expand the action space beyond clicking and typing to more complex operations such as dragging and multi-step interactions. Applying SeeClick's architecture to new environments, including real-world scenarios that raise privacy concerns, could further reveal its capabilities and limitations. Addressing dataset bias and ensuring the safe deployment of GUI agents also remain critical areas for ongoing research.

In conclusion, the "SeeClick" paper presents a well-founded contribution to advancing GUI agent research, providing valuable insights into the design and training of visual-based automation systems.

References (53)
  1. Uibert: Learning generic multimodal representations for ui understanding. arXiv preprint arXiv:2107.13731.
  2. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
  3. Introducing our multimodal models.
  4. A dataset for interactive vision-language navigation with unknown command feasibility. In European Conference on Computer Vision, pages 312–328. Springer.
  5. Object detection for graphical user interface: Old fashioned or deep learning or a combination? In proceedings of the 28th ACM joint meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1202–1214.
  6. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478.
  7. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195.
  8. Pix2seq: A language modeling framework for object detection. In International Conference on Learning Representations.
  9. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th annual ACM symposium on user interface software and technology, pages 845–854.
  10. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070.
  11. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
  12. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854.
  13. Assistgui: Task-oriented desktop graphical user interface automation. arXiv preprint arXiv:2312.13108.
  14. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856.
  15. Understanding html with large language models. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
  16. Learning to navigate the web. In International Conference on Learning Representations.
  17. Actionbert: Leveraging user actions for semantic understanding of user interfaces. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5931–5938.
  18. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914.
  19. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations.
  20. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798.
  21. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491.
  22. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR.
  23. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726.
  24. Gang Li and Yang Li. 2022. Spotlight: Mobile ui understanding using vision-language models with a focus. In The Eleventh International Conference on Learning Representations.
  25. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975.
  26. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776.
  27. Widget captioning: Generating natural language description for mobile user interface elements. arXiv preprint arXiv:2010.04295.
  28. Vut: Versatile ui transformer for multi-modal multi-task user interface modeling. arXiv preprint arXiv:2112.05692.
  29. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations.
  30. Visual instruction tuning. In Neural Information Processing Systems.
  31. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281.
  32. OpenAI. 2023. GPT-4 technical report.
  33. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824.
  34. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088.
  35. From pixels to ui actions: Learning to follow instructions via graphical user interfaces. In Advances in Neural Information Processing Systems.
  36. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR.
  37. Corex: Pushing the boundaries of complex reasoning through multi-model collaboration. arXiv preprint arXiv:2310.00280.
  38. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  39. Screen2words: Automatic mobile ui summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510.
  40. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175.
  41. Symbol-llm: Towards foundational symbol-centric interface for large language models. arXiv preprint arXiv:2311.09278.
  42. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562.
  43. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771.
  44. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  45. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.
  46. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.
  47. Zhuosheng Zhang and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.
  48. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102.
  49. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112.
  50. Reinforced ui instruction grounding: Towards a generic ui task automation api. arXiv preprint arXiv:2310.04716.
  51. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In NeurIPS 2023 Foundation Models for Decision Making Workshop.
  52. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.
  53. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
Authors (7)
  1. Kanzhi Cheng (14 papers)
  2. Qiushi Sun (26 papers)
  3. Yougang Chu (3 papers)
  4. Fangzhi Xu (22 papers)
  5. Yantao Li (13 papers)
  6. Jianbing Zhang (29 papers)
  7. Zhiyong Wu (171 papers)
Citations (69)