SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Abstract: Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent -- SeeClick -- which relies only on screenshots for task automation. In a preliminary study, we identified a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elements based on instructions. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. Alongside these efforts, we also created ScreenSpot, the first realistic GUI grounding benchmark spanning mobile, desktop, and web environments. After pre-training, SeeClick demonstrates significant improvement over various baselines on ScreenSpot. Moreover, comprehensive evaluations on three widely used benchmarks consistently support our finding that advancements in GUI grounding directly correlate with improved performance on downstream GUI agent tasks. The model, data, and code are available at https://github.com/njucckevin/SeeClick.
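The GUI grounding task described above can be made concrete with a minimal sketch: given an instruction and a screenshot, a grounding model predicts a click point, and a prediction counts as correct when that point falls inside the target element's bounding box (the style of accuracy check used by ScreenSpot-like benchmarks). The function names and normalized-coordinate convention below are illustrative assumptions, not taken from the paper's released code.

```python
def point_in_bbox(pred, bbox):
    """Check whether a predicted click point lands inside a target element's box.

    pred: (x, y) in normalized [0, 1] screen coordinates.
    bbox: (left, top, right, bottom), also normalized.
    """
    x, y = pred
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions, bboxes):
    """Fraction of predicted click points that fall inside their target boxes."""
    hits = sum(point_in_bbox(p, b) for p, b in zip(predictions, bboxes))
    return hits / len(predictions)


# Toy example: two predicted clicks evaluated against two target elements.
preds = [(0.52, 0.31), (0.10, 0.90)]
boxes = [(0.45, 0.25, 0.60, 0.40), (0.50, 0.50, 0.80, 0.70)]
print(grounding_accuracy(preds, boxes))  # first click hits, second misses -> 0.5
```

Evaluating at the click-point level (rather than requiring the model to output a full bounding box) matches how a screenshot-only agent would actually act: it only needs somewhere inside the element to click.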