EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data (2410.19461v2)
Abstract: Autonomous agents operating on the graphical user interfaces (GUIs) of various applications hold immense practical value. Unlike LLM-based methods, which rely on structured text and customized backends, approaches using large vision-language models (LVLMs) are more intuitive and adaptable: they visually perceive and directly interact with screens, making them indispensable in general scenarios without text metadata or tailored backends. Given the lack of high-quality training data for GUI-related tasks in existing work, this paper aims to enhance the GUI understanding and interaction capabilities of LVLMs through a data-driven approach. We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web. Evaluation results on various GUI and agent benchmarks demonstrate that the model trained on the dataset generated through EDGE exhibits superior webpage understanding capabilities, which transfer readily to previously unseen desktop and mobile environments. Our approach significantly reduces the dependence on manual annotations, empowering researchers to harness the vast public resources available on the Web to advance their work. Our source code, dataset, and model are available at https://anonymous.4open.science/r/EDGE-1CDB.
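The abstract describes harvesting multi-granularity training data directly from webpages. Below is a minimal sketch of how element-level grounding pairs might be collected from a live page, assuming Playwright as the browser backend; the selectors, record schema, and coordinate normalization are illustrative assumptions, not the paper's actual EDGE pipeline.

```python
# Hypothetical sketch: harvesting element-level grounding pairs from a webpage.
# Assumes Playwright is installed (pip install playwright; playwright install chromium).
# The CSS selectors, record schema, and normalization are illustrative only.
import json
from playwright.sync_api import sync_playwright

def harvest_grounding_pairs(url: str, out_path: str = "grounding_pairs.jsonl") -> None:
    width, height = 1280, 800
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path="screenshot.png")  # image paired with the annotations below

        records = []
        # Interactive elements yield fine-grained (element-level) annotations;
        # page titles or paragraphs could supply coarser, page-level supervision.
        for el in page.query_selector_all("a, button, input, select, textarea"):
            box = el.bounding_box()              # None for hidden/detached elements
            text = (el.inner_text() or "").strip()
            if not box or not text:
                continue
            records.append({
                "instruction": f'Click the element labeled "{text}"',
                "bbox": [box["x"] / width, box["y"] / height,
                         (box["x"] + box["width"]) / width,
                         (box["y"] + box["height"]) / height],
            })
        browser.close()

    with open(out_path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

if __name__ == "__main__":
    harvest_grounding_pairs("https://example.com")
```

Pairing each screenshot with such normalized (instruction, bounding box) records gives the kind of grounding supervision that can be scaled across the Web without manual annotation.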