GPT-4V(ision) is a Generalist Web Agent, if Grounded (2401.01614v2)

Published 3 Jan 2024 in cs.IR, cs.AI, cs.CL, and cs.CV

Abstract: The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents a great potential for web agents -- it can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding still remains a major challenge. Existing LMM grounding strategies like set-of-mark prompting turn out to be not effective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML structure and visuals. Yet, there is still a substantial gap with oracle grounding, leaving ample room for further improvement. All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct.

GPT-4V(ision) is a Generalist Web Agent, if Grounded

The paper "GPT-4V(ision) is a Generalist Web Agent, if Grounded" authored by Boyuan Zheng et al. from The Ohio State University explores the burgeoning potential of large multimodal models (LMMs) such as GPT-4V(ision) in the field of web navigation tasks. The researchers explore how these models, when appropriately grounded, can act as robust generalist web agents capable of handling diverse tasks across various websites. This paper takes inspiration from recent advancements in LMMs and employs them to develop a generalist web agent with a specific focus on addressing the grounding challenge.

Overview

The authors introduce SeeAct, a novel approach that leverages GPT-4V for web navigation by integrating visual understanding with textual planning. Evaluation is conducted on the Mind2Web benchmark, which provides a diverse set of tasks drawn from real-world websites. The experiments cover both offline evaluation on cached websites and a new online evaluation setting on live websites, giving a comprehensive assessment of the model's real-world applicability.

Methodology

SeeAct splits each step of a web task into two stages, action generation and action grounding (a minimal sketch of this two-stage loop follows the list below):

  1. Action Generation: Utilizing GPT-4V to generate a detailed textual plan based on the visual context of the webpage and the task requirements.
  2. Action Grounding: Converting the textual plan into executable actions. Three grounding strategies are explored:
    • Grounding via Element Attributes: Using heuristic searches based on detailed descriptions of target elements.
    • Grounding via Textual Choices: Presenting the LMM with a multiple-choice list of candidate elements (proposed by a small ranking model) and asking it to select the target from their textual descriptions.
    • Grounding via Image Annotation: Overlaying index labels on candidate elements in the screenshot (set-of-mark style) and asking the model to output the label of the target element.
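
The division of labor between the two stages can be made concrete with a short sketch. The Python below is not the authors' released implementation (that is available in the linked SeeAct repository); it is a minimal illustration of one step of the loop, assuming the OpenAI Python SDK for the GPT-4V call, a placeholder model name, and that a ranked list of candidate element descriptions is already available from a separate ranking step.

```python
import base64

from openai import OpenAI  # assumes the OpenAI Python SDK (v1+) is installed

client = OpenAI()               # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4-vision-preview"  # placeholder model name, not prescribed by the paper


def encode_screenshot(path: str) -> str:
    """Return the screenshot as a base64 data URL for the vision API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:image/png;base64,{b64}"


def ask(image_url: str, prompt: str) -> str:
    """Send one multimodal prompt (screenshot + text) and return the reply."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content


def seeact_step(task: str, screenshot_path: str, candidates: list[str]) -> str:
    """One SeeAct-style step: generate a textual action, then ground it.

    `candidates` is a ranked list of textual descriptions of HTML elements
    (e.g. produced by a small ranking model); producing it is outside this sketch.
    """
    image_url = encode_screenshot(screenshot_path)

    # Stage 1: action generation -- describe the next action in free text.
    plan = ask(image_url,
               f"Task: {task}\nDescribe the next action to take on this webpage.")

    # Stage 2: grounding via textual choices -- pick one element from a
    # multiple-choice list of candidate elements.
    choices = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(candidates))
    answer = ask(image_url,
                 f"Task: {task}\nProposed action: {plan}\n"
                 f"Which element should this action target? "
                 f"Answer with a single letter.\n{choices}")
    return answer
```

In this sketch, grounding via element attributes or via image annotation would replace only the second call; the action-generation stage stays the same, which is what allows the grounding strategies to be compared in isolation.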

Key Findings

  • Performance: SeeAct with GPT-4V shows substantial promise, completing 51.1% of tasks on live websites when its textual plans are grounded by an oracle. This far exceeds text-only models such as GPT-4, which achieved a completion rate of 13.3%.
  • Grounding Challenge: Despite the promise shown by LMMs, grounding remains the main bottleneck. Among the strategies studied, grounding via textual choices proved most effective, while image annotation suffered from label hallucination and errors in linking descriptions to the correct mark (a sketch of this style of annotation follows this list).
  • Evaluation Discrepancy: There is a notable discrepancy between offline and online evaluations, with online evaluations providing a more accurate measure of a model's performance due to the dynamic nature of the web and the presence of multiple viable task completion plans.
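
To make the image-annotation result concrete: set-of-mark-style prompting overlays numbered marks on candidate elements and asks the model to answer with a mark. The snippet below is a minimal sketch of that annotation step only, assuming Pillow is installed and that pixel bounding boxes for the candidate elements are already available (for example, from the rendered DOM); it illustrates the general technique rather than the paper's exact setup.

```python
from PIL import Image, ImageDraw  # assumes Pillow is installed


def annotate_with_marks(screenshot_path: str,
                        boxes: list[tuple[int, int, int, int]],
                        out_path: str = "annotated.png") -> None:
    """Overlay numbered marks on candidate element bounding boxes.

    `boxes` holds (x1, y1, x2, y2) pixel coordinates for each candidate
    element; obtaining them (e.g. from the rendered DOM) is outside this sketch.
    """
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)  # box around the element
        draw.text((x1 + 2, y1 + 2), str(i), fill="red")           # mark the model must echo back
    img.save(out_path)
```

The failure mode reported in the paper is that the model frequently hallucinates marks that are not in the image or attaches the right description to the wrong mark, which is why the HTML-based textual-choice grounding fares better.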

Implications and Future Directions

The paper's implications are multi-faceted, spanning both theoretical and practical concerns:

  • Web Accessibility and Automation: The potential of LMMs like GPT-4V as generalist web agents can significantly enhance web accessibility and automate complex sequences of actions on websites, aiding users with disabilities and streamlining routine tasks.
  • Improvement in Grounding Techniques: The persistent gap between current grounding methods and oracle grounding highlights the need for further research. Better utilization of the unique properties of web environments, such as the correspondence between HTML and visual elements, could mitigate hallucinations and improve model accuracy.
  • Evaluation Metrics: The difference between offline and online evaluations suggests that future models should be tested dynamically on live websites to ensure robust performance in real-world scenarios.

Conclusion

The research presents a thorough and insightful analysis of employing LMMs for web navigation tasks, emphasizing the critical role of grounding in converting multimodal model capabilities into practical, real-world applications. While GPT-4V and similar models hold substantial promise, addressing the grounding challenge remains pivotal for realizing their full potential as generalist web agents. Future work in this area should focus on refining grounding strategies and possibly developing new evaluation frameworks to better capture the dynamic and multifaceted nature of web tasks. This paper lays a strong foundation for subsequent advancements in the intersection of web automation and multimodal AI.

Authors (5)
  1. Boyuan Zheng
  2. Boyu Gou
  3. Jihyung Kil
  4. Huan Sun
  5. Yu Su
Citations (129)