OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization (2410.19609v1)

Published 25 Oct 2024 in cs.CL and cs.AI

Abstract: The rapid development of large language and multimodal models has sparked significant interest in using proprietary models, such as GPT-4o, to develop autonomous agents capable of handling real-world scenarios like web navigation. Although recent open-source efforts have tried to equip agents with the ability to explore environments and continuously improve over time, they build text-only agents in synthetic environments where the reward signals are clearly defined. Such agents struggle to generalize to realistic settings, which require multimodal perception abilities and lack ground-truth signals. In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agents that can autonomously conduct real-world exploration and improve themselves. We first train the base model with imitation learning so that it gains basic abilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, the agent further improves its policy by learning from the well-performing trajectories, as judged by another general-purpose model. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent improves after each iteration and demonstrates strong performance across multiple test sets.
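
The exploration-feedback-optimization cycle described in the abstract can be sketched as a simple loop. This is a minimal toy illustration, not the paper's implementation: `ToyAgent`, `toy_judge`, `run_cycle`, the `skill` parameter, and the update rule are all hypothetical stand-ins. The agent's starting `skill` plays the role of the imitation-learned base model, `toy_judge` stands in for the general-purpose model that evaluates trajectories, and `optimize` is a placeholder for fine-tuning on the retained trajectories.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    steps: List[str]
    success: bool = False  # filled in by the judge


@dataclass
class ToyAgent:
    # Hypothetical stand-in for a multimodal policy; `skill` is the chance
    # of finishing a task and represents the imitation-learned starting point.
    skill: float = 0.3
    training_set: List[Trajectory] = field(default_factory=list)

    def explore(self, task: str, rng: random.Random) -> Trajectory:
        # Simulated rollout: a real agent would act in a browser here.
        finished = rng.random() < self.skill
        steps = ["click", "type", "answer" if finished else "give_up"]
        return Trajectory(task=task, steps=steps)

    def optimize(self, good: List[Trajectory]) -> None:
        # Placeholder for fine-tuning on well-performing trajectories.
        self.training_set.extend(good)
        self.skill = min(1.0, self.skill + 0.01 * len(good))


def toy_judge(traj: Trajectory) -> bool:
    # Assumed stand-in for the external model that judges each trajectory.
    return traj.steps[-1] == "answer"


def run_cycle(
    agent: ToyAgent,
    tasks: List[str],
    judge: Callable[[Trajectory], bool],
    iterations: int,
    seed: int = 0,
) -> List[int]:
    """Run explore -> feedback -> optimize; return trajectories kept per iteration."""
    rng = random.Random(seed)
    kept_per_iteration = []
    for _ in range(iterations):
        trajectories = [agent.explore(task, rng) for task in tasks]  # exploration
        for traj in trajectories:
            traj.success = judge(traj)                               # feedback
        good = [traj for traj in trajectories if traj.success]
        agent.optimize(good)                                         # optimization
        kept_per_iteration.append(len(good))
    return kept_per_iteration
```

Calling `run_cycle(ToyAgent(), tasks, toy_judge, iterations=3)` runs three rounds. Only trajectories the judge accepts are added to the training set, which mirrors the abstract's point that learning relies on judged trajectories rather than ground-truth rewards.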
