
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents (2404.05902v1)

Published 8 Apr 2024 in cs.CL and cs.AI

Abstract: In the realm of web agent research, achieving both generalization and accuracy remains a challenging problem. Due to high variance in website structure, existing approaches often fail. Moreover, existing fine-tuning and in-context learning techniques fail to generalize across multiple websites. We introduce Wilbur, an approach that uses a differentiable ranking model and a novel instruction synthesis technique to optimally populate a black-box LLM's prompt with task demonstrations from previous runs. To maximize end-to-end success rates, we also propose an intelligent backtracking mechanism that learns and recovers from its mistakes. Finally, we show that our ranking model can be trained on data from a generative auto-curriculum which samples representative goals from an LLM, runs the agent, and automatically evaluates it, with no manual annotation. Wilbur achieves state-of-the-art results on the WebVoyager benchmark, beating text-only models by 8% overall, and up to 36% on certain websites. On the same benchmark, Wilbur is within 5% of a strong multi-modal model despite only receiving textual inputs, and further analysis reveals a substantial number of failures are due to engineering challenges of operating the web.

Advancing Web Agents Through Wilbur: Enhanced Generalization and Backtracking Capabilities

Introduction

Web agent technology seeks to automate interactions with the vast and diverse ecosystem of the internet, executing tasks across different websites. A significant hurdle in developing capable web agents is the high variance in website structures, which demands agents that not only perform tasks accurately but also generalize across these variations. This paper introduces Wilbur, an approach that enhances web agents through a suite of novel techniques: a differentiable ranking model for selecting task demonstrations, instruction synthesis, intelligent backtracking, and an auto-curriculum for learning from web interactions.

Wilbur's Novel Features

Explore, Reflect, and Backtrack

Wilbur differs from traditional web agents by incorporating mechanisms to learn from its actions, both successful and unsuccessful. When faced with a novel website, Wilbur samples an action from an LLM, executes it, and then reflects on the outcome. If the reflection phase determines that the action did not progress toward the goal, Wilbur backtracks to the last successful state. This forms a feedback loop in which the agent learns from its mistakes by storing them in the model's context for future reference.
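The loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the LLM-backed steps (proposing an action, judging progress) are passed in as plain callables, and the `Step` record and `failures` list are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: object  # None for the initial state
    state: object

def run_episode(goal, propose_action, execute, reflect, start=0, max_steps=10):
    """Explore, reflect, and backtrack (sketch with stubbed LLM calls).

    propose_action(goal, state, failures) -> candidate action
    execute(state, action)                -> resulting state
    reflect(goal, old_state, new_state)   -> did this step make progress?
    """
    trajectory = [Step(None, start)]
    failures = []  # failed (state, action) pairs kept in context so the
                   # model can avoid repeating its mistakes
    for _ in range(max_steps):
        current = trajectory[-1].state
        action = propose_action(goal, current, failures)
        new_state = execute(current, action)
        if reflect(goal, current, new_state):
            trajectory.append(Step(action, new_state))
            if new_state == goal:
                break
        else:
            # backtrack: discard the step, stay at the last good state,
            # and remember the failure
            failures.append((current, action))
    return trajectory
```

With toy callables over integer states (try a big step, fall back to a small one after a recorded failure), the loop recovers from a bad action and still reaches the goal.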

Demonstration Retrieval and Synthesis

Wilbur takes a dual approach to leveraging past demonstrations. It uses goal-conditioned demonstrations to guide the agent on tasks it may face on unseen websites, and website-conditioned demonstrations to tailor the agent's actions to the specific nuances of a given webpage. The synergy between these two sources of knowledge helps the agent generalize across a wide range of web tasks. A dedicated ranking model then scores the candidate demonstrations and populates the LLM's prompt with the most helpful examples.
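The selection step can be illustrated as follows. The paper trains a differentiable ranking model; as a stand-in for the sketch, a simple token-overlap (Jaccard) score is used, and the demo schema (`goal`, `website`, `trace` keys) is a hypothetical one chosen for the example.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings (stand-in scorer)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_demonstrations(goal, website, demos, k=3):
    """Rank stored demonstrations and keep the top-k for the prompt.

    Each demo combines a goal-conditioned score (similar task, possibly a
    different site) and a website-conditioned score (same site, possibly a
    different task); the better of the two decides its rank.
    """
    def score(demo):
        return max(jaccard(goal, demo["goal"]),        # goal-conditioned
                   jaccard(website, demo["website"]))  # website-conditioned
    return sorted(demos, key=score, reverse=True)[:k]
```

A demonstration recorded on the same website ranks first even when its goal differs, while a similar goal from another website still outranks an unrelated one.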

Autocurriculum for Scalable Learning

Wilbur employs an auto-curriculum framework for generating and evaluating goals, enabling rapid learning on new websites and tasks. The framework uses an LLM both to formulate goals and to evaluate the agent's executions, so the system quickly acquires a rich dataset of successful and unsuccessful trajectories without manual annotation. By continuously folding these trajectories back into its knowledge base, Wilbur demonstrates a robust self-improving mechanism.
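The collection loop can be sketched as below. Again the LLM-backed pieces (goal sampling, the agent itself, the automatic evaluator) are stubbed as callables, and the two-bucket knowledge base is a simplification of whatever store the real system uses.

```python
def autocurriculum(propose_goals, run_agent, evaluate, n_goals=5):
    """Generative auto-curriculum sketch (no manual labels).

    propose_goals(n)          -> list of goal strings sampled from an LLM
    run_agent(goal)           -> trajectory produced by the agent
    evaluate(goal, trajectory) -> bool, LLM-judged success
    """
    knowledge_base = {"success": [], "failure": []}
    for goal in propose_goals(n_goals):
        trajectory = run_agent(goal)
        bucket = "success" if evaluate(goal, trajectory) else "failure"
        # both outcomes are kept: failures later serve as negative
        # demonstrations and training signal for the ranking model
        knowledge_base[bucket].append((goal, trajectory))
    return knowledge_base
```

Because the evaluator is itself an LLM, the whole pipeline runs unsupervised; the resulting labeled trajectories are what the demonstration store and ranking model are trained on.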

Evaluation and Results

Wilbur was evaluated using the WebVoyager benchmark, where it outperformed the existing state-of-the-art text-only models by achieving a significant 8% increase in the overall success rate. Remarkably, Wilbur demonstrated capabilities close to a strong multi-modal model, bridging the gap to within 5%, despite only being a text-based agent. Furthermore, Wilbur showcased substantial improvements on specific challenging websites, highlighting its ability to navigate complex web structures and interact effectively with diverse components.

Implications and Future Directions

Wilbur's approach marks a substantial step forward in web agent development, showcasing the benefits of intelligent backtracking, synthesized learning from demonstrations, and an auto-curriculum for data generation. The research highlights the role of nuanced learning and adaptation strategies in achieving high task-execution performance across the web. It also points to several avenues for further exploration, such as improving the agent's interaction with complex web elements and overcoming the engineering limitations inherent in web navigation.

Future work in this area may focus on refining backtracking algorithms, developing more sophisticated web interaction protocols, and exploring the integration of multi-modal inputs to further enhance the agent's understanding and manipulation of web environments. Additionally, addressing the challenges posed by anti-scraping measures and ensuring robust operation across a broader spectrum of web technologies represent important goals for subsequent research endeavors.

Wilbur's advancements underscore the dynamic and evolving nature of web agent technology, offering a compelling blueprint for future innovations in the field. The approach's balance between leveraging past learning and dynamically adapting to new challenges sets a new standard for developing capable and generalizable web agents.

Authors: Michael Lutz, Arth Bohra, Manvel Saroyan, Artem Harutyunyan, Giovanni Campagna