
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents (2404.05902v1)

Published 8 Apr 2024 in cs.CL and cs.AI

Abstract: In the realm of web agent research, achieving both generalization and accuracy remains a challenging problem. Due to high variance in website structure, existing approaches often fail. Moreover, existing fine-tuning and in-context learning techniques fail to generalize across multiple websites. We introduce Wilbur, an approach that uses a differentiable ranking model and a novel instruction synthesis technique to optimally populate a black-box LLM's prompt with task demonstrations from previous runs. To maximize end-to-end success rates, we also propose an intelligent backtracking mechanism that learns and recovers from its mistakes. Finally, we show that our ranking model can be trained on data from a generative auto-curriculum which samples representative goals from an LLM, runs the agent, and automatically evaluates it, with no manual annotation. Wilbur achieves state-of-the-art results on the WebVoyager benchmark, beating text-only models by 8% overall, and up to 36% on certain websites. On the same benchmark, Wilbur is within 5% of a strong multi-modal model despite only receiving textual inputs, and further analysis reveals a substantial number of failures are due to engineering challenges of operating the web.

Advancing Web Agents Through Wilbur: Enhanced Generalization and Backtracking Capabilities

Introduction

Web agent technology seeks to automate interactions with the vast and diverse ecosystem of the internet, executing tasks across different websites. A significant hurdle in developing capable web agents is the high variance in website structures, which demands agents that not only perform tasks accurately but also generalize across these variations. This paper introduces Wilbur, an approach that enhances web agents through a suite of novel techniques: a differentiable ranking model for selecting task demonstrations, instruction synthesis, intelligent backtracking, and an auto-curriculum for learning from web interactions.

Wilbur's Novel Features

Explore, Reflect, and Backtrack

Wilbur differs from traditional web agents by incorporating mechanisms to learn from its actions, both successful and unsuccessful. When faced with a novel website, Wilbur samples an action from an LLM, executes it, and then reflects on the outcome. If the reflection phase determines that the action did not progress toward the goal, Wilbur backtracks to the last successful state. This forms a feedback loop in which the agent learns from its mistakes by storing them in the model's context for future reference.
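The loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the LLM-backed steps (proposing an action, judging progress) are passed in as plain callables, and the `Step` record and `failures` list are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: object  # None for the initial state
    state: object

def run_episode(goal, propose_action, execute, reflect, start=0, max_steps=10):
    """Explore, reflect, and backtrack (sketch with stubbed LLM calls).

    propose_action(goal, state, failures) -> candidate action
    execute(state, action)                -> resulting state
    reflect(goal, old_state, new_state)   -> did this step make progress?
    """
    trajectory = [Step(None, start)]
    failures = []  # failed (state, action) pairs kept in context so the
                   # model can avoid repeating its mistakes
    for _ in range(max_steps):
        current = trajectory[-1].state
        action = propose_action(goal, current, failures)
        new_state = execute(current, action)
        if reflect(goal, current, new_state):
            trajectory.append(Step(action, new_state))
            if new_state == goal:
                break
        else:
            # backtrack: discard the step, stay at the last good state,
            # and remember the failure
            failures.append((current, action))
    return trajectory
```

With toy callables over integer states (try a big step, fall back to a small one after a recorded failure), the loop recovers from a bad action and still reaches the goal.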

Demonstration Retrieval and Synthesis

Wilbur takes a dual approach to leveraging past demonstrations. It uses goal-conditioned demonstrations to guide the agent on tasks it may face on unseen websites, and website-conditioned demonstrations to tailor the agent's actions to the specific nuances of a given webpage. The synergy between these two sources of knowledge helps the agent generalize across a wide range of web tasks. A dedicated ranking model then scores the candidate demonstrations and populates the LLM's prompt with the most helpful examples.
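The selection step can be illustrated as follows. The paper trains a differentiable ranking model; as a stand-in for the sketch, a simple token-overlap (Jaccard) score is used, and the demo schema (`goal`, `website`, `trace` keys) is a hypothetical one chosen for the example.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings (stand-in scorer)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_demonstrations(goal, website, demos, k=3):
    """Rank stored demonstrations and keep the top-k for the prompt.

    Each demo combines a goal-conditioned score (similar task, possibly a
    different site) and a website-conditioned score (same site, possibly a
    different task); the better of the two decides its rank.
    """
    def score(demo):
        return max(jaccard(goal, demo["goal"]),        # goal-conditioned
                   jaccard(website, demo["website"]))  # website-conditioned
    return sorted(demos, key=score, reverse=True)[:k]
```

A demonstration recorded on the same website ranks first even when its goal differs, while a similar goal from another website still outranks an unrelated one.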

Autocurriculum for Scalable Learning

Wilbur employs an auto-curriculum framework for generating and evaluating goals, enabling rapid learning on new websites and tasks. The framework uses an LLM both to formulate goals and to evaluate the agent's executions, so the system quickly acquires a rich dataset of successful and unsuccessful trajectories without manual annotation. By continuously folding these trajectories back into its knowledge base, Wilbur demonstrates a robust self-improving mechanism.
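The collection loop can be sketched as below. Again the LLM-backed pieces (goal sampling, the agent itself, the automatic evaluator) are stubbed as callables, and the two-bucket knowledge base is a simplification of whatever store the real system uses.

```python
def autocurriculum(propose_goals, run_agent, evaluate, n_goals=5):
    """Generative auto-curriculum sketch (no manual labels).

    propose_goals(n)          -> list of goal strings sampled from an LLM
    run_agent(goal)           -> trajectory produced by the agent
    evaluate(goal, trajectory) -> bool, LLM-judged success
    """
    knowledge_base = {"success": [], "failure": []}
    for goal in propose_goals(n_goals):
        trajectory = run_agent(goal)
        bucket = "success" if evaluate(goal, trajectory) else "failure"
        # both outcomes are kept: failures later serve as negative
        # demonstrations and training signal for the ranking model
        knowledge_base[bucket].append((goal, trajectory))
    return knowledge_base
```

Because the evaluator is itself an LLM, the whole pipeline runs unsupervised; the resulting labeled trajectories are what the demonstration store and ranking model are trained on.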

Evaluation and Results

Wilbur was evaluated using the WebVoyager benchmark, where it outperformed the existing state-of-the-art text-only models by achieving a significant 8% increase in the overall success rate. Remarkably, Wilbur demonstrated capabilities close to a strong multi-modal model, bridging the gap to within 5%, despite only being a text-based agent. Furthermore, Wilbur showcased substantial improvements on specific challenging websites, highlighting its ability to navigate complex web structures and interact effectively with diverse components.

Implications and Future Directions

Wilbur's approach marks a substantial step forward in web agent development, showcasing the benefits of intelligent backtracking, synthesized learning from demonstrations, and an auto-curriculum for data generation. The research highlights the role of nuanced learning and adaptation strategies in achieving high task-execution performance across the web. It also points to several avenues for further exploration, such as improving the agent's interaction with complex web elements and overcoming the engineering limitations inherent in web navigation.

Future work in this area may focus on refining backtracking algorithms, developing more sophisticated web interaction protocols, and exploring the integration of multi-modal inputs to further enhance the agent's understanding and manipulation of web environments. Additionally, addressing the challenges posed by anti-scraping measures and ensuring robust operation across a broader spectrum of web technologies represent important goals for subsequent research endeavors.

Wilbur's advancements underscore the dynamic and evolving nature of web agent technology, offering a compelling blueprint for future innovations in the field. The approach's balance between leveraging past learning and dynamically adapting to new challenges sets a new standard for developing capable and generalizable web agents.

Authors: Michael Lutz, Arth Bohra, Manvel Saroyan, Artem Harutyunyan, Giovanni Campagna