WebArena: A Realistic Web Environment for Building Autonomous Agents (2307.13854v4)

Published 25 Jul 2023 in cs.AI, cs.CL, and cs.LG

Abstract: With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art LLMs are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.



The paper "WebArena: A Realistic Web Environment for Building Autonomous Agents" addresses limitations in how autonomous agents are developed and evaluated by introducing a realistic, reproducible environment designed for web-based tasks. The authors aim to bridge the gap between existing synthetic environments and real-world applications by providing a setting that closely mimics genuine web interactions.

Context and Contributions

Current autonomous agents are often built and tested in simplified synthetic environments, and they struggle to generalize to real-world scenarios. The authors address this gap with WebArena, a highly realistic web environment comprising four fully functional websites from distinct domains: e-commerce, social forums, collaborative software development, and content management. These sites are populated with authentic data and offer the functionality of their real-world counterparts. The environment also integrates utility tools (e.g., a map) and external knowledge bases (e.g., user manuals) to support human-like task solving.

A second contribution is a benchmark of 812 diverse, long-horizon tasks that represent activities humans routinely perform on the web. Evaluation targets functional correctness, that is, whether the task's goal was actually achieved, rather than whether the agent's actions match a reference action sequence. This gives a more reliable indication of an agent's real-world applicability, since many different action sequences can accomplish the same goal.
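The difference between the two evaluation styles can be illustrated with a toy sketch. Everything here is hypothetical and for illustration only: `ShopState`, `apply_actions`, and the `(verb, item)` action format are assumptions, not WebArena's actual evaluator or API. The point is that an outcome check accepts any trajectory that reaches the goal state, while a sequence check rejects valid alternatives.

```python
from dataclasses import dataclass, field


@dataclass
class ShopState:
    """Hypothetical stand-in for a site's backend state (not WebArena's API)."""
    cart: dict = field(default_factory=dict)


def apply_actions(state, actions):
    """Replay (verb, item) actions against the toy state and return it."""
    for verb, item in actions:
        if verb == "add":
            state.cart[item] = state.cart.get(item, 0) + 1
        elif verb == "remove" and item in state.cart:
            state.cart[item] -= 1
            if state.cart[item] == 0:
                del state.cart[item]
    return state


def matches_reference(actions, reference):
    """Action-sequence check: passes only on an exact match with one reference."""
    return actions == reference


def functionally_correct(actions, goal_cart):
    """Outcome check: passes whenever the resulting state meets the goal."""
    return apply_actions(ShopState(), actions).cart == goal_cart


reference = [("add", "mug"), ("add", "pen")]
# A different but equally valid trajectory: other order, plus an undone detour.
alternative = [("add", "pen"), ("add", "tea"), ("remove", "tea"), ("add", "mug")]
goal = {"mug": 1, "pen": 1}

print(matches_reference(alternative, reference))  # False
print(functionally_correct(alternative, goal))    # True
```

In this sketch the alternative trajectory fails the sequence comparison yet reaches exactly the goal state, which is the kind of case an outcome-based evaluation is meant to credit.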

Experimental Setup and Results

The authors implemented several baseline agents within this environment, built on state-of-the-art LLMs such as GPT-4 and PaLM-2 and integrating techniques such as reasoning before acting. Despite their capabilities, these models struggle with complex interactions. The best agent, based on GPT-4, reaches an end-to-end task success rate of 14.41%, compared with human performance of 78.24%. This gap highlights how far current LLMs are from reliably executing long-horizon tasks that require human-like reasoning and decision-making.
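As a small worked example of the numbers above, the sketch below computes an end-to-end success rate from per-task pass/fail outcomes and the gap between the two reported figures. The helper `success_rate` and the toy outcome list are illustrative assumptions; the check against 812 tasks additionally assumes the agent's rate was computed over the full benchmark, which this summary does not state explicitly.

```python
def success_rate(outcomes):
    """Percentage of tasks whose final outcome was judged correct."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy outcomes for illustration only (not benchmark data).
print(success_rate([True, False, False, True]))  # 50.0

# Figures reported in the paper.
agent_sr = 14.41  # best GPT-4-based agent, in percent
human_sr = 78.24  # human performance, in percent

gap_points = round(human_sr - agent_sr, 2)
print(gap_points)  # 63.83 percentage points

# Assumption: if the agent's rate is taken over all 812 tasks,
# it corresponds to this many solved tasks.
solved = round(agent_sr / 100 * 812)
print(solved)                          # 117
print(round(100 * solved / 812, 2))    # 14.41
```

The gap is expressed in percentage points rather than as a relative change, since the two rates are themselves percentages.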

Analysis and Implications

The paper indicates that present-day LLMs lack the active exploration and failure recovery needed to execute intricate tasks in realistic environments. A common error pattern in the experiments is early stopping: the agent incorrectly concludes that a task is infeasible and gives up. These findings point to the need for more robust language-guided agents that can handle the complexity of web-based tasks.

WebArena offers a testbed for advancing this research, with opportunities to improve exploration strategies, hierarchical planning, navigation, and nuanced task execution. Future work can draw on findings from WebArena to improve agents' adaptability, reliability, and efficiency in environments that approach the complexity of human activities.

Future Directions

WebArena lays a foundation for future work on incorporating memory and on experiential learning, in which agents learn from past trials to streamline task execution on new instances. Improving agents' ability to recover from errors, self-correct, and build a conceptual understanding of tasks also remains crucial.

Overall, WebArena is a significant step forward in environment design for AI agents, pushing the community toward models that can carry out the nuanced interactions and decision-making that humans perform on the web.