WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Published 12 Mar 2024 in cs.LG and cs.AI | (2403.07718v5)

Abstract: We study the use of LLM-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

Summary

The paper introduces WorkArena, a novel benchmark that tests web agents on 29 enterprise tasks via over 23,000 unique instances.
It presents BrowserGym, a Python-based environment offering rich multimodal observations and flexible action spaces to support agent development.
Empirical evaluations reveal that larger models, such as GPT-4, perform significantly better, underscoring challenges in automating complex enterprise tasks.

WorkArena and BrowserGym: Benchmarking and Environment for Web Agents in Enterprise Applications

Introduction

LLMs have opened new possibilities for automated web interaction: agents that navigate web interfaces and carry out tasks on a user's behalf. The enterprise domain is a natural target, because the complexity of its software systems mirrors the complexity of real work. In this context, we introduce WorkArena, a benchmark that tests how well web agents accomplish tasks that knowledge workers routinely perform in enterprise environments. We also present BrowserGym, an environment for developing and evaluating such agents, which offers a rich set of actions along with multimodal observations.

WorkArena: A New Benchmark for Enterprise Software Automation

WorkArena is built on the ServiceNow platform, a cloud-based suite used for digital workflow automation across various sectors. The benchmark comprises 29 distinct tasks, yielding over 23,150 unique instances, which together represent daily operations within enterprise software systems. Specifically, it includes:

List operations: Filtering and sorting through data tables,
Form interactions: Completing and submitting web forms with varying complexity,
Knowledge base navigation: Searching for information and retrieving specific details,
Service Catalog navigation: Browsing and ordering items with precise configurations,
Menu navigation: Utilizing software menus for application navigation and user impersonation.

The benchmark aims to closely mimic real-world applications, and so presents challenges such as dynamic and non-standard user interfaces, very large document object models (DOMs), and web technologies such as iframes and shadow DOMs.
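To illustrate why iframes and shadow DOMs complicate things for an agent, here is a minimal sketch of a toy DOM traversal. The `Node` class and both walk functions are hypothetical stand-ins written for this example; they are not part of WorkArena, BrowserGym, or any browser API. The sketch only shows that a traversal limited to ordinary child elements misses everything nested inside a frame or a shadow root.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Toy stand-in for a DOM element (hypothetical, not a browser API)."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    shadow_root: Optional["Node"] = None  # content attached behind a shadow DOM
    frame_doc: Optional["Node"] = None    # document loaded inside an iframe


def walk_children_only(node: Node, path: str = "") -> Iterator[str]:
    """Yield element paths, following ordinary child links only."""
    here = f"{path}/{node.tag}"
    yield here
    for child in node.children:
        yield from walk_children_only(child, here)


def walk_everything(node: Node, path: str = "") -> Iterator[str]:
    """Yield element paths, also descending into shadow roots and iframes."""
    here = f"{path}/{node.tag}"
    yield here
    for child in node.children:
        yield from walk_everything(child, here)
    if node.shadow_root is not None:
        yield from walk_everything(node.shadow_root, here + "/#shadow")
    if node.frame_doc is not None:
        yield from walk_everything(node.frame_doc, here + "/#frame")


# A small page: a form inside an iframe and a button inside a shadow root.
page = Node("html", [
    Node("body", [
        Node("iframe", frame_doc=Node("html", [Node("form", [Node("input")])])),
        Node("x-widget", shadow_root=Node("div", [Node("button")])),
    ])
])

light_paths = list(walk_children_only(page))
all_paths = list(walk_everything(page))
```

In this toy page the input field and the button, the two elements an agent would actually need, appear only in `all_paths`.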

BrowserGym: A Versatile Environment for Web Agent Development

BrowserGym is a Python-based environment for designing and evaluating web agents, with features beyond those offered by its predecessors. Key features include:

Chat-based user interactions: Allowing for dynamic task instructions and responses,
Augmented DOM attributes: Providing each element with unique identifiers, screen coordinates, visibility tags, and bounding boxes,
Rich observation space: Including the DOM, an accessibility tree, viewport screenshots, and error messages for comprehensive understanding,
Flexible action space: Facilitating interactions through Python code, high-level primitives, and coordinate-based inputs,
Multi-page navigation support: Ensuring compatibility with complex web applications.
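The interplay between unique element identifiers, high-level action primitives, and error messages in the observation can be sketched with a toy example. Everything below is an assumption made for illustration: the `Observation` fields, the `click` and `fill` primitive names, and the `apply_action` helper are hypothetical and should not be read as BrowserGym's actual API. The sketch parses an action string without executing it, applies it to a dictionary standing in for a page, and reports any failure back through the observation.

```python
import ast
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical primitives and their argument counts (illustrative only).
PRIMITIVES = {"click": 1, "fill": 2}


@dataclass
class Observation:
    dom: str                     # serialized page content
    axtree: str                  # simplified accessibility-tree text
    screenshot: Optional[bytes]  # viewport image; None in this text-only toy
    last_error: str              # error raised by the previous action, if any


def parse_action(text: str) -> Tuple[str, Tuple[str, ...]]:
    """Parse 'name("arg", ...)' into (name, args) without executing it."""
    try:
        call = ast.parse(text.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ValueError(f"malformed action: {text!r}") from exc
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError("action must be a single call to a primitive")
    name = call.func.id
    if name not in PRIMITIVES:
        raise ValueError(f"unknown primitive: {name}")
    if call.keywords or len(call.args) != PRIMITIVES[name]:
        raise ValueError(f"{name} expects {PRIMITIVES[name]} positional argument(s)")
    args = tuple(ast.literal_eval(arg) for arg in call.args)  # literals only
    if not all(isinstance(arg, str) for arg in args):
        raise ValueError("arguments must be string literals")
    return name, args


def apply_action(fields: Dict[str, str], text: str) -> Observation:
    """Apply an action to a toy page (element id -> value) and observe the result."""
    error = ""
    try:
        name, args = parse_action(text)
        if args[0] not in fields:
            raise ValueError(f"no element with id {args[0]!r}")
        if name == "fill":
            fields[args[0]] = args[1]
    except ValueError as exc:
        error = str(exc)  # surfaced to the agent instead of crashing the episode
    dom = "".join(f'<input id="{k}" value="{v}">' for k, v in fields.items())
    axtree = "\n".join(f"[{k}] textbox value={v!r}" for k, v in fields.items())
    return Observation(dom, axtree, None, error)


toy_page = {"a1": ""}
obs = apply_action(toy_page, 'fill("a1", "laptop")')
```

Returning the error as part of the observation, rather than raising it, lets an agent see what went wrong and try a different action on its next step.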

By supporting a wide range of observations and actions, BrowserGym enables experimentation with diverse agent architectures, including text-only, vision-augmented, and memory-augmented agents. The environment's flexibility also reduces the effort required to create new benchmarks or port existing ones, as demonstrated by the integration of MiniWoB within BrowserGym.
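As a rough sketch of how a text-only agent with optional memory might interact with such an environment, consider the toy loop below. It assumes a Gymnasium-style interface, where `reset()` returns an observation and an info dict and `step()` returns observation, reward, terminated, truncated, and info. `ToyFormEnv`, `TextAgent`, and `stub_policy` are hypothetical names invented for this example, and the stub policy stands in for what would be an LLM call in a real agent.

```python
from typing import Callable, Dict, List, Tuple

Obs = Dict[str, str]


class ToyFormEnv:
    """Toy task: solved once the 'name' field holds the goal value."""

    def __init__(self, goal_value: str, max_steps: int = 5) -> None:
        self.goal_value = goal_value
        self.max_steps = max_steps
        self.value = ""
        self.steps = 0

    def _obs(self) -> Obs:
        return {"goal": f"Set the name field to {self.goal_value}",
                "page": f"name={self.value!r}"}

    def reset(self) -> Tuple[Obs, dict]:
        self.value, self.steps = "", 0
        return self._obs(), {}

    def step(self, action: str) -> Tuple[Obs, float, bool, bool, dict]:
        self.steps += 1
        if action.startswith("type:"):
            self.value = action[len("type:"):]
        terminated = self.value == self.goal_value
        truncated = not terminated and self.steps >= self.max_steps
        return self._obs(), float(terminated), terminated, truncated, {}


class TextAgent:
    """Builds a text prompt from the observation, optionally adding past steps."""

    def __init__(self, policy: Callable[[str], str], use_memory: bool = True) -> None:
        self.policy = policy
        self.use_memory = use_memory
        self.memory: List[str] = []

    def act(self, obs: Obs) -> str:
        lines = [f"Goal: {obs['goal']}", f"Page: {obs['page']}"]
        if self.use_memory:
            lines += [f"Earlier: {m}" for m in self.memory]
        action = self.policy("\n".join(lines))
        self.memory.append(f"saw {obs['page']}, did {action}")
        return action


def stub_policy(prompt: str) -> str:
    """Stand-in for an LLM call: types the last word of the goal line."""
    goal_line = prompt.splitlines()[0]
    return "type:" + goal_line.rsplit(" ", 1)[-1]


def run_episode(env: ToyFormEnv, agent: TextAgent) -> float:
    """Run one episode and return the final reward."""
    obs, _ = env.reset()
    reward, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(agent.act(obs))
        done = terminated or truncated
    return reward


final_reward = run_episode(ToyFormEnv("Alice"), TextAgent(stub_policy))
```

Swapping `use_memory` on or off, or adding a screenshot field to the prompt builder, changes the agent variant without touching the environment, which is the kind of separation the paragraph above describes.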

Empirical Evaluation and Insights

Our empirical analysis shows that WorkArena poses a significant challenge for current web agents, leaving a substantial gap to full task automation. The evaluation of state-of-the-art LLMs, including GPT-4, GPT-3.5, and CodeLlama, indicates promising capabilities on simpler benchmarks like MiniWoB but underscores the need for further advances in handling complex enterprise tasks. The results also point to the role of model size and architecture, with larger models like GPT-4 performing notably better.
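Comparisons like these usually rest on a per-task success rate computed over many episodes. The sketch below shows one common way to compute it, together with a binomial standard error. This is an assumption for illustration: the summary does not describe the paper's exact aggregation, and the episode outcomes used here are made-up placeholder values, not results from the paper.

```python
import math
from typing import Sequence, Tuple


def success_rate(rewards: Sequence[float]) -> Tuple[float, float]:
    """Return (success rate, standard error) for a list of episode rewards.

    An episode counts as a success when its reward is positive. The standard
    error uses the binomial formula sqrt(p * (1 - p) / n).
    """
    n = len(rewards)
    if n == 0:
        raise ValueError("at least one episode is required")
    p = sum(1 for r in rewards if r > 0) / n
    return p, math.sqrt(p * (1 - p) / n)


# Placeholder outcomes for four episodes (1.0 = solved, 0.0 = failed).
placeholder_rewards = [1.0, 0.0, 1.0, 1.0]
rate, stderr = success_rate(placeholder_rewards)
```

With only a handful of episodes the standard error is large, so differences between agents are meaningful only when measured over enough task instances.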

BrowserGym's broad feature set improved agent performance across various scenarios. In WorkArena, however, certain features, particularly those relating to 2D interactions, did not significantly contribute to agent performance, possibly because the benchmark's tasks are designed to be solvable through high-level actions.

Future Directions

WorkArena and BrowserGym are a step toward understanding what web agents can do within enterprise software systems. Looking forward, integrating additional benchmarks and expanding the task diversity in WorkArena can provide further insights into the emergent properties of AI systems and their potential impact on automating knowledge work. This could eventually lead to efficient, intelligent agents capable of augmenting human productivity in digital work.

Conclusion

WorkArena and BrowserGym represent important contributions to the field of LLMs and generative AI in the context of web automation. By offering a realistic and challenging benchmark coupled with a versatile testing environment, they set the stage for advancing our understanding and capabilities in automating complex, real-world tasks encountered in enterprise applications.
