An Analysis of WorkArena++: A Benchmark for Autonomous Web Agents
Interest in autonomous agents powered by large language models (LLMs) has grown rapidly in the academic community, particularly in the context of automating routine knowledge work. The paper "WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks" introduces a benchmark for evaluating how well web agents execute realistic work tasks. Comprising 682 tasks derived from workflows commonly performed by knowledge workers, WorkArena++ assesses agents' planning, problem-solving, logical and arithmetic reasoning, information retrieval, and contextual understanding.
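For concreteness, here is a minimal sketch of how a benchmark task of this kind might be driven through the Gym-style interface of BrowserGym, the framework on which WorkArena is built. The task identifier and the placeholder policy are illustrative assumptions, not exact names from the WorkArena++ suite.

```python
# Minimal sketch of running a single WorkArena-style task via BrowserGym.
# The task id and the placeholder policy are illustrative assumptions,
# not exact names from the WorkArena++ suite.
import gymnasium as gym
import browsergym.workarena  # noqa: F401  (importing registers the tasks)


def placeholder_policy(obs):
    # A real agent would map the goal and page observation to an action;
    # this stub emits one illustrative BrowserGym-style action string.
    return 'click("a42")'


env = gym.make("browsergym/workarena.servicenow.order-ipad-pro")
obs, info = env.reset()

done = False
while not done:
    action = placeholder_policy(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()
```

The Gym-style interface matters because it lets the same agent code be evaluated across all 682 tasks without per-task glue.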
Core Contributions
The WorkArena++ benchmark stands out as a significant expansion and enhancement of the previously established WorkArena benchmark. The authors highlight the following core contributions:
- Task Complexity and Scale: The benchmark expands from 33 to 682 tasks of increased complexity, introducing composite tasks that demand nuanced skills such as problem-solving and sophisticated memorization. This marks a transition from atomic, simplistic tasks to ones that require intricate planning and decision-making, simulating real-world conditions more faithfully.
- Flexibility and Technical Enhancements: WorkArena++ improves visual diversity by incorporating fictitious companies with custom UI designs, strengthens the robustness of evaluations through database isolation and streamlined task composition, and supports fine-tuning by generating observation-action traces.
- Empirical Evaluation and Human Benchmarking: The authors conduct comprehensive empirical studies with state-of-the-art LLMs and vision-language models (VLMs), including GPT-3.5, GPT-4o, Llama3, and Mixtral. Despite these models' strong results on existing benchmarks, they struggle markedly on WorkArena++ tasks that humans solve with relative ease, a stark contrast in performance (the shape of such an evaluation is sketched after this list).
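To make the comparison methodology concrete, the sketch below shows an evaluation harness of the kind such a study implies: run an agent once on each task and report the fraction solved. The agent's `.act()` interface and the convention that a positive final reward signals success are assumptions for this sketch, not details confirmed by the paper.

```python
# Illustrative evaluation harness: run an agent once per task and report
# the success rate. The agent's .act() interface and the positive-reward-
# on-success convention are assumptions for this sketch.
import gymnasium as gym


def run_episode(env, agent, max_steps=50):
    """Roll out one episode; treat a positive final reward as success."""
    obs, info = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            return reward > 0
    return False  # step budget exhausted without finishing


def evaluate(task_ids, agent):
    """Return the fraction of tasks the agent solves."""
    successes = 0
    for task_id in task_ids:
        env = gym.make(task_id)
        successes += run_episode(env, agent)
        env.close()
    return successes / len(task_ids)
```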
Key Findings and Challenges
The experiments reveal that while humans achieved a success rate of approximately 93.9% on WorkArena++'s tasks, state-of-the-art models struggled significantly, achieving only minimal success. In particular, current LLMs and VLMs exhibit deficiencies in goal understanding, make erroneous assumptions about actions and their consequences, and hallucinate non-existent actions or task completions. These findings underscore the benchmark's stringent demands and provide a clear roadmap for the work needed to make autonomous agents capable of handling realistic enterprise tasks.
Implications and Future Directions
The implications of WorkArena++ are manifold. Practically, the benchmark raises the bar for deploying AI-driven web agents in real-world enterprise environments, suggesting that significant improvements are necessary before such agents can handle complex workflows autonomously. Theoretically, it motivates further exploration of agent design, particularly in areas like compositional planning, contextual reasoning, and better memory mechanisms.
Moving forward, the authors propose further expanding the task set, potentially incorporating elements of safety and cybersecurity, which are crucial for practical deployment. The ability to extract fine-tuning data also presents promising opportunities for improving training datasets, potentially leading to more robust models; a sketch of how such traces might be converted into training records follows.
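As a rough sketch of what that extraction pipeline could look like, the snippet below converts logged observation-action traces into supervised (prompt, completion) records in JSONL form. The trace schema ("goal", "steps", "observation", "action") is a hypothetical assumption for illustration; the benchmark's actual export format may differ.

```python
# Hedged sketch: convert observation-action traces into supervised
# fine-tuning records, one (prompt, completion) pair per recorded step.
# The trace schema used here is assumed for illustration and may not
# match the benchmark's actual export format.
import json


def traces_to_jsonl(traces, out_path):
    with open(out_path, "w") as f:
        for trace in traces:
            for step in trace["steps"]:
                record = {
                    "prompt": f"Goal: {trace['goal']}\nObservation: {step['observation']}",
                    "completion": step["action"],  # e.g. 'click("a42")'
                }
                f.write(json.dumps(record) + "\n")


# Usage with a tiny synthetic trace:
example = [{
    "goal": "Order a laptop from the service catalog",
    "steps": [{"observation": "<catalog page>", "action": 'click("a42")'}],
}]
traces_to_jsonl(example, "finetune.jsonl")
```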
In conclusion, the WorkArena++ benchmark constitutes an important step toward developing autonomous agents capable of efficiently performing complex web tasks. It offers a valuable framework for driving advances in the design and evaluation of autonomous systems within both academic and industrial settings. The ongoing improvements and expansions anticipated for this benchmark underscore the dynamic nature of this research domain and its pivotal role in shaping the future of AI-driven task automation.