WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? (2403.07718v5)

Published 12 Mar 2024 in cs.LG and cs.AI

Abstract: We study the use of LLM-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

WorkArena and BrowserGym: Benchmarking and Environment for Web Agents in Enterprise Applications

Introduction

LLMs have opened new possibilities for automated web interaction, enabling intelligent agents that navigate web interfaces and execute tasks on them. The enterprise domain is of particular interest, because its software systems reflect the complexity of real work. In this context, we introduce WorkArena, a benchmark that tests how well web agents accomplish tasks typically encountered by knowledge workers in enterprise environments. We also present BrowserGym, an environment for developing and evaluating such agents that offers a rich set of actions along with multimodal observations.

WorkArena: A New Benchmark for Enterprise Software Automation

WorkArena is built on the ServiceNow platform, a cloud-based suite used for digital workflow automation across many sectors. The benchmark version summarized here comprises 29 distinct tasks yielding over 23,150 unique instances (the abstract of the current revision reports 33 tasks), which together cover daily operations within enterprise software systems. Specifically, it includes:

  • List operations: Filtering and sorting through data tables,
  • Form interactions: Completing and submitting web forms with varying complexity,
  • Knowledge base navigation: Searching for information and retrieving specific details,
  • Service Catalog navigation: Browsing and ordering items with precise configurations,
  • Menu navigation: Utilizing software menus for application navigation and user impersonation.
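
One way to picture how a small number of tasks can yield many instances is to treat each task as a template whose goal is filled in from sets of parameter values. The sketch below is only an illustration under that assumption: the `TaskTemplate` class, the goal wording, and the parameter values are hypothetical and are not taken from WorkArena.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class TaskTemplate:
    """Hypothetical task template (not a WorkArena class)."""
    name: str
    goal_format: str
    parameters: Dict[str, Sequence[str]]

    def instances(self) -> List[str]:
        """Return one goal string per combination of parameter values."""
        keys = list(self.parameters)
        goals = []
        for values in product(*(self.parameters[k] for k in keys)):
            goals.append(self.goal_format.format(**dict(zip(keys, values))))
        return goals


# Illustrative values only; the real benchmark defines its own tasks.
sort_task = TaskTemplate(
    name="sort-list",
    goal_format="Sort the {table} list by {column} in {order} order.",
    parameters={
        "table": ["incident", "asset"],
        "column": ["number", "priority", "state"],
        "order": ["ascending", "descending"],
    },
)

goals = sort_task.instances()
print(len(goals))  # 2 * 3 * 2 = 12 instances from one template
print(goals[0])
```

Under this reading, the instance count of a task is the product of the sizes of its parameter sets, which is why a few dozen tasks can expand into thousands of instances.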

This benchmark aims to mimic real-world applications closely, presenting challenges such as dynamic and non-standard user interfaces, extensive document object models (DOMs), and the presence of advanced web technologies like iFrames and shadow DOMs.
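
To make the iFrame and shadow DOM challenge concrete, the following sketch walks a page tree that contains nested roots and gives every reachable element a unique identifier. It is a minimal illustration, not BrowserGym's implementation: the `Node` type, the `nested_root` field, and the id scheme are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element (not a BrowserGym type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    # Content document of an iFrame or root of a shadow DOM, if any.
    nested_root: Optional["Node"] = None
    uid: Optional[str] = None


def walk(node: Node) -> Iterator[Node]:
    """Yield every element depth-first, descending into nested roots."""
    yield node
    if node.nested_root is not None:
        yield from walk(node.nested_root)
    for child in node.children:
        yield from walk(child)


def assign_ids(root: Node) -> int:
    """Give each reachable element a unique string id; return the count."""
    count = 0
    for count, node in enumerate(walk(root), start=1):
        node.uid = str(count - 1)
    return count


# A page with one iFrame and one custom element hosting a shadow root.
page = Node("html", [
    Node("body", [
        Node("iframe", nested_root=Node("html", [Node("input")])),
        Node("custom-widget", nested_root=Node("button")),
    ])
])
total = assign_ids(page)
print(total)  # number of elements reached, including nested ones
```

A traversal that ignored `nested_root` would never reach the `input` or the `button`, which is the kind of blind spot that makes such pages hard for agents relying on a flat view of the DOM.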

BrowserGym: A Versatile Environment for Web Agent Development

BrowserGym is a Python-based environment for designing and evaluating web agents, with features beyond those offered by its predecessors. Key features include:

  • Chat-based user interactions: Allowing for dynamic task instructions and responses,
  • Augmented DOM attributes: Providing each element with unique identifiers, screen coordinates, visibility tags, and bounding boxes,
  • Rich observation space: Including the DOM, an accessibility tree, viewport screenshots, and error messages for comprehensive understanding,
  • Flexible action space: Facilitating interactions through Python code, high-level primitives, and coordinate-based inputs,
  • Multi-page navigation support: Ensuring compatibility with complex web applications.

By accommodating a wide range of observations and actions, BrowserGym enables experimentation with diverse agent architectures, including text-only, vision-augmented, and memory-augmented agents. The environment's flexibility also reduces the effort required to create new benchmarks or port existing ones, as demonstrated by the integration of MiniWoB within BrowserGym.
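
The interaction pattern described above, in which an agent receives an observation and answers with a high-level action, can be sketched with a toy environment. Everything here is hypothetical and deliberately simplified: `Observation`, `ToyFormEnv`, the tuple-based action format, and the scripted agent are illustrative names and do not reproduce BrowserGym's actual API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Observation:
    """Hypothetical observation; field names are illustrative only."""
    goal: str                 # instruction received through the chat
    elements: Dict[str, str]  # element id -> short text description
    last_error: str = ""      # error message from the previous action


class ToyFormEnv:
    """Toy stand-in for a web task: fill a field, then click submit."""

    def __init__(self) -> None:
        self.field_value = ""
        self.done = False

    def _observe(self, error: str = "") -> Observation:
        return Observation(
            goal="Set the 'Name' field to 'Alice' and submit the form.",
            elements={"0": "textbox Name", "1": "button Submit"},
            last_error=error,
        )

    def reset(self) -> Observation:
        self.field_value, self.done = "", False
        return self._observe()

    def step(self, action: Tuple[str, ...]) -> Tuple[Observation, float, bool]:
        """Apply a high-level action such as ("fill", "0", "Alice")."""
        name, *args = action
        error = ""
        if name == "fill" and len(args) == 2 and args[0] == "0":
            self.field_value = args[1]
        elif name == "click" and args == ["1"]:
            self.done = True
        else:
            error = f"invalid action: {action!r}"
        reward = 1.0 if self.done and self.field_value == "Alice" else 0.0
        return self._observe(error), reward, self.done


def scripted_agent(obs: Observation, step_index: int) -> Tuple[str, ...]:
    """Fixed plan standing in for an LLM that would read `obs`."""
    plan = [("fill", "0", "Alice"), ("click", "1")]
    return plan[min(step_index, len(plan) - 1)]


def run_episode(env: ToyFormEnv, max_steps: int = 5) -> float:
    """Run the observe-act loop until the task ends or steps run out."""
    obs = env.reset()
    reward = 0.0
    for t in range(max_steps):
        obs, reward, done = env.step(scripted_agent(obs, t))
        if done:
            break
    return reward


print(run_episode(ToyFormEnv()))
```

In this sketch, actions refer to elements by identifier instead of by screen coordinates, and an invalid action is reported back through `last_error` so that the agent could correct itself on the next step.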

Empirical Evaluation and Insights

Our empirical analysis shows that WorkArena poses a significant challenge for current web agents and that a substantial gap remains before full task automation is achieved. The evaluated LLMs, including GPT-4, GPT-3.5, and CodeLlama, show promising capabilities on simpler benchmarks such as MiniWoB but need further advancement to handle complex enterprise tasks. The results also reveal a marked performance disparity between closed-source and open-source models, with larger models like GPT-4 performing notably better.
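
Comparisons of this kind rest on per-task success rates. As a minimal sketch, assuming each episode ends in a binary success or failure, the rate and its standard error under the usual binomial approximation can be computed as follows. The function name and the outcome values are hypothetical and are not results from the paper.

```python
import math
from typing import Sequence, Tuple


def success_rate(outcomes: Sequence[bool]) -> Tuple[float, float]:
    """Return (success rate, standard error) for binary episode outcomes."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one episode is required")
    p = sum(outcomes) / n
    # Standard error of a proportion: sqrt(p * (1 - p) / n).
    se = math.sqrt(p * (1.0 - p) / n)
    return p, se


# Hypothetical outcomes for one agent on one task (not real results).
episodes = [True, True, True, False]
rate, stderr = success_rate(episodes)
print(f"success rate = {rate:.2f} +/- {stderr:.2f}")
```

With few episodes the standard error is large, so small differences between agents on a single task should be read with caution.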

BrowserGym's expansive feature set proved beneficial in enhancing agent performance across various scenarios. However, in the context of WorkArena, certain features, particularly those relating to 2D interactions, did not significantly contribute to agent performance, possibly due to the benchmark's design focusing on tasks solvable through high-level actions.

Future Directions

WorkArena and BrowserGym are significant steps toward understanding what web agents can do within enterprise software systems. Looking forward, integrating additional benchmarks and expanding the task diversity in WorkArena can provide further insight into the emergent properties of AI systems and their potential impact on automating knowledge work. This could eventually lead to efficient, intelligent agents capable of augmenting human productivity in the digital workplace.

Conclusion

WorkArena and BrowserGym represent important contributions to the field of LLMs and generative AI in the context of web automation. By offering a realistic and challenging benchmark coupled with a versatile testing environment, they set the stage for advancing our understanding and capabilities in automating complex, real-world tasks encountered in enterprise applications.

Authors (12)
  1. Alexandre Drouin
  2. Maxime Gasse
  3. Massimo Caccia
  4. Issam H. Laradji
  5. Manuel Del Verme
  6. Tom Marty
  7. Léo Boisvert
  8. Megh Thakkar
  9. Quentin Cappart
  10. David Vazquez
  11. Nicolas Chapados
  12. Alexandre Lacoste