TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks (2412.14161v1)

Published 18 Dec 2024 in cs.CL

Abstract: We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in LLMs, there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications for both industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights LLMs (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture on task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.


An Evaluation Framework for LLM Agents in Professional Contexts

The paper "TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks" presents a novel evaluation framework intended to rigorously assess the capabilities of AI agents, particularly those powered by LLMs, in performing tasks characteristic of digital work environments. This research is rooted in the escalating integration of AI into workflows across industries, accentuated by the improvements in LLMs that promise to automate a spectrum of work-related tasks. The principal contribution of this work is the introduction of TheAgentCompany, a comprehensive benchmark designed to systematically evaluate AI agents through a series of professional tasks modeled after a digital workplace.

Framework Design and Methodology

TheAgentCompany creates a controlled, reproducible environment that simulates operations within a small software company. The environment provides self-contained internal websites and data, so agents can carry out tasks that mirror those of a digital worker. Key agent activities include browsing the web, writing code, running programs, and communicating with simulated colleagues. The environment is designed to address a notable gap in existing benchmarks, which often lack the complexity or realism needed to assess an AI agent's effectiveness in typical professional settings.

In setting up the environment, the authors prioritize a range of interface interactions that an AI might need in a real-world context, including using Python scripts, web browsers, and chat platforms. Moreover, the benchmark incorporates tasks from various domains such as software engineering, project management, finance, and human resources, thereby providing a diverse set of challenges for AI agents.

Experimental Results and Observations

Experiments with a set of foundation LLMs run through the OpenHands agent framework show that even the most competitive agent, powered by Claude 3.5 Sonnet, autonomously completes only 24% of tasks. This underscores a significant gap between current AI capabilities and comprehensive automation of professional tasks. Other models, such as Gemini 2.0 Flash and OpenAI's GPT-4o, show the same limitation, and performance varies notably across task types and domains.

A deep dive into task performance by category highlights discrepancies in AI proficiency across domains. Tasks that involve social interaction or require intricate web navigation, such as those using RocketChat or ownCloud, pose substantial challenges. This suggests that while AI can manage straightforward software engineering tasks with relative competence, it falters in domains requiring nuanced understanding and interaction.
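The per-domain comparison described above comes down to a completion rate computed overall and then per task category. The following is a minimal sketch of that tally, not the benchmark's actual scoring code: the `TaskResult` record, its field names, and the toy results are all hypothetical and chosen purely for illustration, and the numbers are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one benchmark run (not the paper's schema)."""
    task_id: str     # made-up identifier
    domain: str      # e.g. "software engineering", "finance"
    completed: bool  # True if the agent finished the task autonomously


def completion_rate(results):
    """Return the fraction of completed tasks, or 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(r.completed for r in results) / len(results)


def completion_rate_by_domain(results):
    """Group results by domain and compute the completion rate of each group."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.domain].append(r)
    return {domain: completion_rate(rs) for domain, rs in grouped.items()}


# Toy data for illustration only; these are NOT results from the paper.
toy_results = [
    TaskResult("t1", "software engineering", True),
    TaskResult("t2", "software engineering", False),
    TaskResult("t3", "finance", False),
    TaskResult("t4", "human resources", False),
]

overall = completion_rate(toy_results)
per_domain = completion_rate_by_domain(toy_results)
```

Reporting the per-domain rates next to the overall figure is what exposes the kind of gap the summary describes: a single headline number can hide the fact that most successes come from one category.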

Implications and Future Directions

The outcomes of this research have significant practical and theoretical implications. Practically, they highlight the limitations of current AI in automating broad swathes of professional tasks, suggesting a need for continued advances in AI's ability to understand and interact with complex environments and systems. Theoretically, the paper suggests that AI development might benefit from a broader focus, beyond traditional coding and software tasks, on skills relevant to administrative and interactive work.

The authors outline several future directions, emphasizing the expansion of the benchmark to cover tasks across other industries and more complex creative or conceptual tasks that are not currently well-represented. They also suggest developing tasks with less defined goals to better reflect real-world conditions where task ambiguity is common.

In conclusion, TheAgentCompany provides an evaluation tool that underscores both the capabilities and the limitations of contemporary AI agents, setting a baseline for future improvements and adaptations. It serves as a point of reference for both academic inquiry and practical development in AI, aiming ultimately for agents that can more fully integrate into and contribute to professional environments.