TheAgentCompany Benchmark: Evaluating LLM Agents
- TheAgentCompany Benchmark is an extensible framework that rigorously evaluates LLM-based agents on authentic, long-horizon digital tasks.
- It employs diverse simulation environments and realistic workplace tools to test capabilities across software engineering, finance, and other domains.
- The framework offers fine-grained metrics and systematic architectural ablations to guide enterprise adoption and refine agent designs.
TheAgentCompany Benchmark is an extensible, enterprise-oriented framework for evaluating LLM agents on consequential, digitally mediated real-world tasks. Designed to rigorously assess agent autonomy and reliability in complex multi-role workplace settings, it supplies comprehensive simulation environments, diverse task suites, and fine-grained quantitative metrics, with recent research extending its applicability to finance workflows and systematic agent architecture ablations (Xu et al., 2024; Milsom, 1 Dec 2025; Bogavelli et al., 13 Sep 2025).
1. Motivation and Benchmark Scope
The primary motivation for TheAgentCompany Benchmark is to provide a robust measure of LLM-based agent capability on professional tasks that closely model digital work in industry settings. It addresses critical gaps in prior benchmarks by evaluating agents on authentic, long-horizon tasks in a sandboxed software enterprise, requiring interactions that mirror those of human digital workers—namely, web browsing, code authoring, program execution, and multi-channel communication (Xu et al., 2024).
The benchmark informs both industry adoption strategies (by clarifying task categories currently automatable by agents) and economic policy (by providing empirical evidence on which workplace competencies remain un-automatable). It also enables systematic study of architectural dimensions influencing multi-agent reliability, cost, and workflow success (Bogavelli et al., 13 Sep 2025).
2. Simulation Environment and Task Construction
TheAgentCompany instantiates a synthetic startup providing graph databases, streaming systems, and web-scale applications, realized as a self-hosted Docker sandbox. The platform comprises:
- Workspace Tools: Chromium browser via Playwright, code editors, bash shell, and Jupyter IPython runtime.
- Intranet Services: GitLab (version control, Wiki), OwnCloud (document repository), Plane (issue tracking), RocketChat (chat, messaging).
- Simulated Colleagues (NPCs): Role-played by LLMs (Claude-3.5-Sonnet driven through the Sotopia platform), appearing as CTOs, engineers, managers, and domain specialists to provide heterogeneous social interactions.
Task design emphasizes:
- Professional domain breadth (Software Engineering, Product/Project Management, Data Science, Finance, Admin/HR).
- Human-like interactions—many tasks require agents to seek information, clarification, or approvals from NPCs.
- Long-horizon, partially graded tasks, with checkpoints for intermediate progress or partial solutions.
- Multiple interaction modalities: browser automation, shell/CLI operations, code execution, and chat.
All services are deployed as reproducible, open-source containers pre-populated with realistic project artifacts and datasets. Synthetic data generation and deterministic task evaluators ensure test consistency and auditability (Xu et al., 2024; Milsom, 1 Dec 2025).
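To make the deterministic-grading idea concrete, the following is a minimal Python sketch of a checkpoint evaluator in the spirit of the benchmark's graders; the GitLab endpoint, token, checkpoint functions, and point scheme are illustrative assumptions, not the benchmark's actual evaluator API.

```python
import requests

# Hypothetical checkpoint evaluator sketch: inspect sandbox service state
# deterministically. Endpoint, credential, and checkpoints are placeholders.
GITLAB = "http://localhost:8929/api/v4"    # assumed self-hosted GitLab in the sandbox
HEADERS = {"PRIVATE-TOKEN": "root-token"}  # placeholder credential

def checkpoint_mr_merged(project_id: int, mr_iid: int) -> bool:
    """Checkpoint: the agent's merge request was actually merged."""
    r = requests.get(f"{GITLAB}/projects/{project_id}/merge_requests/{mr_iid}",
                     headers=HEADERS, timeout=10)
    return r.ok and r.json().get("state") == "merged"

def checkpoint_file_exists(project_id: int, path: str, branch: str = "main") -> bool:
    """Checkpoint: a required artifact exists in the repository (path must be URL-encoded)."""
    r = requests.head(f"{GITLAB}/projects/{project_id}/repository/files/{path}",
                      params={"ref": branch}, headers=HEADERS, timeout=10)
    return r.ok

def grade(checkpoints: list[tuple[int, bool]]) -> tuple[int, int]:
    """Sum awarded vs. possible points over (points, passed) checkpoint results."""
    awarded = sum(pts for pts, passed in checkpoints if passed)
    possible = sum(pts for pts, _ in checkpoints)
    return awarded, possible
```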
3. Agent Architectures and Benchmark Dimensions
TheAgentCompany enables systematic exploration of agent architectures. Supported frameworks include:
- OpenHands CodeAct + Browsing: Single-agent, ReAct-style prompt architectures maintaining full task history, with LLM-invoked toolsets (BrowserGym, bash, IPython) (Xu et al., 2024); a minimal loop sketch follows this list.
- OWL-RolePlay: Multi-agent configurations, featuring dedicated planner, browser, and coder agents communicating via shared memory.
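As referenced above, here is a minimal sketch of a ReAct-style single-agent loop with complete-history memory. The `llm` callable, the `FINISH`/`ACTION` string conventions, and the bash-only tool registry are assumptions for illustration, not the OpenHands API.

```python
import subprocess

def run_bash(cmd: str) -> str:
    """Tool: execute a shell command in the sandbox and return its output."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=60)
    return out.stdout + out.stderr

TOOLS = {"bash": run_bash}  # browser/IPython tools would register here as well

def react_loop(llm, task: str, max_steps: int = 30) -> str:
    history = [f"Task: {task}"]          # complete-memory variant: keep everything
    for _ in range(max_steps):
        reply = llm("\n".join(history))  # model emits a thought + action each turn
        history.append(reply)
        if reply.startswith("FINISH"):   # termination convention assumed here
            return reply
        if reply.startswith("ACTION bash:"):
            obs = TOOLS["bash"](reply.removeprefix("ACTION bash:").strip())
            history.append(f"Observation: {obs[:2000]}")  # truncate long outputs
    return "max steps exceeded"
```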
Recent architectural ablations extend this with 18 configurations across four design axes (Bogavelli et al., 13 Sep 2025):
- Orchestration: Open agent network, isolated agents (via central orchestrator), or single generalist agent.
- Prompt Implementation: ReAct (“thought” trace emission) vs. FunctionCalling (direct tool APIs).
- Memory: Complete (full trace) or summarized (concise context) memory.
- Thinking Tools: Injected auxiliary reasoning aids (e.g., math engines and synthesis tools, including no-op variants) that give models an explicit, extensible channel for intermediate reasoning.
This systematic combinatorial design enables fine-grained measurement of how agent system-level choices impact task outcomes across diverse LLMs.
| Orchestration | Prompt Type | Memory | Thinking Tools |
|---|---|---|---|
| Open / Isolated / SingleAgent | ReAct / FunctionCalling | Complete / Summarized | Enabled / Disabled |
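These axes can be enumerated programmatically, as in the sketch below. Note that the full cross-product yields 24 combinations, so the reported 18 configurations presumably omit some pairings; the exact exclusions are not detailed here.

```python
from itertools import product

# Illustrative enumeration of the AgentArch design axes (names assumed).
ORCHESTRATION = ["open", "isolated", "single_agent"]
PROMPT = ["react", "function_calling"]
MEMORY = ["complete", "summarized"]
THINKING = [True, False]

configs = [
    {"orchestration": o, "prompt": p, "memory": m, "thinking_tools": t}
    for o, p, m, t in product(ORCHESTRATION, PROMPT, MEMORY, THINKING)
]
print(len(configs))  # 24 in the full grid; the paper evaluates 18 of these
```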
4. Evaluation Methodology and Metrics
Agent performance is measured via task-level, checkpoint-based automated graders with the following core metrics (Xu et al., 2024; Milsom, 1 Dec 2025; Bogavelli et al., 13 Sep 2025):
- Full Completion Score: $S_{\text{full}} = \mathbf{1}[\text{all checkpoints passed}] \in \{0, 1\}$.
- Partial Completion Score: $S_{\text{partial}} = p / P$, where $p$ is the sum of awarded checkpoint points and $P$ the sum of possible checkpoint points.
- Steps: Total LLM API calls per episode.
- Cost: Total LLM API expenditure per episode (input and output token charges summed over all calls).
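A minimal sketch of these task-level scores, assuming the simple $p/P$ partial-credit rule defined above:

```python
def full_score(passed: list[bool]) -> int:
    """S_full = 1 iff every checkpoint passed, else 0."""
    return int(all(passed))

def partial_score(awarded: list[int], possible: list[int]) -> float:
    """S_partial = p / P, with p the awarded and P the possible checkpoint points."""
    p, P = sum(awarded), sum(possible)
    return p / P if P else 0.0
```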
Domain-specific benchmarks (e.g., wealth management) track accuracy, success rate, and cost–accuracy Pareto performance.
Additional AgentArch-specific metrics include Acceptable Rate (requiring correct tool choice, arguments, and outcome), pass@1, pass@k, TTC (time-to-completion), and tool error rates (Bogavelli et al., 13 Sep 2025).
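As a rough illustration of these AgentArch-style metrics, the sketch below assumes a hypothetical per-attempt trial record; the field names are inventions for this example, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    tool_ok: bool      # agent picked the correct tool
    args_ok: bool      # with the correct arguments
    outcome_ok: bool   # and the call produced the intended outcome
    success: bool      # task-level success on this attempt

def acceptable_rate(trials: list[Trial]) -> float:
    """Fraction of trials with correct tool, arguments, and outcome."""
    ok = [t for t in trials if t.tool_ok and t.args_ok and t.outcome_ok]
    return len(ok) / len(trials) if trials else 0.0

def pass_at_1(first_attempts: list[Trial]) -> float:
    """pass@1: fraction of tasks solved on the first attempt."""
    return sum(t.success for t in first_attempts) / len(first_attempts) if first_attempts else 0.0
```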
5. Empirical Results and Comparative Insights
In the original TheAgentCompany suite (175 tasks), top closed-API agents such as Gemini-2.5-Pro and Claude-3.7-Sonnet complete up to 30% of tasks fully autonomously (partial completion ~40%); open-weights models lag significantly (≤7.4% success) (Xu et al., 2024). Category granularity reveals pronounced variation:
| Category | Full Completion (Gemini-2.5-Pro) | Partial Completion |
|---|---|---|
| SWE | 37.7% | 45.1% |
| PM | 39.3% | 52.6% |
| DS | 14.3% | 20.1% |
| Admin | 13.3% | 19.2% |
| HR | 34.5% | 45.0% |
| Finance | 8.3% | 21.6% |
Platform-specific performance also demonstrates difficulty gradients: GitLab and Plane (structured code/project tasks) score highest, while document-centric (OwnCloud) and social (RocketChat) tasks show the lowest agent efficacy.
AgentArch (Bogavelli et al., 13 Sep 2025) systematically compares 18 agent configurations across six LLMs and two enterprise “use cases” (Time Off, Customer Routing). On the simple Time Off task, the best configuration (SingleAgent, FunctionCalling, SummarizedMemory, ThinkingTools enabled) yields 70.8% success (GPT-4.1), but for the complex Customer Routing task, the highest score is 35.3% (Claude Sonnet 4). FunctionCalling generally surpasses ReAct, multi-agent orchestration is crucial for complex logic, and thinking tools notably aid non-reasoning models. Loop failure and hallucination rates are substantially higher in ReAct and multi-agent settings.
Wealth-management extensions (Milsom, 1 Dec 2025) demonstrate substantial gains in accuracy (up to 69.4% on checkpointed analysis tasks under low-autonomy prompts), but highlight persistent brittleness in integrating cross-tool delivery, authentication, and domain-specific reasoning.
6. Representative Failure Modes and Design Implications
Key observed failure cases include:
- Difficulty in robust multi-turn dialogue and context tracking with human-simulated NPCs.
- Failure with dynamic/complex web interfaces (forms, pop-ups, document parsing).
- Looping in browsing or tool invocation, skipping substeps, and misunderstanding role responsibilities.
- Delivery failures in cross-platform workflows (e.g., incomplete RocketChat messaging, incorrect file uploads).
- High variability in performance depending on task prompt specificity (high vs. low autonomy).
This suggests that agent-centric workflow augmentation is currently viable primarily for well-scoped, repetitive digital chores, but not for ambiguous, high-stakes, or non-structured tasks. Mixed human–AI workflows remain the recommended model for workplace adoption, leveraging agent strengths for code and routine analytics while reserving social and open-ended activities for humans.
7. Limitations, Open-Source Artifacts, and Prospects
Released under an open-source license, the benchmark provides all Docker configurations, task specifications, evaluator scripts, and agent baselines at https://the-agent-company.com. Across its versions and domain extensions, several limitations are noted:
- No human baseline (for comparative speed/reliability/cost).
- Closed synthetic environments (file, schema, and encoding variability limited).
- Task focus on scoped administrative and engineering workflows, with little coverage of creative, exploratory, or open-ended project work.
Proposed extensions include broader domain coverage (industries, physical/embodied settings), handling of ambiguous intent, GUI-based software benchmarks, multi-agent planner–executor handoffs, and systematic robustness testing under controlled data perturbations (Milsom, 1 Dec 2025). Research continues into optimal orchestration, memory, and prompt design, with TheAgentCompany Benchmark positioned as a reference for enterprise-grade agent system evaluation and comparison (Bogavelli et al., 13 Sep 2025).
References
- (Xu et al., 2024) "TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks"
- (Milsom, 1 Dec 2025) "Benchmarking LLM Agents for Wealth-Management Workflows"
- (Bogavelli et al., 13 Sep 2025) "AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise"