TheMCPCompany: An Enterprise MCP Benchmark
- TheMCPCompany is a benchmark that tests LLM agents' ability to identify and compose REST API tools using the Model Context Protocol in enterprise-like settings.
- It converts real-world REST APIs into over 18,000 MCP tools, emphasizing retrieval, planning, and multi-step tool orchestration.
- Empirical findings reveal that while advanced models like GPT-5 perform nearly as well with retrieved tools as with ground-truth tools, effectively navigating large tool inventories remains a significant challenge.
TheMCPCompany is a benchmark for evaluating whether a general-purpose LLM agent can solve realistic work tasks by using task-specific tools exposed through the Model Context Protocol (MCP), rather than relying only on browser interaction. It is built from real-world services whose REST APIs are converted into MCP servers, yielding over 18,000 tools, and it includes manually annotated ground-truth tools for each task. The benchmark is designed to study a specific systems question: whether modern agents can discover and compose the right tools in large, enterprise-like environments, and how that regime compares with browser-based interaction and with an oracle setting in which the relevant tools are provided directly (Esfandiarpoor et al., 22 Oct 2025).
1. Concept and scope
TheMCPCompany is framed around a shift in agent design from browser-mediated interaction toward API-backed tool use. In the paper’s formulation, MCP functions as the abstraction layer that converts heterogeneous real-world services into a common tool-calling environment. Rather than asking an agent to navigate web interfaces alone, the benchmark asks whether a general agent can operate over tool inventories derived from service REST APIs, including environments in which the available tool space reaches tens of thousands of tools (Esfandiarpoor et al., 22 Oct 2025).
This design targets a setting that the paper characterizes as enterprise-like. The tasks are meant to resemble realistic workplace objectives spanning multiple services and requiring planning, state tracking, and cross-tool composition. The benchmark therefore studies not only whether an agent can call tools, but whether it can identify and combine the right subset of tools in a large service ecosystem. The paper’s central diagnosis is that the relevant difficulty is no longer merely function calling; it is large-scale tool navigation under realistic service complexity (Esfandiarpoor et al., 22 Oct 2025).
A plausible implication is that TheMCPCompany should be read less as a narrow benchmark for API invocation and more as a benchmark for the combined problem of retrieval, planning, and execution in MCP-mediated environments. That interpretation is reinforced by the paper’s emphasis on the gap between performance with manually annotated ground-truth tools and performance with retrieved tools.
2. Construction and benchmark design
The benchmark is constructed from real-world services rather than synthetic toy APIs. The paper states that service REST APIs are converted into MCP servers, so that each service exposes MCP-callable tools corresponding to API operations. This produces a company-like environment in which an agent interacts with many business-relevant systems through standardized tools (Esfandiarpoor et al., 22 Oct 2025).
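As a rough illustration of what converting an API operation into an MCP-callable tool involves, the following minimal sketch maps one OpenAPI-style operation to a tool descriptor with the `name`, `description`, and `inputSchema` fields that MCP tools carry. It is not the authors' conversion pipeline: the function `operation_to_tool`, the naming rule, and the example operation are all hypothetical and chosen only for illustration.

```python
import json
import re


def operation_to_tool(method: str, path: str, operation: dict) -> dict:
    """Map one OpenAPI-style operation to an MCP-style tool descriptor.

    Illustrative only: the naming rule and field mapping are assumptions,
    not the conversion used to build the benchmark.
    """
    # Prefer the operationId; otherwise derive a name from method and path.
    raw_name = operation.get("operationId") or f"{method}_{path}"
    name = re.sub(r"[^A-Za-z0-9]+", "_", raw_name).strip("_").lower()

    properties, required = {}, []
    for param in operation.get("parameters", []):
        properties[param["name"]] = {
            "type": param.get("schema", {}).get("type", "string"),
            "description": param.get("description", ""),
        }
        if param.get("required", False):
            required.append(param["name"])

    return {
        "name": name,
        "description": operation.get("summary", ""),
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


# Hypothetical operation, not taken from the benchmark.
example_operation = {
    "operationId": "listProjectIssues",
    "summary": "List issues in a project.",
    "parameters": [
        {"name": "project_id", "in": "path", "required": True,
         "schema": {"type": "integer"}, "description": "Project identifier."},
        {"name": "state", "in": "query", "required": False,
         "schema": {"type": "string"}, "description": "Filter by issue state."},
    ],
}

tool = operation_to_tool("get", "/projects/{project_id}/issues", example_operation)
print(json.dumps(tool, indent=2))
```

Applied across every operation of every service, a mapping of this kind is what makes the tool count grow so quickly: each endpoint and method pair becomes a separate tool the agent may need to find.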
A defining design choice is the use of manually annotated ground-truth tools for every task. For each benchmark task, the authors identify the subset of tools that are actually relevant to solving it. These annotations serve two roles. First, they define the ground-truth-tools condition, in which the agent receives only the task-relevant tools. Second, they support retrieval evaluation by checking whether a retrieval system can surface the needed tools from the full catalog. In this sense, the benchmark explicitly separates two capabilities that are often conflated: the ability to execute a workflow once the correct tools are known, and the ability to find those tools in a very large tool universe (Esfandiarpoor et al., 22 Oct 2025).
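The retrieval-evaluation role of the annotations can be made concrete with a standard recall-at-k computation. The sketch below is a generic formulation, not the paper's scoring code; the tool names and the ranked list are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of annotated ground-truth tools found in the top-k retrieved list."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def all_relevant_retrieved(retrieved: list[str], relevant: set[str], k: int) -> bool:
    """True only if every annotated tool appears in the top-k list."""
    return relevant <= set(retrieved[:k])


# Hypothetical ranked retriever output and ground-truth annotation for one task.
ranked = ["create_issue", "list_issues", "delete_issue", "add_comment"]
ground_truth = {"list_issues", "add_comment"}

print(recall_at_k(ranked, ground_truth, k=2))
print(all_relevant_retrieved(ranked, ground_truth, k=4))
```

The stricter `all_relevant_retrieved` check matters for multi-step tasks, where missing a single required tool can make the task unsolvable even if recall is high.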
The exact headline counts for the total number of services and total number of tasks are not stated in the supplied account of the paper. What is stated is that the benchmark uses multiple real-world services, exposes over 18,000 tools, and is deliberately designed to enter the regime of tens of thousands of tools. The paper treats that scale as a defining challenge rather than as incidental background (Esfandiarpoor et al., 22 Oct 2025).
This construction sits within a broader MCP ecosystem that had already become large by 2025. MCPCorpus, for example, reports 13,875 MCP servers and 300 MCP clients in a reproducible ecosystem snapshot, emphasizing the rapid expansion of MCP artifacts and the need for structured analysis (Lin et al., 30 Jun 2025). TheMCPCompany addresses a different layer of that ecosystem: not cataloging MCP artifacts, but evaluating how agents behave when faced with a large MCP tool universe.
3. Tasks, tool access regimes, and retrieval
The paper evaluates three conceptually distinct regimes: browser-based agents, agents using ground-truth tools, and agents using tool retrieval. Browser-based agents interact with services through web interfaces. Ground-truth-tools agents receive the manually annotated task-relevant subset. Retrieval-based agents receive a candidate set selected from the full catalog and must choose and compose tools from that retrieved subset (Esfandiarpoor et al., 22 Oct 2025).
| Setting | Tool access | Purpose |
|---|---|---|
| Browser-based agents | Web/browser interface | Baseline general interaction |
| Ground-truth tools | Manually annotated relevant tools | Upper-bound-style execution setting |
| Tool retrieval | Retrieved candidate set from full catalog | Realistic MCP deployment setting |
The retrieval pipeline is included because the full tool catalog is too large to present directly and because overwhelming the model with many tool descriptions degrades performance. The paper therefore treats retrieval as a required systems component rather than a convenience. Retrieved tools are then presented as the model’s available options. The exact candidate generation algorithm, ranking formula, retriever model, top-k value, and prompt template details are not visible in the supplied account, although the paper evidently documents its retrieval setup elsewhere (Esfandiarpoor et al., 22 Oct 2025).
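Because the paper's retriever is not described in this account, the following is only a stand-in sketch of the general pattern: score every tool description against the task text and expose the top-k tools to the model. It uses a simple bag-of-words cosine similarity in place of whatever retriever the authors used, and the catalog entries, the query, and the function `retrieve_top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (underscores included)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_top_k(query: str, catalog: dict[str, str], k: int) -> list[str]:
    """Rank tools by similarity of name plus description to the query; return k names."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(f"{name} {description}"))), name)
        for name, description in catalog.items()
    ]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical four-tool catalog; the benchmark's catalog has over 18,000 tools.
catalog = {
    "list_issues": "List issues in a project",
    "create_issue": "Create a new issue in a project",
    "send_message": "Send a chat message to a channel",
    "list_channels": "List chat channels",
}

candidates = retrieve_top_k("list open issues in the project", catalog, k=2)
print(candidates)
```

In this toy catalog the second-ranked candidate is `list_channels`, which shares a verb with the query but belongs to an unrelated service. It is a small-scale instance of the similar-tool confusion that becomes far more severe with thousands of near-duplicate operations.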
The benchmark also distinguishes between simpler and more complex environments. According to the paper’s findings, tool-based agents can perform well in simpler settings or under tightly curated tool access, but performance drops sharply as the environment becomes more enterprise-like and the tool inventory expands. This suggests that complexity is driven not only by the number of tools, but by the density of semantically similar operations and the need for multi-step composition across services (Esfandiarpoor et al., 22 Oct 2025).
A plausible implication is that TheMCPCompany operationalizes two coupled bottlenecks. One is retrieval quality: whether the correct tools appear in the candidate set. The other is reasoning quality: whether the model can disambiguate, sequence, and parameterize the retrieved tools correctly once they are available.
4. Empirical findings
The headline result is that tool-based agents perform much better when they are given the right tools, but current agents struggle when they must find those tools among tens of thousands of options. The paper states that, with ground-truth tools, tool-calling agents show the potential for both improving performance and reducing costs, assuming perfect tool retrieval. When retrieval is introduced, all models with tool retrieval perform similarly to or better than browser-based agents, but smaller models cannot take full advantage of the available tools through retrieval. GPT-5 is highlighted as the strongest reported model, with performance under tool retrieval very close to its performance with ground-truth tools (Esfandiarpoor et al., 22 Oct 2025).
The paper’s broader conclusion is more restrictive than a generic “tools are better than browsers” claim. It states that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. The benchmark therefore reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems remains a challenging task for current models and requires both better reasoning and better retrieval models (Esfandiarpoor et al., 22 Oct 2025).
The reported failure modes are consistent with that interpretation. The paper identifies poor tool discovery, confusion among similar tools, multi-step composition failures, long-horizon planning errors, context overload, and reasoning-retrieval mismatch. In particular, the benchmark rejects the naive assumption that retrieval success is equivalent to task success. Even if the correct tool appears in the retrieved candidate set, the model must still identify it among distractors, infer the correct arguments, decide when to call it, and compose it with other actions (Esfandiarpoor et al., 22 Oct 2025).
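One way to make the distinction between retrieval success and task success operational is to label each failed episode by whether the annotated tools were available to the model at all. The sketch below is an assumed analysis scheme, not one taken from the paper; the `Episode` record, the label names, and the sample data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    retrieved: frozenset[str]   # tools shown to the model
    relevant: frozenset[str]    # manually annotated ground-truth tools
    solved: bool                # whether the task was completed


def classify(episode: Episode) -> str:
    """Attribute an outcome to retrieval or to what happens after retrieval."""
    if episode.solved:
        return "solved"
    if episode.relevant <= episode.retrieved:
        # Every needed tool was available, so the failure lies in selection,
        # argument inference, sequencing, or composition.
        return "post_retrieval_failure"
    return "retrieval_miss"


# Hypothetical episodes for illustration only.
episodes = [
    Episode(frozenset({"list_issues", "add_comment"}), frozenset({"list_issues"}), True),
    Episode(frozenset({"list_issues", "list_channels"}), frozenset({"list_issues"}), False),
    Episode(frozenset({"send_message"}), frozenset({"list_issues", "add_comment"}), False),
]

counts = Counter(classify(e) for e in episodes)
print(dict(counts))
```

A large `post_retrieval_failure` share would indicate that better retrieval alone cannot close the gap, which is the point the benchmark's failure analysis makes.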
This failure profile aligns with independent observations from MCPToolBench++, which also found that schema-level correctness and actual execution correctness can diverge substantially in real MCP environments, especially when live tool reliability and context-window pressure become limiting factors (Fan et al., 11 Aug 2025). TheMCPCompany extends that general diagnosis into an enterprise-scale, retrieval-heavy setting.
5. Relation to the broader MCP ecosystem
TheMCPCompany is situated within a broader research effort that treats MCP as an ecosystem problem rather than merely a protocol specification. MCPCorpus provides a large-scale inventory of MCP servers and clients for ecosystem analysis (Lin et al., 30 Jun 2025). A separate measurement study reports 8,401 valid public MCP projects after filtering a much larger raw marketplace corpus, arguing that public MCP growth is real but inflated by duplication, placeholders, and uneven quality (Guo et al., 29 Sep 2025). Against that backdrop, TheMCPCompany focuses on a different question: what happens when an agent must act inside that ecosystem rather than merely catalog it.
The benchmark also intersects with current work on MCP orchestration and scaling. Context-Aware MCP proposes moving coordination into a Shared Context Store to reduce repeated LLM calls and improve continuity in multi-step tasks, arguing that standard MCP is overly centralized and stateless (Jayanti et al., 6 Jan 2026). From that perspective, TheMCPCompany can be read as measuring a workload regime in which orchestration architecture becomes consequential: large tool catalogs, multi-service tasks, and retrieval-conditioned reasoning all stress the standard LLM-centric control loop.
A further connection concerns MCP service supply. Code2MCP argues that MCP adoption is bottlenecked by the manual effort required to convert existing repositories into MCP services and proposes an automated repository-to-MCP pipeline (Ouyang et al., 7 Sep 2025). If such automation expands the MCP tool supply, benchmarks like TheMCPCompany become more important because the limiting factor shifts from tool availability to tool discoverability and composition under scale.
6. Security, limitations, and significance
TheMCPCompany is primarily a competence and scalability benchmark rather than a security benchmark, but its setting is inseparable from MCP security concerns. The MCP ecosystem has already been shown to be vulnerable to tool-metadata poisoning, where malicious instructions in tool descriptions alter agent behavior before any tool executes (Wang et al., 19 Aug 2025). More broadly, ecosystem-level security studies report weak host verification, mutable tool metadata, registry-level trust problems, and large-scale weaknesses in open-source MCP servers (Li et al., 18 Oct 2025; Kumar et al., 10 Mar 2026). A plausible implication is that any realistic deployment of TheMCPCompany-style agents would need to solve both retrieval and trust: discovering the right tools and distinguishing safe tools from unsafe ones.
The benchmark’s limitations are explicitly acknowledged in the supplied account. The exact total number of services and tasks is not visible there. The retrieval algorithm, ranking formula, top-k value, and prompt details are also not visible. The paper uses task completion and cost as core outcomes, but the precise automatic scoring function, episode limits, and cost definitions are not available in the supplied text. No explicit equations or retrieval formulas are visible. These omissions do not undermine the benchmark’s central claim, but they limit exact reconstruction of the full experimental protocol from the present account (Esfandiarpoor et al., 22 Oct 2025).
Methodologically, the ground-truth-tools condition should also be interpreted carefully. It isolates execution ability under an oracle-like tool set and is useful for measuring the upper-bound potential of tool-based agents, but it is not the realistic deployment case. The gap between ground-truth tools and retrieval is itself one of the benchmark’s main results. This suggests that the practical barrier to enterprise MCP agents is not merely whether models can use tools, but whether they can navigate a crowded and ambiguous tool ecosystem with sufficient recall and precision.
In the MCP literature, TheMCPCompany’s main significance is diagnostic. It reframes the next stage of agent evaluation around large-scale tool navigation in enterprise-like environments. Its strongest claim is not that browser interaction is obsolete, nor that tool use is solved, but that the critical unresolved problem is how to make general-purpose agents discover and orchestrate task-specific tools at realistic ecosystem scale (Esfandiarpoor et al., 22 Oct 2025).