LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries (2508.15760v1)

Published 21 Aug 2025 in cs.CL and cs.AI

Abstract: Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.

Summary

The paper introduces LiveMCP-101, a benchmark with 101 complex tasks to rigorously assess the performance of MCP-enabled agents in dynamic settings.
It employs a dual-agent execution framework and LLM-based scoring to diagnose planning, tool selection, and parameterization errors.
Experimental results highlight significant performance gaps among models, emphasizing challenges in multi-step tool orchestration and adaptive reasoning.

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Introduction and Motivation

The LiveMCP-101 benchmark addresses a critical gap in the evaluation of agentic LLMs operating in dynamic, real-world environments via the Model Context Protocol (MCP). While MCP has standardized tool integration for LLM agents, prior benchmarks have been limited to synthetic, single-step, or static scenarios, failing to capture the complexity and temporal variability inherent in production deployments. LiveMCP-101 introduces 101 rigorously curated tasks spanning web search, file operations, mathematical reasoning, and data analysis, each requiring multi-step, cross-domain tool orchestration. The benchmark construction leverages iterative LLM rewriting and extensive manual review to ensure both complexity and practical solvability.

Figure 1: Construction and evaluation framework of LiveMCP-101, illustrating the dual-agent setup and real-time reference execution for robust benchmarking.

Benchmark Design and Evaluation Methodology

LiveMCP-101 is distinguished by its three-tiered difficulty structure (Easy, Medium, Hard), with tasks averaging 5.4 tool-calling steps and involving up to 41 MCP servers and 260 tools. Each query is paired with a validated ground-truth execution plan, generated via LLMs and refined through human verification. Because the reference is a plan rather than a stored output, it can be re-executed at evaluation time, so reference outputs stay valid despite the temporal drift of live MCP services.

The evaluation framework launches two parallel executions per task: a reference agent strictly follows the validated plan, while the test agent autonomously analyzes the query and interacts with a superset of MCP tools, including distractors. This setup enables fine-grained diagnosis of planning, tool selection, parameterization, and output handling errors. Scoring is performed by an LLM-as-a-judge using a 1-5 Likert scale for both final results and execution trajectories, mapped to discrete success rates and quality metrics.
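The aggregation step can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the linear mapping of a 1-5 rating onto [0, 1], the rule that only a top result score counts as a success, and the names `JudgedTask`, `likert_to_unit`, and `summarize` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JudgedTask:
    """Hypothetical record of one task as scored by the LLM judge."""
    result_score: int      # 1-5 rating of the final answer
    trajectory_score: int  # 1-5 rating of the execution trajectory


def likert_to_unit(score: int) -> float:
    """Map a 1-5 Likert rating linearly onto [0, 1] (assumed mapping)."""
    if not 1 <= score <= 5:
        raise ValueError(f"Likert score must be in 1..5, got {score}")
    return (score - 1) / 4


def summarize(tasks: List[JudgedTask]) -> Dict[str, float]:
    """Aggregate per-task judge ratings into benchmark-level metrics."""
    n = len(tasks)
    if n == 0:
        raise ValueError("no tasks to summarize")
    return {
        # Assumption: a task is a success only if the result gets the top rating.
        "task_success_rate": sum(t.result_score == 5 for t in tasks) / n,
        "avg_result_score": sum(likert_to_unit(t.result_score) for t in tasks) / n,
        "avg_trajectory_score": sum(likert_to_unit(t.trajectory_score) for t in tasks) / n,
    }


# Illustrative ratings only, not data from the paper.
example = [JudgedTask(5, 4), JudgedTask(3, 3), JudgedTask(5, 5), JudgedTask(1, 2)]
metrics = summarize(example)
```

Keeping the success rate strict while averaging the graded scores separately lets partial credit show up in the quality metrics without inflating the headline success number.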

Experimental Results and Model Analysis

Eighteen LLMs were evaluated, including proprietary (OpenAI, Anthropic, Google) and open-source (Qwen3, Llama-3) models. The strongest model, GPT-5, achieved a task success rate (TSR) of 58.42% overall, with only 39.02% TSR on Hard tasks, indicating substantial headroom for improvement in tool orchestration and long-horizon planning. Extended thinking (ET) variants of Anthropic models showed consistent gains, while open-source models lagged significantly, with Llama-3.3-70B-Instruct and Llama-3.1-8B-Instruct nearly failing on Hard tasks.
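As a quick sanity check on the headline figure, the snippet below recovers the number of solved tasks implied by a reported percentage. It assumes the overall TSR is computed over all 101 benchmark tasks and reported to two decimals; the helper name `counts_matching` is hypothetical.

```python
from typing import List


def counts_matching(reported_pct: float, n_tasks: int, decimals: int = 2) -> List[int]:
    """Return every success count k such that k / n_tasks rounds to reported_pct."""
    target = round(reported_pct, decimals)
    return [
        k for k in range(n_tasks + 1)
        if round(100 * k / n_tasks, decimals) == target
    ]


# Assumption: the overall TSR is taken over all 101 tasks.
solved = counts_matching(58.42, 101)
print(solved)  # [59]
```

Under that assumption, a 58.42% success rate corresponds to 59 of 101 tasks solved, which means the strongest model still fails on 42 tasks.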

Performance degrades sharply with increasing task difficulty and tool pool size. Closed-source models exhibit a log-shaped curve in token efficiency: initial token increases yield rapid gains, but further increases plateau, suggesting diminishing returns from verbosity and redundant self-checks. Open-source models fail to convert additional tokens into reliable evidence, reflecting lower token efficiency and planning competence.

Ablation Studies: Iteration Rounds and Tool Pool Size

Ablation studies reveal that increasing the maximum iteration rounds from 15 to 25 improves TSR across all models, but further increases yield negligible gains, indicating that planning quality, not iteration capacity, is the primary bottleneck. Expanding the MCP server pool from 6 to 15 disproportionately degrades performance in weaker and mid-tier models due to increased distractor noise and long-context sensitivity, while top-tier models remain stable.

Figure 2: Ablation study results showing the impact of iteration rounds and MCP server pool size on TSR and relative performance changes across model tiers.

Error Analysis and Failure Modes

A comprehensive error classification identifies seven failure subtypes across three categories: tool planning/orchestration, parameter errors, and output handling. Semantic errors dominate, even in strong models (16–25%), while syntactic errors are catastrophic for models lacking MCP-specific fine-tuning (e.g., Llama-3.3-70B-Instruct at ~48%). Overconfident self-solving and unproductive thinking are prevalent in mid-tier models, often leading to premature termination or excessive reliance on internal knowledge.

Figure 3: Error classification heatmap across models, decomposing failures into seven fine-grained subtypes and highlighting dominant error patterns.

Evaluation Reliability

Human–LLM agreement studies using Cohen's $\kappa$ demonstrate high reliability of the LLM-as-a-judge framework, with >85% agreement on result evaluations and >78% on trajectory evaluations, supporting the scalability and consistency of automated scoring in agentic benchmarks.
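For reference, Cohen's $\kappa$ corrects the observed agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below implements the plain, unweighted form. The summary here does not say which variant was used, and weighted variants are common for ordinal scales such as 1-5 ratings, so treat this as an illustration of the statistic rather than the paper's exact procedure. The function name and the sample ratings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Illustrative ratings only, not data from the paper.
human = [5, 5, 3, 1, 5, 3]
judge = [5, 5, 3, 1, 3, 3]
kappa = cohens_kappa(human, judge)  # 17/23, roughly 0.739
```

In this toy example the raters agree on 5 of 6 items (about 83%), yet $\kappa$ is lower at roughly 0.74, because some of that agreement would be expected by chance alone.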

Implications and Future Directions

LiveMCP-101 establishes a rigorous standard for evaluating autonomous agentic capabilities in realistic, temporally evolving environments. The benchmark exposes persistent challenges in multi-step tool orchestration, adaptive reasoning, and token efficiency, even for frontier LLMs. The detailed error taxonomy and ablation insights suggest that future advances will require targeted improvements in planning, tool selection, and MCP-specific schema grounding, as well as more efficient utilization of token budgets.

Practically, the benchmark provides a scalable framework for diagnosing agentic failures and guiding model development. Theoretically, it motivates research into compositional reasoning, robust tool use under distractors, and dynamic adaptation to evolving external services. Future work may explore curriculum learning for tool orchestration, hierarchical agent architectures, and more sophisticated evaluation protocols that account for real-world variability and long-horizon dependencies.

Conclusion

LiveMCP-101 delivers a comprehensive and challenging benchmark for MCP-enabled agents, revealing substantial gaps in current LLM capabilities and providing actionable insights for advancing autonomous tool-augmented AI systems. The benchmark's design, evaluation methodology, and diagnostic framework set a new bar for agentic evaluation, with implications for both practical deployment and foundational research in agentic AI.
