LiveMCPBenchmark Evaluation

Updated 29 November 2025

LiveMCPBenchmark is a standardized evaluation framework that measures LLM agent planning, retrieval, and execution across diverse, real-world MCP environments.
It formalizes tasks as POMDPs and employs objective metrics such as success rate, efficiency, and agreement to benchmark multi-step tool orchestration.
The framework underpins reproducible research with open-source infrastructure, dynamic tool integration, and comprehensive error analysis.

LiveMCPBenchmark is a standardized evaluation framework for LLM agents that orchestrate and invoke services via the Model Context Protocol (MCP) across diverse, real-world environments. LiveMCPBenchmark enables rigorous measurement of agentic planning, tool retrieval, execution, and robustness within ecosystems of hundreds of MCP servers and thousands of tools. The benchmark serves as foundational infrastructure for reproducible, scalable research into agent capabilities, security, orchestration, and resilience—all using authentic, production APIs and tasks rooted in real multi-step application scenarios (Mo et al., 3 Aug 2025, Nizar et al., 22 Nov 2025).

1. Formal Task Model and Benchmark Definition

LiveMCPBenchmark conceptualizes the MCP tool-calling environment as a Partially-Observable Markov Decision Process (POMDP) with state space $\mathcal{S}$ (hidden world states), action set $\mathcal{A}$ (including routing, execution, and response), observation space $\Omega$ (tool descriptions and outputs), transition function $\mathcal{T}$ , and reward function $\mathcal{R}$ encoding successful task completion. User tasks are formalized as tuples $T = \langle S_0, \mathcal{A}, \Phi, P \rangle$ specifying initial state, available servers, and required key points. Agents must navigate uncertain, dynamic tool distributions and plan multi-hop trajectories to satisfy all annotated criteria (Mo et al., 3 Aug 2025).

The LiveMCPBenchmark suite comprises 95 real-world tasks distributed across domains including Office, Lifestyle, Leisure, Finance, Travel, and Shopping. Each task requires coordinated tool use—often spanning multiple servers—and is annotated by experts for authenticity, multi-step complexity, and key-point verification. Typical workflows involve two to four tool calls across one to two servers per task.

2. Dataset and MCP Ecosystem Composition

LiveMCPBenchmark builds on the MCP ecosystem with a curated LiveMCPTool suite of 70 MCP servers and 527 tools, drawn from >5,500 public registries and filtered for deployability (no proprietary dependencies). Tool metadata are standardized as:

Field	Example Value	Purpose
server_name	"mcp_finance"	MCP server identifier
tool_name	"get_stock_price"	Unique tool name within server
input_schema	JSON fields	Parameter types
output_schema	JSON fields	Response structure
categories	["Finance"]	Top-level taxonomy label

This schema enables retrieval, orchestration, and plug-and-play integration. Categories cover Discovery, Visualization, File Access, Code, Entertainment, Finance, Location, and Miscellaneous (Mo et al., 3 Aug 2025, Nizar et al., 22 Nov 2025).

3. Evaluation Methodology and Metrics

LiveMCPBenchmark employs LiveMCPEval, an "LLM-as-a-Judge" pipeline for automated, adaptive evaluation in dynamic environments. The judge receives the task description, key-point annotation, agent trajectory (tool call–response pairs), and tool metadata, then issues a binary success/failure verdict by mapping outputs to required criteria. For time-varying tasks, evaluation is replayed within controlled windows to ensure consistency.

Principal metrics are:

Success Rate (SR):

$\mathrm{SR} = \frac{\#\text{tasks judged Success}}{\text{Total tasks}}$

Efficiency: Number of dialogue turns, distinct tools used, tool executions, and retrieval calls per successful run.
Agreement: Human/LLM judge concordance, reported at 81% (DeepSeek-V3 as judge) (Mo et al., 3 Aug 2025).

Error taxonomy includes query error, retrieve error, tool error (bad parameters), and other system exceptions.

4. Agent Architectures and Planning Algorithms

The MCP Copilot Agent executes dynamic planning as a POMDP, maintaining state belief, retrieving candidate tools, executing API calls, updating beliefs, and terminating upon goal satisfaction. Tool retrieval employs similarity scoring over server descriptions and tool-specific documentation, typically via weighted reciprocal rank fusion (wRRF):

$\mathrm{score}(s_i,t_j) = \alpha \cdot \mathrm{sim}(\mathrm{desc}(s_i),q) + \beta \cdot \mathrm{sim}(\mathrm{desc}(t_j),q)$

Agents interleave routing, execution, and result-processing steps, with memory of previous actions. Retrieval-augmented architectures ("Agent-as-a-Graph") connect tools and servers as joint nodes in a knowledge graph and rerank with type-specific wRRF, achieving state-of-the-art Recall@5 and nDCG@5 (Nizar et al., 22 Nov 2025).

5. Quantitative Performance and Key Findings

Experiments on ten leading models under uniform temperature and prompt templates (Claude-Sonnet-4, GPT-4.1, DeepSeek-V3, Gemini-2.5-Pro, Qwen series) reveal significant performance variance, with overall success rates:

Model	Success Rate (%)
Claude-Sonnet-4	78.95
Claude-Opus-4	70.53
DeepSeek-R1	48.42
Qwen3-235B	48.42
GPT-4.1-Mini	44.21
Qwen2.5-72B	43.16
DeepSeek-V3	42.11
Gemini-2.5-Pro	41.05
GPT-4.1	38.95
Qwen3-32B	30.53

Model performance correlates with retrieval effectiveness and compositional planning: retrieval errors comprise ~45% of failures in Claude experiments, tool invocation errors another ~25%. Pareto analysis demonstrates linear trade-offs between interaction cost (dialogue turns) and performance.

Agent-as-a-Graph improves Recall@5 from 0.74 (ScaleMCP) to 0.85 and nDCG@5 from 0.40 to 0.47 with optimal agent/tool weighting (Nizar et al., 22 Nov 2025). Models under-utilize tool diversity and exhibit bottlenecks in hierarchical semantic retrieval.

6. Limitations, Variants, and Directions for Improvement

LiveMCPBenchmark is constrained by its reliance on LLM-judge evaluation, subject to possible trajectory-length bias and mismatches between tool description and runtime behavior. Proposed improvements involve hybrid evaluation (combining LLM assessments with executable validation scripts), knowledge graph retrieval augmentation, and built-in agent retry/fallback strategies.

Security and robustness benchmarks (MCPTox, MSB) reveal systemic vulnerabilities: tool poisoning and prompt injection within tool metadata frequently bypass alignment safeguards (Wang et al., 19 Aug 2025, Zhang et al., 14 Oct 2025). These findings motivate protocol enhancements such as metadata attestation, static description analysis, and permission enforcement.

Extensibility recommendations include:

Continuous ingestion of new MCP servers and tools
Automated schema-drift detection and validation
Integration of user feedback and objective scoring for longitudinal studies
Support for multilingual and multimodal agent queries

MSC-Bench demonstrates objective, curriculum-based evaluation beyond LLM-as-a-judge, reporting precision, recall, F1, and exact-match across five orchestration levels (Dong et al., 22 Oct 2025).

7. Significance and Impact for Agentic Research

LiveMCPBenchmark establishes a reproducible, scalable template for evaluating agentic tool-use in MCP-enabled environments. Its systematic task design, data-driven tool registry, objective metrics, and transparent error analyses enable rigorous benchmarking across planning, execution, and retrieval dimensions—facilitating progress in orchestration, robustness, and security for multi-server, cross-domain agentic systems. The benchmark’s open-source infrastructure (https://icip-cas.github.io/LiveMCPBench) underpins collaborative development of next-generation LLM agents, retrieval architectures, and standardized protocols in live, dynamic computational ecosystems (Mo et al., 3 Aug 2025, Nizar et al., 22 Nov 2025).