
MCPEval: Evaluation Framework for LLM Agents

Updated 8 September 2025
  • MCPEval is an open-source, automated evaluation framework designed to assess LLM-based agents using dynamic task generation, iterative verification, and dual-perspective analysis.
  • It leverages the Model Context Protocol (MCP) to automatically generate task instructions, verify execution, and compare agent trajectories against established ground truths.
  • Empirical evaluations across diverse domains demonstrate its ability to standardize benchmarking with configurable metrics and minimal human intervention.

MCPEval is an open-source, automated evaluation framework for assessing LLM-based intelligent agents in tool-augmented, interactive environments built on the Model Context Protocol (MCP). It combines automated task generation, iterative task verification, and dual-perspective evaluation, incorporating both granular tool-call matching and rubric-based LLM judging, to provide standardized, reproducible benchmarking of agent capabilities across diverse domains. Its architecture integrates directly with native agent tools, enabling rapid, large-scale assessment with minimal manual effort (Liu et al., 17 Jul 2025).

1. System Architecture and Process

MCPEval organizes its evaluation workflow into three principal stages:

  1. Task Generation: The system queries an MCP server to obtain tool specifications, which serve as the context for the agent’s operational environment. A dedicated Task-LLM uses this context to generate detailed, parameterized task instructions that specify the required sequence of tool calls and their arguments.
  2. Task Verification: A “frontier agent” (an MCP client) executes the generated tasks by interacting with the MCP server. If the initial task fails due to missing or malformed tool parameters, the agent triggers an update process, refining the task instructions until successful execution is achieved. This verification yields confirmed task specifications and a ground truth agent trajectory.
  3. Evaluation and Reporting: The model-under-test, acting as an MCP client, is presented with the verified tasks. Its execution (tool call names, parameters, sequencing) is captured and compared to the ground truth using strict and flexible matching schemes. Additionally, a rubric-based LLM acts as a judge, scoring performance on planning quality, execution flow, context sensitivity, and requirement satisfaction.

This automated, multi-phase process distinguishes MCPEval from traditional static benchmarks by dynamically capturing task complexity and verifying agent behavior in realistic, interactive settings.
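
As a rough orientation, the three stages might be wired together as in the following Python sketch. Every name in it (list_tools, generate_tasks, verify, execute, match, judge) is a hypothetical placeholder chosen to illustrate the data flow, not MCPEval's actual API.

```python
# Hypothetical sketch of MCPEval's three-stage data flow; all injected
# callables are illustrative placeholders, not the framework's interfaces.
from typing import Any, Callable

def run_pipeline(
    list_tools: Callable[[], list],            # Stage 1: MCP server tool specs
    generate_tasks: Callable[[list], list],    # Stage 1: Task-LLM
    verify: Callable[[Any], tuple],            # Stage 2: frontier agent -> (task, ground truth)
    execute: Callable[[Any], Any],             # Stage 3: model under test
    match: Callable[[Any, Any], dict],         # Stage 3: tool-call matching
    judge: Callable[[Any, Any], dict],         # Stage 3: rubric-based LLM judging
) -> list:
    tool_specs = list_tools()
    tasks = generate_tasks(tool_specs)
    results = []
    for task in tasks:
        verified_task, ground_truth = verify(task)       # may refine the task
        run = execute(verified_task)
        results.append({
            "task": verified_task,
            "tool_call_scores": match(run, ground_truth),
            "judge_scores": judge(verified_task, run),
        })
    return results
```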

2. Task Generation and Verification

Automated task synthesis in MCPEval begins with an MCP server call to enumerate available tool APIs and their specifications. The Task-LLM conditions on this output to generate procedural task descriptions that reflect the tool ecosystem. Because auto-generated instructions may be incomplete, the framework deploys a “frontier agent” to execute these tasks, catching failures such as invalid parameters, tool unavailability, or misordered calls. The verification process iterates—reissuing improved task instructions—until all elements are executable, thereby constructing explicit ground truth trajectories for subsequent evaluation.

This iterative verification pipeline ensures that:

  • All tasks are feasible in the target MCP environment.
  • Generated agent trajectories are grounded in actual tool use and parameterization, rather than synthetic or idealized agent logic.
  • Human effort in scenario engineering and error correction is minimized.
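
The refine-and-retry loop described above can be sketched as follows; the Trajectory structure and the try_execute and refine_task callables are assumptions introduced for illustration (as is the retry cap), not MCPEval's actual interfaces.

```python
# Hypothetical sketch of the verify-and-refine loop; refine_task, try_execute,
# and the Trajectory fields are placeholders, not MCPEval's actual API.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Trajectory:
    tool_calls: list        # (tool name, parameters) pairs actually executed
    error: Optional[str]    # None if the task ran to completion

def verify_task(task: str,
                try_execute: Callable[[str], Trajectory],
                refine_task: Callable[[str, str], str],
                max_rounds: int = 5) -> tuple[str, Trajectory]:
    """Re-issue refined instructions until the frontier agent executes the
    task end-to-end, yielding the ground-truth trajectory."""
    trajectory = try_execute(task)
    rounds = 0
    while trajectory.error is not None and rounds < max_rounds:
        # e.g. missing/malformed parameters, unavailable tools, misordered calls
        task = refine_task(task, trajectory.error)
        trajectory = try_execute(task)
        rounds += 1
    if trajectory.error is not None:
        raise RuntimeError(f"Task could not be verified: {trajectory.error}")
    return task, trajectory
```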

3. Standardized Evaluation Metrics

MCPEval introduces domain-agnostic, standardized metrics to facilitate uniform benchmarking:

Tool Call Matching:

  • Name Match Score: Assesses the exactness or approximate similarity (with configurable thresholds) between the agent’s tool names and the ground truth.
  • Parameter Match Score: Quantifies accuracy or similarity of provided tool parameters, with flexible similarity thresholds (e.g., ≥0.6).
  • Order Match Score: Captures alignment of tool call sequences, supporting partial credit with thresholds (e.g., ≥0.5).

The composite metric is:

\[
\text{Overall Score} = 0.4\,\text{(Name Match)} + 0.4\,\text{(Parameter Match)} + 0.2\,\text{(Order Match)}
\]

where the weighting reflects the relative importance of each aspect in agent task completion fidelity.
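
A minimal sketch of how these scores could be combined, using the weights from the formula and the example thresholds above; the thresholded helper and the hard-coded similarity values are illustrative assumptions, not MCPEval's actual matching implementation.

```python
# Illustrative computation of the composite tool-call score; the similarity
# values below are made up for the example.
def overall_score(name_match: float, param_match: float, order_match: float) -> float:
    """Weighted composite: 0.4 * name + 0.4 * parameter + 0.2 * order."""
    return 0.4 * name_match + 0.4 * param_match + 0.2 * order_match

def thresholded(score: float, threshold: float) -> float:
    """Flexible matching: award credit only when similarity clears the threshold."""
    return score if score >= threshold else 0.0

# Example with the thresholds mentioned above (>= 0.6 for parameters, >= 0.5 for order):
name = 1.0                                  # exact tool-name match
param = thresholded(0.72, threshold=0.6)    # near-match parameters, above threshold
order = thresholded(0.40, threshold=0.5)    # sequence too far off, no credit
print(overall_score(name, param, order))    # 0.4*1.0 + 0.4*0.72 + 0.2*0.0 ≈ 0.688
```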

LLM Judger Analysis:

  • Trajectory Score: Evaluates reasoning and procedural coherence in the sequence of tool invocations.
  • Completion Score: Measures output correctness, requirement coverage, completeness, and usefulness, as judged by an expert-like LLM according to a rubric.

This dual-analysis framework enables both micro-level (API contract adherence) and macro-level (holistic task outcome) assessment.
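
The rubric-based judging step could look roughly like the following sketch. The rubric dimensions mirror the ones named above, but the prompt wording, the call_judge callable, and the JSON output format are assumptions for illustration, not MCPEval's actual judge prompts.

```python
# Hypothetical sketch of rubric-based LLM judging; the prompt text and the
# injected judge callable are illustrative, not MCPEval's actual prompts or API.
from typing import Callable

RUBRIC_DIMENSIONS = [
    "planning quality",
    "execution flow",
    "context sensitivity",
    "requirement satisfaction",
]

def judge_run(task: str, trajectory: str, final_answer: str,
              call_judge: Callable[[str], dict]) -> dict:
    """Ask a judge LLM for a trajectory score (procedural coherence) and a
    completion score (correctness and coverage of the final output)."""
    prompt = (
        "You are grading an LLM agent's tool-use run.\n"
        f"Task: {task}\n"
        f"Tool-call trajectory: {trajectory}\n"
        f"Final answer: {final_answer}\n"
        f"Score each dimension from 0 to 1: {', '.join(RUBRIC_DIMENSIONS)}.\n"
        "Return JSON with 'trajectory_score' and 'completion_score'."
    )
    return call_judge(prompt)   # e.g. a thin wrapper around any chat-completion API
```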

4. Empirical Evaluation and Results

MCPEval was applied to ten models (including open- and closed-source LLMs) across five real-world domains: Healthcare, Finance, Airbnb, Sports, and National Parks. In these experiments:

  • Healthcare domain tasks exhibited higher tool call and trajectory scores, attributable to regulated, well-defined APIs.
  • Airbnb tasks demonstrated the largest execution-completion gaps; models could often initiate correct API interactions yet failed to produce comprehensive recommendations in output synthesis.
  • Across Finance, Sports, and National Parks, performance varied according to API diversity and lexical complexity, with some smaller, tool-augmented models excelling on specific sub-tasks.
  • Closed-source GPT-4 variants (notably GPT-4o) led on both strict and flexible tool call matching and in LLM judger metrics, but several lightweight models exhibited competitive strengths in specialized domains.

A repeated observation was that most models achieved higher trajectory than completion scores, indicating greater success in procedural logic than in final output composition.

5. Technical Innovations

Key technical contributions of MCPEval include:

  • Automated Ground Truth Construction: Dynamic task generation and feedback-driven verification eliminate the need for manual, static benchmark creation, ensuring currency and relevance across diverse application environments.
  • Dual Evaluation Perspective: Integrating deterministic (API signature matching) and holistic (LLM rubric scoring) assessment captures nuanced failures, such as correct procedural logic paired with an incomplete final answer, that static correctness checks cannot observe.
  • Flexible Thresholds and Weighting: Configurable similarity thresholds and weighted scoring make the metrics applicable across domains with heterogeneous tooling and API signatures, while supporting partial credit for near-matches.
  • Automated Aggregation and Reporting: Reporting pipelines automatically synthesize detailed performance breakdowns at the overall and per-domain levels, plus error analyses that identify lexical, parameter, or procedural failure modes.
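
To make the configurability concrete, the snippet below sketches a hypothetical evaluation configuration whose keys mirror the thresholds, weights, and reporting levels described above; the schema itself is an assumption, not MCPEval's actual configuration format.

```python
# Hypothetical evaluation configuration mirroring the thresholds and weights
# described above; the schema is illustrative, not MCPEval's config format.
EVAL_CONFIG = {
    "matching": {
        "name_match": {"mode": "flexible"},                 # exact or similarity-based
        "parameter_match": {"similarity_threshold": 0.6},   # partial credit above 0.6
        "order_match": {"similarity_threshold": 0.5},       # partial credit above 0.5
    },
    "weights": {"name": 0.4, "parameter": 0.4, "order": 0.2},
    "judge": {
        "model": "<judge-llm>",                             # placeholder identifier
        "scores": ["trajectory", "completion"],
    },
    "reporting": {
        "levels": ["overall", "per_domain"],
        "error_analysis": ["lexical", "parameter", "procedural"],
    },
}
```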

6. Open Source Release and Reproducibility

The MCPEval framework is publicly available (Liu et al., 17 Jul 2025), supporting community adoption and iterative refinement. Open sourcing advances:

  • Reproducibility: Researchers can replicate experimental protocols and outcomes.
  • Comparability: Standardized scoring simplifies benchmarking of new models and domains against existing results.
  • Extension: Additional domains and tool environments can be integrated through MCP-interoperable task and API specifications.

This open approach is deemed vital for establishing community-wide standards in the evaluation of increasingly complex, tool-augmented LLM agents.

7. Impact and Future Trajectories

MCPEval represents a significant step toward robust, scalable, and meaningful evaluation of LLM-based AI agents operating in real-world, tool-supported scenarios. By automating both the scenario generation and evaluation process, and standardizing metrics for nuanced agent behavior, MCPEval enables deep analysis, rapid iteration, and reliable comparison. The framework’s design further supports future evolution to accommodate new tool types, interactive modes, and richer output quality assessments—addressing the demands of an evolving landscape in large-scale, tool-augmented conversational agents.

References

  • Liu et al., MCPEval, 17 Jul 2025.