MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

Published 17 Jul 2025 in cs.AI and cs.CL | (2507.12806v1)

Abstract: The rapid rise of LLM-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.

Summary

  • The paper introduces MCPEval, an MCP-based framework that standardizes and automates deep evaluation of LLM-based agent models using a three-stage workflow.
  • The methodology employs LLM-driven task generation, iterative verification, and multi-dimensional evaluation, integrating tool call analysis with rubric-based LLM judgements.
  • An evaluation of ten models across five domains reveals performance hierarchies and execution-completion gaps, and validates the framework’s ability to deliver precise diagnostic insights.

MCPEval: An Automated, MCP-Based Deep Evaluation Framework for AI Agent Models

The paper introduces MCPEval, an open-source, fully automated evaluation framework for LLM-based agent models, leveraging the Model Context Protocol (MCP) to standardize and scale the assessment of agentic capabilities across diverse domains. MCPEval addresses critical limitations in existing evaluation methodologies, particularly the reliance on static benchmarks, manual data collection, and the lack of deep integration with real-world tools and protocols.

Framework Design and Methodology

MCPEval is architected around a three-stage workflow: task generation, task verification, and model evaluation. The system utilizes MCP as the backbone for agent-environment communication, enabling seamless integration with native agent tools and external systems. The task generation process is LLM-driven, producing detailed, tool-specific instructions based on the capabilities exposed by MCP servers. Recognizing the inherent incompleteness of initial LLM-generated tasks, MCPEval incorporates an iterative verification phase, where a frontier agent executes the tasks, refines them upon failure, and establishes high-quality ground truth trajectories.
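
To make the three-stage workflow concrete, the sketch below shows how the stages could be composed. It is a minimal illustration, not the actual MCPEval API: the Task class, the generate_tasks/verify_tasks/evaluate_model functions, and the LLM and agent interfaces they call are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    instruction: str                                        # natural-language task description
    ground_truth: list[dict] = field(default_factory=list)  # verified tool-call trajectory


def generate_tasks(tool_specs: list[dict], task_llm, n: int) -> list[Task]:
    """Stage 1: a task-generation LLM drafts tool-specific tasks from the
    tool schemas exposed by an MCP server."""
    return [Task(task_llm.draft(tool_specs)) for _ in range(n)]


def verify_tasks(tasks: list[Task], frontier_agent, max_rounds: int = 3) -> list[Task]:
    """Stage 2: a frontier agent executes each task against the real tools,
    refining the task description on failure and keeping the successful
    tool-call trajectory as ground truth."""
    verified = []
    for task in tasks:
        for _ in range(max_rounds):                          # bounded refinement (assumed)
            result = frontier_agent.run(task.instruction)
            if result.success:
                task.ground_truth = result.tool_calls
                verified.append(task)
                break
            task.instruction = frontier_agent.refine(task.instruction, result.error)
    return verified


def evaluate_model(tasks: list[Task], model_client) -> list[dict]:
    """Stage 3: the model under test acts as the MCP client; its tool-call
    trajectory on each verified task is collected for scoring."""
    return [
        {
            "task": task.instruction,
            "predicted": model_client.run(task.instruction).tool_calls,
            "ground_truth": task.ground_truth,
        }
        for task in tasks
    ]
```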

For model evaluation, MCPEval positions the model-under-test as an MCP client, requiring it to complete the set of verified tasks. The evaluation is multi-dimensional:

  • Tool Call Analysis: Rigorously compares the agent’s tool usage (name, parameters, order) against ground truth, reporting both strict and flexible matching scores.
  • LLM Judger Analysis: Employs rubric-based LLM assessment to score planning, execution flow, context awareness, requirement coverage, and output completeness.

This dual-perspective approach enables both granular operational diagnostics and high-level behavioral evaluation, with automated report generation for actionable insights.
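
The tool call analysis can be illustrated with a small scoring sketch. The Python below is a hedged approximation of strict versus flexible matching, assuming each tool call is recorded as a tool name plus a parameter dictionary; MCPEval's exact matching rules may differ.

```python
# Each tool call is assumed to look like
# {"name": "get_stock_price", "params": {"symbol": "CRM"}} (hypothetical tool).

def strict_match(predicted: list[dict], ground_truth: list[dict]) -> float:
    """Strict score: tool names, parameters, and order must all agree with
    the ground-truth trajectory."""
    if not ground_truth:
        return 1.0
    if len(predicted) != len(ground_truth):
        return 0.0
    hits = sum(
        p["name"] == g["name"] and p["params"] == g["params"]
        for p, g in zip(predicted, ground_truth)
    )
    return hits / len(ground_truth)


def flexible_match(predicted: list[dict], ground_truth: list[dict]) -> float:
    """Flexible score: order-insensitive; a ground-truth call is credited if
    some predicted call uses the same tool and covers its parameters."""
    if not ground_truth:
        return 1.0
    remaining = list(predicted)
    hits = 0
    for g in ground_truth:
        for p in remaining:
            if p["name"] == g["name"] and all(
                p["params"].get(k) == v for k, v in g["params"].items()
            ):
                hits += 1
                remaining.remove(p)
                break
    return hits / len(ground_truth)
```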

Empirical Results

MCPEval is evaluated on ten state-of-the-art models (seven OpenAI, three open-source) across five real-world domains: Healthcare, Airbnb, Sports, National Parks, and Finance. The evaluation encompasses over 10,000 individual tasks and 50 model-domain combinations, representing one of the most comprehensive studies of LLM agent tool-use to date.

Key findings include:

  • Performance Hierarchy: OpenAI models (notably GPT-4o, GPT-4o-mini, GPT-4.1-mini, O3) consistently outperform open-source models in both tool call precision and LLM-judged output quality. However, smaller models (e.g., o4-mini) can match or exceed larger open models in certain domains, indicating opportunities for cost-effective deployment.
  • Execution-Completion Gap: All models exhibit a consistent gap between trajectory execution (planning, tool use) and completion quality (output synthesis, usefulness). This gap is more pronounced in open-source models and in domains with complex or less standardized APIs.
  • Domain Sensitivity: Healthcare and Finance domains yield the highest performance, attributed to standardized APIs and structured data. National Parks and Airbnb present greater challenges, with lower name/parameter match scores and larger execution-completion gaps.
  • Tool-Use Patterns: Parameter specification is the most common failure mode, and multi-tool coordination remains a significant challenge. Flexible matching criteria reveal that models often approximate correct tool usage but lack precision.
  • Evaluation Reliability: Strong correlations between tool call metrics and LLM-judged scores validate the framework’s methodology, while aspect-level analysis pinpoints strategic strengths (planning, adaptability) and operational weaknesses (tool usage, output completeness).

Implications and Limitations

MCPEval’s automated, protocol-driven approach enables scalable, reproducible, and fine-grained evaluation of agentic LLMs, facilitating rapid iteration and deployment in production environments. The framework’s ability to generate high-quality, verified trajectories also supports continual model improvement via fine-tuning.

Practical implications include:

  • Standardization: MCP-based evaluation ensures comparability across models and domains, supporting robust benchmarking and model selection for real-world applications.
  • Actionable Diagnostics: Fine-grained metrics and automated reports provide developers with targeted feedback for model optimization, particularly in tool-use and output synthesis.
  • Open-Source Availability: The public release of MCPEval lowers the barrier for reproducible research and industry adoption, accelerating progress in agentic AI.

Limitations:

  • The reliance on synthetic data and LLM-based judges may not fully capture the complexity of real-world interactions and can introduce biases into ground truth generation.
  • Computational costs for long trajectory evaluation and LLM-based judging can be significant, potentially limiting scalability for large-scale or long-horizon tasks.
  • The evaluation is sensitive to the reference model used for ground truth, which may penalize models with alternative but valid tool-use strategies.

Theoretical and Future Directions

MCPEval advances the state of agent evaluation by operationalizing protocol-level assessment and automating the end-to-end evaluation pipeline. The observed execution-completion gap highlights a fundamental limitation in current LLM architectures: strong procedural reasoning does not guarantee high-quality output synthesis. This suggests a need for architectural and training innovations targeting output generation, multi-tool coordination, and parameter specification.

Future research directions include:

  • Incorporating real-world, user-generated task data to complement synthetic benchmarks.
  • Developing more efficient, cost-effective judging mechanisms, potentially leveraging ensemble or hybrid human-LLM evaluation.
  • Enhancing verification strategies to reduce bias and improve the reliability of ground truth labels, possibly through cross-validation with multiple sources.
  • Extending the framework to support multi-agent and multi-modal evaluation scenarios.

Conclusion

MCPEval represents a significant step toward standardized, scalable, and actionable evaluation of LLM-based agents. By leveraging MCP and automating the full evaluation lifecycle, it provides the research and practitioner community with a robust tool for diagnosing, benchmarking, and improving agentic AI systems. The framework’s insights into model and domain-specific performance, as well as its identification of persistent architectural gaps, will inform both the development of next-generation LLM agents and the design of future evaluation methodologies.

Explain it Like I'm 14

MCPEval: A simple explanation

What is this paper about?

This paper introduces MCPEval, a tool that automatically tests how well AI “agents” (smart programs powered by LLMs) use software tools to complete tasks. It’s built on MCP (Model Context Protocol), a common “language” that helps AIs and tools talk to each other. The goal is to fairly, quickly, and consistently check how different AI agents perform in realistic, tool-using situations—without lots of human work.
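
As a rough picture of what that shared “language” looks like, the tiny snippet below shows the kind of self-description a tool server might advertise for one of its tools. The tool name and fields are made up for illustration; they are not taken from the paper or from the official MCP schema.

```python
# Toy illustration only: "get_stock_price" is a made-up Finance-domain tool.
# The point is that a tool server describes its tools in a standard way, so
# any agent speaking the protocol knows what the tool does and how to call it.
stock_price_tool = {
    "name": "get_stock_price",
    "description": "Return the latest trading price for a stock ticker.",
    "parameters": {
        "symbol": {"type": "string", "description": "Ticker symbol, e.g. 'CRM'"},
    },
}
```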

What questions does the paper try to answer?

The paper focuses on three big questions explained simply:

  • Can we build an automatic, end-to-end system that creates, checks, and runs tests for AI agents using real tools?
  • How can we measure not just whether an AI finished a task, but how well it planned, chose tools, and followed steps?
  • Do different AI models perform better in different kinds of jobs (domains), and can smaller, cheaper models sometimes keep up with bigger ones?

How does MCPEval work?

Think of this like a driving test for AI agents, but instead of cars, they’re using digital tools (like finance APIs, search tools, or booking systems).

MCPEval runs in three main stages:

  1. Task generation (making the tests)
  • The system asks an MCP “tool server” what tools are available and what they do.
  • A “Task-LLM” creates tasks that require using those tools (for example: “Look up a company’s stock price, then summarize it”).
  • Analogy: It’s like reading the instructions for a toolbox and writing practice challenges that use those tools.
  2. Task verification (checking the tests are fair and doable)
  • A strong “frontier” AI agent tries the tasks with real tool calls.
  • If a task is missing info (like a date or ID), the system edits and improves the task until it works.
  • Successful runs are recorded as “ground truth” routes (the correct way to solve it).
  • Analogy: Before giving students a test, the teacher solves each problem themselves to make sure it’s solvable and to save an answer key.
  3. Model evaluation (running the driving test)
  • Each AI model is placed in the driver’s seat as the MCP client and asked to solve the verified tasks.
  • MCPEval scores performance in two complementary ways:
    • Tool Call Matching: Did the model use the right tools, with the right inputs, in a sensible order? There are two versions:
      • Strict: “Did it press the exact same buttons in the same order as the answer key?”
      • Flexible: “Did it still get the job done even if it pressed slightly different buttons?”
    • LLM Judging: A strong AI judge reviews:
      • Trajectory (how good was the planning and step-by-step reasoning?)
      • Completion (did the final answer meet the user’s needs?)
  • The system then automatically creates clear reports showing strengths and weaknesses.
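
To give a flavor of the “LLM Judging” step described above, here is a small, hypothetical sketch of rubric-based judging. The aspect names follow the paper’s rubric, but the prompt wording and the judge_llm interface are assumptions for illustration only.

```python
import json

RUBRIC_ASPECTS = [
    "planning", "execution_flow", "context_awareness",
    "requirement_coverage", "output_completeness",
]


def judge(judge_llm, task: str, trajectory: str, final_answer: str) -> dict:
    """Ask a strong judge model to score each rubric aspect from 0 to 1 and
    return the parsed scores plus their average. judge_llm.complete() is a
    hypothetical text-in/text-out interface, not part of MCPEval."""
    prompt = (
        "You are grading an AI agent.\n"
        f"Task: {task}\n"
        f"Tool-call trajectory: {trajectory}\n"
        f"Final answer: {final_answer}\n"
        f"Score each of these aspects from 0 to 1: {', '.join(RUBRIC_ASPECTS)}. "
        "Reply with a JSON object mapping each aspect name to its score."
    )
    scores = json.loads(judge_llm.complete(prompt))
    scores["average"] = sum(scores[a] for a in RUBRIC_ASPECTS) / len(RUBRIC_ASPECTS)
    return scores
```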

MCPEval was tested across five real-world domains:

  • Finance (stocks and market data)
  • Airbnb (property info and booking details)
  • Healthcare (medical info and literature search)
  • Sports (teams, players, schedules)
  • National Parks (park info and facilities)

What did the researchers find, and why does it matter?

Here are the main takeaways in plain language:

  • Bigger isn’t always better: Top models from OpenAI (like GPT-4 variants) generally scored best, but in some cases, smaller models did surprisingly well—especially when using tools effectively. That means you might not always need the most expensive model for good performance.
  • Two kinds of “good”: Many models were better at following steps (trajectory) than writing great final summaries (completion). In other words, they could “do the work” but sometimes struggled to “wrap it up” perfectly for the user.
  • Different jobs, different challenges: Some domains (like Healthcare) were easier because the data and tools are clean and structured. Others (like National Parks) were harder due to messy or complex data.
  • Style vs substance: One model (O3) didn’t always match the “answer key” style of tool usage but still produced excellent final results. That shows that different (but valid) solution paths can work—and it’s important to measure both process and outcome.
  • Better testing, faster progress: Because MCPEval automatically creates and verifies tasks, researchers can quickly evaluate new tools and models, and even reuse the recorded “good runs” to improve future agents.

What’s the impact of this work?

  • For developers and companies: MCPEval helps choose the right model for the job, spot weak points (like planning or tool use), and save costs by identifying when a smaller model is good enough.
  • For researchers: It offers a standardized, open-source way to compare AI agents fairly and reproducibly.
  • For safety and reliability: It makes it easier to thoroughly test agents before deploying them in real products.
  • For the community: The framework is open-source, encouraging shared progress and better benchmarks.

A quick word on limitations

  • The tests use synthetic (made-up but realistic) tasks, which may not capture all real-world messiness.
  • Using AI judges for long interactions can be costly.
  • The “answer key” is based on a particular model’s tool-usage style, which can introduce bias if another model solves the task differently but correctly.

Overall, MCPEval is like a smart, automated “exam system” for AI agents that use tools. It doesn’t just check if an agent finished a task—it also watches how it thinks and acts along the way. This helps everyone build better, safer, and more efficient AI.
