MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
Abstract: The rapid rise of LLM-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.
MCPEval: A simple explanation
What is this paper about?
This paper introduces MCPEval, a tool that automatically tests how well AI “agents” (smart programs powered by large language models, or LLMs) use software tools to complete tasks. It’s built on MCP (Model Context Protocol), a common “language” that helps AIs and tools talk to each other. The goal is to fairly, quickly, and consistently check how different AI agents perform in realistic, tool-using situations—without lots of human work.
What questions does the paper try to answer?
The paper focuses on three big questions explained simply:
- Can we build an automatic, end-to-end system that creates, checks, and runs tests for AI agents using real tools?
- How can we measure not just whether an AI finished a task, but how well it planned, chose tools, and followed steps?
- Do different AI models perform better in different kinds of jobs (domains), and can smaller, cheaper models sometimes keep up with bigger ones?
How does MCPEval work?
Think of this like a driving test for AI agents, but instead of cars, they’re using digital tools (like finance APIs, search tools, or booking systems).
MCPEval runs in three main stages:
- Task generation (making the tests)
  - The system asks an MCP “tool server” what tools are available and what they do (a minimal sketch of this step follows the list).
  - A “Task-LLM” creates tasks that require using those tools (for example: “Look up a company’s stock price, then summarize it”).
  - Analogy: it’s like reading the instructions for a toolbox and writing practice challenges that use those tools.
- Task verification (checking that the tests are fair and doable)
  - A strong “frontier” AI agent tries each task with real tool calls.
  - If a task is missing information (like a date or an ID), the system edits and improves the task until it works.
  - Successful runs are recorded as “ground truth” trajectories (the correct way to solve the task).
  - Analogy: before giving students a test, the teacher solves each problem themselves to make sure it’s solvable and to save an answer key.
- Model evaluation (running the driving test)
  - Each AI model is placed in the driver’s seat as the MCP client and asked to solve the verified tasks.
  - MCPEval scores performance in two complementary ways (illustrated in the second sketch after this list):
    - Tool Call Matching: did the model use the right tools, with the right inputs, in a sensible order? There are two versions:
      - Strict: “Did it press the exact same buttons in the same order as the answer key?”
      - Flexible: “Did it still get the job done even if it pressed slightly different buttons?”
    - LLM Judging: a strong AI judge reviews:
      - Trajectory: how good was the planning and step-by-step reasoning?
      - Completion: did the final answer meet the user’s needs?
  - The system then automatically creates clear reports showing strengths and weaknesses.
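To make the first stage concrete, here is a minimal sketch of tool discovery over MCP, written against the official MCP Python SDK’s stdio client. The server command (`python server.py`) is a placeholder, and MCPEval’s own pipeline code may look different.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def list_server_tools() -> None:
    # Launch a local MCP tool server over stdio; "server.py" is a placeholder.
    params = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # The server describes each tool: its name, what it does, and a
            # JSON schema for its inputs. A Task-LLM reads these descriptions
            # to write tasks that exercise the tools.
            result = await session.list_tools()
            for tool in result.tools:
                print(f"{tool.name}: {tool.description}")


asyncio.run(list_server_tools())
```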
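And here is a toy illustration of strict versus flexible tool-call matching, not MCPEval’s actual scoring code. The helper functions and the finance tool names (`search_symbol`, `get_quote`, `get_news`) are hypothetical, and the paper’s exact metric definitions may differ.

```python
# Toy ground-truth trajectory: look up a ticker, then fetch its price.
ground_truth = [
    ("search_symbol", {"query": "Salesforce"}),
    ("get_quote", {"symbol": "CRM"}),
]

# The model under test swapped the order and made one extra call.
model_calls = [
    ("get_quote", {"symbol": "CRM"}),
    ("get_news", {"symbol": "CRM"}),
    ("search_symbol", {"query": "Salesforce"}),
]


def strict_match(calls, reference):
    """Same tools, same arguments, same order: the exact same buttons."""
    return calls == reference


def flexible_match(calls, reference):
    """Every reference call appears somewhere, even if reordered or padded
    with extra calls: the job still got done."""
    return all(step in calls for step in reference)


print(strict_match(model_calls, ground_truth))    # False (order differs, extra call)
print(flexible_match(model_calls, ground_truth))  # True  (same essential calls)
```

This gap between the two checks is exactly what lets a model score poorly on strict matching while still reaching a correct final answer by a different route, as the o3 finding below illustrates.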
MCPEval was tested across five real-world domains:
- Finance (stocks and market data)
- Airbnb (property info and booking details)
- Healthcare (medical info and literature search)
- Sports (teams, players, schedules)
- National Parks (park info and facilities)
What did the researchers find, and why does it matter?
Here are the main takeaways in plain language:
- Bigger isn’t always better: Top models from OpenAI (like GPT-4 variants) generally scored best, but in some cases, smaller models did surprisingly well—especially when using tools effectively. That means you might not always need the most expensive model for good performance.
- Two kinds of “good”: Many models were better at following steps (trajectory) than writing great final summaries (completion). In other words, they could “do the work” but sometimes struggled to “wrap it up” perfectly for the user.
- Different jobs, different challenges: Some domains (like Healthcare) were easier because the data and tools are clean and structured. Others (like National Parks) were harder due to messy or complex data.
- Style vs. substance: one model (o3) didn’t always match the “answer key” style of tool usage but still produced excellent final results. That shows that different (but valid) solution paths can work, and it’s important to measure both process and outcome.
- Better testing, faster progress: Because MCPEval automatically creates and verifies tasks, researchers can quickly evaluate new tools and models, and even reuse the recorded “good runs” to improve future agents.
What’s the impact of this work?
- For developers and companies: MCPEval helps choose the right model for the job, spot weak points (like planning or tool use), and save costs by identifying when a smaller model is good enough.
- For researchers: It offers a standardized, open-source way to compare AI agents fairly and reproducibly.
- For safety and reliability: It makes it easier to thoroughly test agents before deploying them in real products.
- For the community: The framework is open-source, encouraging shared progress and better benchmarks.
A quick word on limitations
- The tests use synthetic (made-up but realistic) tasks, which may not capture all real-world messiness.
- Using AI judges for long interactions can be costly.
- The “answer key” is based on a particular model’s tool-usage style, which can introduce bias if another model solves the task differently but correctly.
Overall, MCPEval is like a smart, automated “exam system” for AI agents that use tools. It doesn’t just check if an agent finished a task—it also watches how it thinks and acts along the way. This helps everyone build better, safer, and more efficient AI.