
MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models (2507.12806v1)

Published 17 Jul 2025 in cs.AI and cs.CL

Abstract: The rapid rise of LLM-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.

Summary

  • The paper introduces MCPEval, an MCP-based framework that standardizes and automates deep evaluation of LLM-based agent models using a three-stage workflow.
  • The methodology employs LLM-driven task generation, iterative verification, and multi-dimensional evaluation, integrating tool call analysis with rubric-based LLM judgments.
  • The evaluation of ten models across five domains reveals performance hierarchies and execution-completion gaps, and validates the framework’s diagnostic precision.

MCPEval: An Automated, MCP-Based Deep Evaluation Framework for AI Agent Models

The paper introduces MCPEval, an open-source, fully automated evaluation framework for LLM-based agent models, leveraging the Model Context Protocol (MCP) to standardize and scale the assessment of agentic capabilities across diverse domains. MCPEval addresses critical limitations in existing evaluation methodologies, particularly the reliance on static benchmarks, manual data collection, and the lack of deep integration with real-world tools and protocols.

Framework Design and Methodology

MCPEval is architected around a three-stage workflow: task generation, task verification, and model evaluation. The system utilizes MCP as the backbone for agent-environment communication, enabling seamless integration with native agent tools and external systems. The task generation process is LLM-driven, producing detailed, tool-specific instructions based on the capabilities exposed by MCP servers. Recognizing the inherent incompleteness of initial LLM-generated tasks, MCPEval incorporates an iterative verification phase, where a frontier agent executes the tasks, refines them upon failure, and establishes high-quality ground truth trajectories.
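
To make the verification loop concrete, here is a minimal sketch of the generate-verify-refine cycle described above. The Task dataclass and the execute and refine callables are illustrative stand-ins (an MCP client session and LLM calls would sit behind them); MCPEval's actual interfaces may differ.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    instruction: str                                        # natural-language task for the agent
    ground_truth: list[dict] = field(default_factory=list)  # verified tool-call trajectory

def verify_tasks(
    tasks: list[Task],
    execute: Callable[[Task], tuple[list[dict], bool]],  # runs the task via an MCP client
    refine: Callable[[Task, list[dict]], Task],          # asks an LLM to repair a failing task
    max_rounds: int = 3,
) -> list[Task]:
    """Keep only tasks the frontier agent completes; record its trajectory as ground truth."""
    verified = []
    for task in tasks:
        for _ in range(max_rounds):
            trajectory, success = execute(task)
            if success:
                task.ground_truth = trajectory
                verified.append(task)
                break
            task = refine(task, trajectory)  # repair the task and try again
    return verified
```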

For model evaluation, MCPEval positions the model-under-test as an MCP client, requiring it to complete the set of verified tasks. The evaluation is multi-dimensional:

  • Tool Call Analysis: Rigorously compares the agent’s tool usage (name, parameters, order) against ground truth, reporting both strict and flexible matching scores (a minimal matching sketch follows this list).
  • LLM Judger Analysis: Employs rubric-based LLM assessment to score planning, execution flow, context awareness, requirement coverage, and output completeness.
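
The distinction between strict and flexible matching can be illustrated with a small sketch. It assumes each tool call is represented as a dict with "name" and "params" keys; MCPEval's actual call representation and scoring rules may differ.

```python
def strict_match(pred: list[dict], gold: list[dict]) -> float:
    """Exact tool name and parameters, in the same order as the ground truth."""
    hits = sum(
        p["name"] == g["name"] and p["params"] == g["params"]
        for p, g in zip(pred, gold)
    )
    return hits / max(len(gold), 1)

def flexible_match(pred: list[dict], gold: list[dict]) -> float:
    """Order-insensitive tool-name matching, ignoring parameter values."""
    remaining = [g["name"] for g in gold]
    hits = 0
    for p in pred:
        if p["name"] in remaining:
            hits += 1
            remaining.remove(p["name"])
    return hits / max(len(gold), 1)
```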

This dual-perspective approach enables both granular operational diagnostics and high-level behavioral evaluation, with automated report generation for actionable insights.
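
A complementary sketch of the rubric-based judging step: a judge LLM is prompted to score the five aspects listed above on a 1-5 scale and return JSON. The prompt wording and the judge callable (which would wrap whatever LLM endpoint is used) are illustrative assumptions, not MCPEval's exact implementation.

```python
import json
from typing import Callable

RUBRIC_ASPECTS = [
    "planning", "execution_flow", "context_awareness",
    "requirement_coverage", "output_completeness",
]

def judge_trajectory(task: str, trajectory: str, final_answer: str,
                     judge: Callable[[str], str]) -> dict[str, int]:
    """Ask a judge LLM for 1-5 scores on each rubric aspect and parse its JSON reply."""
    prompt = (
        "You are evaluating an AI agent's performance on a task.\n"
        f"Task: {task}\n"
        f"Tool-call trajectory: {trajectory}\n"
        f"Final answer: {final_answer}\n"
        f"Score each aspect from 1 (poor) to 5 (excellent): {', '.join(RUBRIC_ASPECTS)}.\n"
        'Respond with JSON only, e.g. {"planning": 4, "execution_flow": 3, ...}.'
    )
    return json.loads(judge(prompt))
```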

Empirical Results

MCPEval is evaluated on ten state-of-the-art models (seven OpenAI, three open-source) across five real-world domains: Healthcare, Airbnb, Sports, National Parks, and Finance. The evaluation encompasses over 10,000 individual tasks and 50 model-domain combinations, representing one of the most comprehensive studies of LLM agent tool-use to date.

Key findings include:

  • Performance Hierarchy: OpenAI models (notably GPT-4o, GPT-4o-mini, GPT-4.1-mini, o3) consistently outperform open-source models in both tool call precision and LLM-judged output quality. However, smaller models (e.g., o4-mini) can match or exceed larger open models in certain domains, indicating opportunities for cost-effective deployment.
  • Execution-Completion Gap: All models exhibit a consistent gap between trajectory execution (planning, tool use) and completion quality (output synthesis, usefulness). This gap is more pronounced in open-source models and in domains with complex or less standardized APIs.
  • Domain Sensitivity: Healthcare and Finance domains yield the highest performance, attributed to standardized APIs and structured data. National Parks and Airbnb present greater challenges, with lower name/parameter match scores and larger execution-completion gaps.
  • Tool-Use Patterns: Parameter specification is the most common failure mode, and multi-tool coordination remains a significant challenge. Flexible matching criteria reveal that models often approximate correct tool usage but lack precision.
  • Evaluation Reliability: Strong correlations between tool call metrics and LLM-judged scores validate the framework’s methodology, while aspect-level analysis pinpoints strategic strengths (planning, adaptability) and operational weaknesses (tool usage, output completeness); a toy correlation check is sketched after this list.
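
As a toy illustration of that reliability check, one can correlate per-run tool-call scores with mean rubric scores. The numbers below are placeholders, not the paper's data.

```python
import numpy as np

# Placeholder per-run scores (one entry per model-domain run); not the paper's data.
tool_call_scores = np.array([0.82, 0.65, 0.91, 0.48, 0.73])  # e.g., strict-match rate
judge_scores = np.array([4.1, 3.4, 4.5, 2.9, 3.8])           # e.g., mean rubric score (1-5)

r = np.corrcoef(tool_call_scores, judge_scores)[0, 1]
print(f"Pearson r between tool-call and judge scores: {r:.2f}")
```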

Implications and Limitations

MCPEval’s automated, protocol-driven approach enables scalable, reproducible, and fine-grained evaluation of agentic LLMs, facilitating rapid iteration and deployment in production environments. The framework’s ability to generate high-quality, verified trajectories also supports continual model improvement via fine-tuning.
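
As one hypothetical way to reuse those verified trajectories for fine-tuning, each task, its ground-truth tool calls, and the final answer can be flattened into a chat-style training record. The message schema below is a generic illustration; MCPEval's actual export format is not specified here.

```python
import json

def trajectory_to_sft_example(instruction: str, tool_calls: list[dict], answer: str) -> dict:
    """Flatten one verified trajectory into a chat-style supervised fine-tuning record."""
    messages = [{"role": "user", "content": instruction}]
    for call in tool_calls:
        # Each ground-truth call (and its result, if recorded) becomes a turn pair.
        messages.append({"role": "assistant",
                         "content": f"<tool_call>{json.dumps(call)}</tool_call>"})
        messages.append({"role": "tool",
                         "content": json.dumps(call.get("result", {}))})
    messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}
```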

Practical implications include:

  • Standardization: MCP-based evaluation ensures comparability across models and domains, supporting robust benchmarking and model selection for real-world applications.
  • Actionable Diagnostics: Fine-grained metrics and automated reports provide developers with targeted feedback for model optimization, particularly in tool-use and output synthesis.
  • Open-Source Availability: The public release of MCPEval lowers the barrier for reproducible research and industry adoption, accelerating progress in agentic AI.

Limitations:

  • The reliance on synthetic data and LLM-based judges may not fully capture the complexity of real-world interactions and may introduce biases into ground truth generation.
  • Computational costs for long trajectory evaluation and LLM-based judging can be significant, potentially limiting scalability for large-scale or long-horizon tasks.
  • The evaluation is sensitive to the reference model used for ground truth, which may penalize models with alternative but valid tool-use strategies.

Theoretical and Future Directions

MCPEval advances the state of agent evaluation by operationalizing protocol-level assessment and automating the end-to-end evaluation pipeline. The observed execution-completion gap highlights a fundamental limitation in current LLM architectures: strong procedural reasoning does not guarantee high-quality output synthesis. This suggests a need for architectural and training innovations targeting output generation, multi-tool coordination, and parameter specification.

Future research directions include:

  • Incorporating real-world, user-generated task data to complement synthetic benchmarks.
  • Developing more efficient, cost-effective judging mechanisms, potentially leveraging ensemble or hybrid human-LLM evaluation.
  • Enhancing verification strategies to reduce bias and improve the reliability of ground truth labels, possibly through cross-validation with multiple sources.
  • Extending the framework to support multi-agent and multi-modal evaluation scenarios.

Conclusion

MCPEval represents a significant step toward standardized, scalable, and actionable evaluation of LLM-based agents. By leveraging MCP and automating the full evaluation lifecycle, it provides the research and practitioner community with a robust tool for diagnosing, benchmarking, and improving agentic AI systems. The framework’s insights into model and domain-specific performance, as well as its identification of persistent architectural gaps, will inform both the development of next-generation LLM agents and the design of future evaluation methodologies.
