MCP-SafetyBench Safety Evaluation

Updated 24 December 2025

MCP-SafetyBench is a comprehensive suite that unifies evaluation frameworks, datasets, and methodologies to assess safety in LLMs operating via the Model Context Protocol.
It employs detailed attack taxonomies with 20 distinct vectors across server, host, and user layers to systematically quantify vulnerabilities in agentic workflows.
Empirical results reveal significant safety challenges, with state-of-the-art LLMs exhibiting up to 50% attack success rates, emphasizing the need for robust, multi-layered defenses.

The Model Context Protocol Safety Bench (MCP-SafetyBench) is a comprehensive suite of evaluation frameworks, datasets, and validation methodologies developed to systematically assess and improve the safety of LLMs operating agentically—particularly those interacting with external tools and services via the Model Context Protocol (MCP). MCP-SafetyBench enables benchmarking of LLM robustness against a wide spectrum of real-world attacks and protocol violations, leveraging realistic multi-turn workflows, thoroughly instrumented environments, and execution-based risk measurement. Through standardization of attack taxonomies, empirical metrics, and defensive best practices, MCP-SafetyBench has become foundational for diagnosing, quantifying, and mitigating the unique safety challenges posed by open, multi-server agentic LLM deployments (Zong et al., 17 Dec 2025, Fang et al., 16 Jun 2025, Radosevich et al., 2 Apr 2025, Tiwari et al., 26 Sep 2025, Zhang et al., 14 Oct 2025, Fu et al., 8 Nov 2025, Halloran, 29 May 2025).

1. Foundations: MCP Architecture and Threat Surface

MCP is a JSON-RPC-based protocol (over STDIO/SSE) that decouples LLM “Hosts” (agents) from heterogeneous “Servers” offering tools, data, and services. The protocol specifies discovery, invocation, and interpretation mechanisms for tool manifests, schemas, and outputs, thus enabling modular multi-step tool use without bespoke integration (Zong et al., 17 Dec 2025, Radosevich et al., 2 Apr 2025). In this architecture, each agent can dynamically reason about, select, and execute arbitrary tools or workflows spanning one to many external servers. MCP’s openness—critical for composability—substantially enlarges the attack surface, as both tool metadata and tool outputs are structured, natural-language-coupled, and interleaved with user or environmental contexts.

The standard threat model incorporates attacks originating from three analytical layers:

Server-side (tool configuration, API schemas, output manipulation)
Host-side (agent reasoning, intent injection, context tampering)
User-side (malicious queries, privilege escalation, credential phish)

This multi-layered exposure makes pure prompt-based defense insufficient; evaluation must span tool invocation, result handling, and chained multi-agent workflows.

2. Unified Taxonomies and Attack Coverage

MCP-SafetyBench formalizes safety risk using attack taxonomies with fine-grained distinctions by protocol layer and vector. The principal taxonomy instantiated in (Zong et al., 17 Dec 2025) comprises 20 atomic attacks, partitioned thus:

Class	Index Range	Sample Types (non-exhaustive)
Server-side	A₁–A₁₁	Param Poisoning, Cmd Injection, Tool Redirection,
		FS Poisoning, Network Request Poisoning,
		Function Dependency Injection, Function Overlap,
		Preference Manipulation, Tool Shadowing,
		Function Return Injection, Version Drift (Rug Pull)
Host-side	A₁₂–A₁₅	Intent Injection, Data Tampering,
		Identity Spoofing, Replay Injection
User-side	A₁₆–A₂₀	Malicious Code Execution, Credential Theft,
		Remote Access Control, Retrieval-Agent Deception,
		Excessive Privileges Misuse

Other benchmarks such as MSB (Zhang et al., 14 Oct 2025) and MCPSecBench (not fully detailed here) define complementary taxonomies (e.g., 12–17 attack vectors) that span planning, invocation, response, retrieval, and mixed chains: name-collision, preference manipulation, prompt injection within descriptions, out-of-scope parameter abuse, user-impersonating responses, false errors, tool-transfer, and retrieval injection.

These taxonomies provide rigorous test coverage, including direct prompt attacks, retrieval-agent deception (RADE/TRADE), protocol misuse, and chained (mixed) exploits.

3. Benchmark Structure, Methodology, and Measurement

A hallmark of MCP-SafetyBench is multi-turn, cross-server, execution-based evaluation on real MCP servers in realistic domains (Zong et al., 17 Dec 2025, Radosevich et al., 2 Apr 2025, Fang et al., 16 Jun 2025). Each instance is formalized as a 4-tuple: $\tau = (G, C, T_\text{available}, A)$ where $G$ is the user/task goal, $C$ is contextual prompt, $T_\text{available}$ is the toolset, and $A$ is the injected attack. Task design mandates that agents must interleave free-form reasoning, context updates, tool calling, result parsing, and cross-server coordination, frequently under conditions of environmental uncertainty. Attacks may be injected at any protocol stage, often dynamically.

Evaluation metrics are standardized across the literature:

Task Success Rate (TSR): Fraction of user goals successfully completed.
Attack Success Rate (ASR): Fraction of attack vectors that succeeded (i.e., agent violated a constraint or was subverted).
Defense Success Rate: $1 - \mathrm{ASR}$
Vulnerability Score $V$ : Identical to ASR.
Net Resilient Performance (NRP, (Zhang et al., 14 Oct 2025)): $NRP = PUA \times (1 - ASR)$ , where PUA is task performance under attack.

Defensive evaluation incorporates per-domain, per-attack-type, per-model, and side-specific analysis. Stealth/disruption detectors, output schemas, and attack-specific triggers are leveraged for ground-truth and automatic correctness validation.

4. Empirical Findings and Security Scaling Laws

MCP-SafetyBench has revealed robust, reproducible findings across multiple LLM and agent architectures (Zong et al., 17 Dec 2025, Zhang et al., 14 Oct 2025, Fang et al., 16 Jun 2025):

No model is immune: State-of-the-art closed and open-source LLMs still exhibit overall ASRs from ~30% to near 50% across real-world MCP attacks.
Safety–Utility trade-off: Higher task completion is inversely correlated with robustness (e.g., $r=-0.57$ , $p=0.041$ ). Stronger instruction-following and tool-use capabilities increase susceptibility.
Attack efficacy by class: Host-side attacks (e.g., Intent Injection, Identity Spoofing) achieve >80% success on average. Server- and user-side attacks are also consistently effective, with extremes observed (e.g., Tool Redirection ASR 70.6%, Network Request Poisoning only 7.7%).
Marginal benefit of safety prompts: Generalized prompt-based or defensive strategy reduces ASR by $<2\%$ and may worsen performance on semantic attacks.
Domain disparities: Financial analysis scenarios are most vulnerable (ASR $>$ 46%), web search among the safest (ASR $<$ 31%).
Open vs. closed models: No systematic difference detected; both families show “spiky” defense/robustness curves by attack.

These results are corroborated by MSB, which establishes average ASR around 41%, with specific vectors such as Out-of-Scope Parameter attacks exceeding 74% (Zhang et al., 14 Oct 2025), and MCPSafetyScanner pilot deployments (Radosevich et al., 2 Apr 2025).

5. Defensive Methodologies and Validator Frameworks

To mitigate MCP-specific risks, MCP-SafetyBench and related works instantiate multi-layered defenses and validation mechanisms:

Dynamic tool vetting: Real-time checking of tool metadata, call arguments, and JSON schemas, especially at runtime hooks in orchestrator frameworks (Tiwari et al., 26 Sep 2025).
Contextual least privilege: Enforcing contextually minimal tool and parameter scopes per invocation.
Coherence and schema validation: Output and input must pass round-trip (pre- and post-condition) checks; runtime schema misalignments are flagged, measured at $>78\%$ prevalence among vision centric tools (Tiwari et al., 26 Sep 2025).
Risk-tiered and sandboxed workflows: Automatic risk scoring, downgraded execution, or CI/CD-level automated safety regression testing.
Cross-layer monitoring and logging: Aggregation of host side (agent trace), server side (response/content), and system-level logs for detection of chained, multi-modal exploits (Fu et al., 8 Nov 2025).
Integration with McpSafetyScanner: Multi-agent, adversarial sample generation, vulnerability reporting, and remediation search are standardized (Radosevich et al., 2 Apr 2025).

In vision system pipelines, exclusive validators for schema conformance, coordinate convention, mask–image consistency, memory/lifecycle hygiene, privilege escalation, and provenance tracking have been open-sourced, yielding average 33.8 memory-scope warnings per 100 executions and privilege escalation rates of ~41% (Tiwari et al., 26 Sep 2025).

6. Extensibility, Bench Suite Implementation, and Ecosystem Integration

MCP-SafetyBench is built for extensibility and integration into LLM agent/host development cycles:

Integration points: Agent wrappers, plug-in scenario/task managers, modular attack modules, hybrid passive/active defense chains (whitelist, LLM-based sanitizer) (Fang et al., 16 Jun 2025).
Data and code resources: Open-source repository support for pipeline scripts, orchestrator management, and validator suites (e.g., https://github.com/adobe-research/mcp-safetybench, https://github.com/littlelittlenine/SafeMCP.git).
Customizable configuration: YAML or JSON-based scenario suite, defense toggles, number of evaluation trials, and attack vector selection.
Protocol evolution: Recommendations emphasize semantically grounded schemas, visual memory types, dynamic role-based binding, and runtime validator hooks (Tiwari et al., 26 Sep 2025).

Continuous research directions include synthetic evaluation data expansion (including adversarial logs, protocol misuse traces, supply chain compromise), enhanced training methodologies (DPO, SFT, GRPO, RAG-Pref (Halloran, 29 May 2025, Fu et al., 8 Nov 2025)), and cost-sensitive and calibration-adjusted risk metrics.

7. Impact, Recommendations, and Open Challenges

MCP-SafetyBench has established a de facto standard for measuring and improving agentic LLM safety in open, composable tool-use environments. Key recommendations distilled from empirical deployments include:

Run both clean and attack baselines: Quantify RAL, ASR per scenario and per vector; do not trust post-hoc acceptance rates.
Adopt strict refusal and detection procedures: Simple prompt-level refusals are insufficient; worst-case detection/denial should be systematically measured (Halloran, 29 May 2025).
Continuously red-team and retrain: Injection, mixed-chain, and obfuscated attacks remain highly effective—automated red-teaming and adversarial data synthesis are required for robustification.
Embed schema/runtime validation into CI/CD: Prevent protocol drift, privilege escalation, and memory leak introduction prior to deployment.
Enforce role-based binding and schema-type checks: Untyped or privilege-escalating tool connections are a primary failure mode.

Persistent open challenges remain in achieving robust refusal of sophisticated, FBA/TRADE-style exploits, holistic defense across multi-server/multi-agent pipelines, and scalable, low-overhead integration for production LLM agentic systems.

References:

(Zong et al., 17 Dec 2025) MCP-SafetyBench: A Benchmark for Safety Evaluation of LLMs with Real-World MCP Servers
(Fang et al., 16 Jun 2025) We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems
(Radosevich et al., 2 Apr 2025) MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits
(Tiwari et al., 26 Sep 2025) Model Context Protocol for Vision Systems: Audit, Security, and Protocol Extensions
(Zhang et al., 14 Oct 2025) MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents
(Fu et al., 8 Nov 2025) MCP-RiskCue: Can LLM infer risk information from MCP server System Logs?
(Halloran, 29 May 2025) MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment