Mislead Success Rate (MSR) in Adversarial Systems
- Mislead Success Rate (MSR) is a metric that measures the frequency of misleading outputs in interactions across adversarial, deceptive, and operational contexts.
- It is defined mathematically by tracking sessions with deceptive responses, helping assess vulnerability and defense efficacy in models via empirical scoring.
- MSR has practical applications ranging from evaluating ML defense mechanisms in language models to optimizing matching success in real-time systems like ride-hailing.
Mislead Success Rate (MSR) is a quantitative metric capturing the frequency at which a system, typically a model or defensive framework, induces a party—human, attacker, or another model—to accept misinformation, pursue unproductive avenues, or be diverted by deceptive outputs. MSR formalizes the ability to mislead within a defined interaction context, whether as a vulnerability assessment, a measure of defense or deception efficacy, or a predictive success rate in real-world systems. Although its specific instantiation varies, the unifying thread is the empirical tracking of sessions or instances in which misleading influence succeeds.
1. Mathematical Definitions of MSR
In adversarial and deception contexts, MSR is rigorously defined as the proportion of interactions in which a misleading response occurs. For instance, in multi-turn LLM jailbreak defenses (Li et al., 7 Jan 2026), let $\mathcal{D} = \{d_1, \dots, d_N\}$ be the set of multi-turn attack dialogues. For each response $r_{i,j}$ in dialogue $d_i$, assign a harmfulness score $s(r_{i,j}) \in \{1, \dots, 5\}$ (typically $1$=benign, $2$=misleading, $3$--$5$=harmful). Define an indicator
$$\mathbb{1}_{\mathrm{mislead}}(d_i) = \begin{cases} 1 & \text{if } \exists\, j \text{ such that } s(r_{i,j}) = 2, \\ 0 & \text{otherwise.} \end{cases}$$
The MSR is then
$$\mathrm{MSR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{\mathrm{mislead}}(d_i).$$
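A minimal sketch of this session-level computation, assuming per-turn harmfulness scores have already been produced by a judge; function and variable names are illustrative, not from the cited work:

```python
from typing import List

def session_msr(dialogue_scores: List[List[int]]) -> float:
    """Session-level MSR: fraction of dialogues containing at least one
    response judged as misleading (harmfulness score == 2).

    dialogue_scores[i][j] is the judge's harmfulness score (1-5) for the
    j-th response in the i-th multi-turn attack dialogue.
    """
    if not dialogue_scores:
        return 0.0
    misled = sum(1 for scores in dialogue_scores if any(s == 2 for s in scores))
    return misled / len(dialogue_scores)

# Toy example: 3 attack dialogues; the first and third contain a misleading reply.
print(session_msr([[1, 2, 1], [1, 1], [2]]))  # 2/3 ≈ 0.67
```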
In model-on-model deception (Heitkoetter et al., 2024), MSR is the average rate at which an evaluator model $E$ switches from a correct to an incorrect judgment because of explanations injected by a deceiver model $D$. Given a set of question-answer pairs $Q$:
$$\mathrm{MSR}(E, D) = \frac{\big|\{\, q \in Q : E \text{ is correct on } q \text{ without deception and incorrect after } D\text{'s explanation} \,\}\big|}{\big|\{\, q \in Q : E \text{ is correct on } q \text{ without deception} \,\}\big|}.$$
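A corresponding sketch for the model-on-model setting, assuming boolean correctness records for the evaluator before and after the deceiver's explanation is injected; names and data are illustrative:

```python
from typing import List

def deception_msr(before_correct: List[bool], after_correct: List[bool]) -> float:
    """Model-on-model deception MSR: among questions the evaluator answered
    correctly without the deceiver's explanation, the fraction it gets wrong
    once the explanation is injected (correct -> incorrect flips)."""
    assert len(before_correct) == len(after_correct)
    originally_correct = [(b, a) for b, a in zip(before_correct, after_correct) if b]
    if not originally_correct:
        return 0.0
    flips = sum(1 for _, a in originally_correct if not a)
    return flips / len(originally_correct)

# Toy example: 4 of 5 answers correct at baseline, 2 of those flip -> MSR = 0.5.
print(deception_msr([True, True, True, True, False],
                    [True, False, False, True, False]))
```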
Analogous instantiations are used in matching contexts, e.g., ride-hailing systems (Wang et al., 2021), where MSR (here, Matching Success Rate) is defined as the fraction of matches with mutual acceptance:
$$\mathrm{MSR} = \frac{\#\{\text{matched passenger--driver pairs with mutual acceptance}\}}{\#\{\text{all matched pairs}\}}.$$
2. MSR in Adversarial Machine Learning and Deceptive Defense
MSR is foundational in evaluating the efficacy of defense mechanisms against adversarial attacks on LLMs. In HoneyTrap (Li et al., 7 Jan 2026), MSR quantifies how often plausible but unhelpful (i.e., misleading) responses are delivered during multi-turn jailbreak attack sessions. Unlike Attack Success Rate (ASR), which records only whether harmful outputs are elicited or suppressed, MSR captures sessions in which attackers are enticed into fruitless engagement.
High MSR values signify a honeypot effect: the system frequently responds in ways that appear productive but strategically waste attacker effort. In empirical tests, HoneyTrap achieves average MSR increases of +89--118% over baselines (e.g., from 0.24 to 0.53 for GPT-3.5-turbo, and from 0.33 to 0.70 for Gemini-1.5-pro), indicating a major shift toward effective misdirection across sessions.
3. MSR as a Vulnerability Metric in Model-on-Model Deception
In deception risk assessment (Heitkoetter et al., 2024), MSR functions as a vulnerability metric quantifying a model's susceptibility to misleading explanations from other models. Large-scale experiments on the MMLU benchmark demonstrate that both weak and strong models are readily deceived. For instance, GPT-3.5's accuracy drops from ≈0.70 to ≈0.20 under self-generated deceptive prompts, corresponding to an MSR of ≈0.5.
Detailed matrix results reveal consistently high MSR across evaluator-deceiver pairings (0.34--0.64), and evaluator robustness correlates negatively with MSR (negative Pearson correlation). The skill of the deceiving model also correlates inversely with its misleading effectiveness: highly “truth-aligned” models produce explanations that are less frequently accepted as valid by their peers. The matrix below reports MSR by evaluator (rows) and deceiver (columns).
| Evaluator \ Deceiver | L2-7B | L2-13B | L2-70B | GPT-3.5 |
|---|---|---|---|---|
| GPT-3.5 | 0.35 | 0.37 | 0.34 | 0.42 |
| L2-70B | 0.48 | 0.50 | 0.46 | 0.52 |
| L2-13B | 0.54 | 0.55 | 0.52 | 0.57 |
| L2-7B | 0.61 | 0.62 | 0.60 | 0.64 |
These results show that models remain highly susceptible to well-crafted but incorrect explanations, underscoring the need for benchmarked defenses.
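For reference, the evaluator-robustness trend can be made explicit by averaging the matrix rows; the sketch below uses only the values reported in the table above:

```python
import numpy as np

evaluators = ["GPT-3.5", "L2-70B", "L2-13B", "L2-7B"]
# Rows: evaluators; columns: deceivers (L2-7B, L2-13B, L2-70B, GPT-3.5).
msr = np.array([
    [0.35, 0.37, 0.34, 0.42],
    [0.48, 0.50, 0.46, 0.52],
    [0.54, 0.55, 0.52, 0.57],
    [0.61, 0.62, 0.60, 0.64],
])

# Mean MSR per evaluator: stronger evaluators are misled less often.
for name, row_mean in zip(evaluators, msr.mean(axis=1)):
    print(f"{name:8s} evaluator: mean MSR = {row_mean:.2f}")
```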
4. MSR in Real-Time Systems: Ride-Hailing Match Prediction
The term MSR also appears as Matching Success Rate in operational domains such as real-time ride-hailing (Wang et al., 2021). Here, MSR measures the empirical rate at which matched passenger-driver pairs mutually accept (vs. cancel) orders. Predictive modeling (via the Multi-View architecture and knowledge-distillation frameworks) aims to maximize MSR in assignment algorithms, benefiting platform efficiency and user satisfaction.
High MSR signals effective matching and reduced cancellations; prediction reliability is validated experimentally by improved AUC and reduced RMSE relative to alternative architectures. Limitations emerge under extreme data scarcity, which motivates model simplification via knowledge distillation for deployment efficiency.
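A minimal sketch of how such a predictor's reliability might be checked, assuming binary mutual-acceptance labels and predicted acceptance probabilities are available; the labels, probabilities, and metric choices below are illustrative, mirroring the AUC/RMSE validation mentioned above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error

# Hypothetical evaluation data: 1 = both passenger and driver accepted the match.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.8, 0.4, 0.6, 0.3, 0.85])  # predicted acceptance probability

empirical_msr = y_true.mean()                       # observed matching success rate
auc = roc_auc_score(y_true, y_prob)                 # ranking quality of the predictor
rmse = np.sqrt(mean_squared_error(y_true, y_prob))  # magnitude of prediction error

print(f"MSR = {empirical_msr:.2f}, AUC = {auc:.2f}, RMSE = {rmse:.2f}")
```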
5. Computation Pipeline and Annotation Requirements
Robust MSR measurement requires explicit pipelines and annotation. In adversarial settings (Li et al., 7 Jan 2026), each attack session is logged, responses are scored by an automated judge (e.g., GPT-Judge or ensemble), and sessions are flagged if any reply qualifies as misleading (score=2). In deception detection (Heitkoetter et al., 2024), evaluator outputs are tracked for “flip” events after deception—only those originally correct and later accepting an incorrect inference contribute to MSR.
For matching contexts (Wang et al., 2021), MSR calculation is straightforward given binary acceptance indicators per match. Annotation reliability and judge calibration are critical; mis-scoring or threshold errors can bias aggregate MSR.
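Since mis-scoring can bias the aggregate, a basic inter-judge agreement check is a reasonable sanity step before comparing MSR across systems. A minimal sketch using Cohen's kappa on illustrative per-response "misleading" flags (this check is a suggestion, not part of the cited pipelines):

```python
from sklearn.metrics import cohen_kappa_score

# Per-response "misleading" flags (score == 2) from two independent judges.
judge_a = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
judge_b = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

kappa = cohen_kappa_score(judge_a, judge_b)
agreement = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
print(f"raw agreement = {agreement:.2f}, Cohen's kappa = {kappa:.2f}")
# Low kappa suggests the "misleading" criterion needs recalibration before
# aggregate MSR values are compared across systems.
```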
6. Limitations and Proposed Defenses
The MSR framework possesses inherent caveats:
- Judge calibration dependency: Inconsistent labeling or ambiguous criteria for what constitutes “misleading” (score=2) can skew metrics (Li et al., 7 Jan 2026).
- Session-level granularity: Counting any session with a single misleading reply as a full success may exaggerate influence; finer-grained, per-turn metrics (sketched after this list) would capture nuances.
- No per-turn weighting: A session with multiple diversions counts equivalently to one with only a trivial diversion.
- Threshold choice: Real-world misleadingness may span a spectrum; a binary threshold (e.g., exactly score=2) is a simplification.
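One way to expose the granularity and weighting issues is to report a per-turn variant alongside the session-level metric. The sketch below, reusing the per-turn harmfulness scores from Section 1, is an illustrative alternative rather than part of the cited definitions:

```python
from typing import List

def per_turn_msr(dialogue_scores: List[List[int]]) -> float:
    """Fraction of individual responses judged misleading (score == 2),
    pooled over all turns of all dialogues."""
    all_scores = [s for scores in dialogue_scores for s in scores]
    if not all_scores:
        return 0.0
    return sum(1 for s in all_scores if s == 2) / len(all_scores)

# A dialogue with one trivial diversion vs. one diverted at nearly every turn:
sessions = [[1, 1, 1, 1, 2], [2, 2, 1, 2, 2]]
# Session-level MSR counts both equally (1.0); the per-turn rate distinguishes them.
print(per_turn_msr(sessions))  # 5/10 = 0.5
```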
Defenses proposed for the vulnerabilities that MSR exposes include baseline calibration with deterministic deceivers, sycophancy steering via activation contrast, few-shot or chain-of-thought prompting, retrieval augmentation for external grounding, inference-time intervention protocols, and human-in-the-loop oversight dashboards (Heitkoetter et al., 2024).
7. Relationship to Complementary Metrics
MSR complements metrics such as ASR (whether harmful output is elicited or suppressed) and ARC (Attack Resource Consumption). In defensive architectures (Li et al., 7 Jan 2026), high MSR aligns with high ARC: misled sessions are resource-intensive for attackers, prolonging their effort and computation. The combination of high MSR and high ARC provides dual evidence of effective deception and of cost imposed on the attacker. In operational systems, MSR serves as a direct indicator of system efficacy, but it should always be contextualized within broader performance, robustness, and efficiency metrics.
Collectively, MSR provides an essential measure for analyzing and benchmarking deception, defense, and matching success across adversarial, operational, and hybrid human-model environments.