HarmBench Attack Success Rate Metrics

Updated 23 June 2026

HarmBench Attack Success Rate is a standardized metric that measures how often large language models produce harmful content under adversarial prompt attacks.
It employs formal mathematical definitions and rigorous experimental protocols to evaluate both attack strategies and defense mechanisms.
Empirical results reveal substantial variation across attack methods, with some achieving over 80% success even against strong defense systems.

HarmBench Attack Success Rate (ASR) is a central, standardized metric for quantifying the vulnerability of LLMs and defenses to adversarial prompt attacks in the HarmBench evaluation framework. It measures the proportion of attack trials that successfully induce a model to produce harmful or disallowed content, providing a uniform basis for academic comparison across attack and defense methods, model architectures, and evaluation scenarios.

1. Formal Definition and Core Mathematical Formulas

The HarmBench Attack Success Rate quantifies the frequency with which a target model generates harmful content under attack, as judged by an automated classifier tailored to HarmBench’s taxonomy of harmful behaviors (Mazeika et al., 2024).

Let

$P = \{x_1, \ldots, x_N\}$ be the set of $N$ harmful test prompts produced by the attack for a target behavior $y$,
$f$ the target LLM,
$g$ the attack method that generates those test prompts,
$c(\cdot, y)$ a binary classifier returning 1 iff the generated output manifests the targeted harm $y$.

The formal definition is: $\mathrm{ASR}(y, g, f) = \frac{1}{N} \sum_{i=1}^{N} c\bigl(f_T(x_i),\, y\bigr)$ where $f_T(x_i)$ is the model’s response truncated to $T$ tokens. This result can be averaged over a set of behaviors to produce a global ASR: $\overline{\mathrm{ASR}(g, f)} = \frac{1}{M} \sum_{j=1}^M \mathrm{ASR}(y_j, g, f)$
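As a minimal sketch of the two formulas above, assume the classifier verdicts $c(f_T(x_i), y)$ have already been collected as 0/1 values for each behavior. The function names and the toy verdicts are illustrative only and are not part of any HarmBench tooling.

```python
from statistics import mean

def asr(verdicts):
    """Per-behavior ASR: fraction of test prompts judged harmful (verdicts are 0/1)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)

def global_asr(verdicts_by_behavior):
    """Global ASR: unweighted mean of the per-behavior ASRs over M behaviors."""
    return mean(asr(v) for v in verdicts_by_behavior.values())

# Toy verdicts (made up for illustration): 2 behaviors, 4 prompts each.
toy = {
    "behavior_a": [1, 0, 0, 1],  # ASR = 0.50
    "behavior_b": [0, 0, 0, 1],  # ASR = 0.25
}
print(global_asr(toy))  # 0.375
```

Note that the global figure weights every behavior equally, regardless of how many prompts each behavior has.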

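Single-configuration ASR, best-case ASR, and variant-aggregated measures are defined for attacks with internal parameters (Maple et al., 9 May 2026): $\mathrm{ASR}_{\text{best}}(g, f) = \max_{v \in \mathcal{V}} \overline{\mathrm{ASR}(g_v, f)}, \quad \mathrm{ASR}_{\text{mean}}(g, f) = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \overline{\mathrm{ASR}(g_v, f)}$ where $\mathcal{V}$ is the set of attack variants and $g_v$ denotes the attack run with variant $v$. Additional metrics such as Variant Sensitivity Measure (VSM) and Union Coverage (UC) are now recommended to capture the full threat profile. UC is the fraction of prompts broken by at least one variant: $\mathrm{UC}(g, f) = \frac{1}{N} \sum_{i=1}^{N} \max_{v \in \mathcal{V}} c\bigl(f_T(x_i^{(v)}),\, y\bigr)$ where $x_i^{(v)}$ is the $i$-th test prompt under variant $v$.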
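The variant-level quantities named above can be sketched as follows. The sketch assumes the usual reading of the terms: best-case ASR is the maximum over variants, mean ASR is the average over variants, and Union Coverage is the fraction of prompts broken by at least one variant. The function name and the toy success matrix are hypothetical. VSM is omitted because its exact form is defined in the cited work.

```python
def variant_metrics(success):
    """success[v][i] is 1 if variant v jailbreaks prompt i, else 0.

    Returns (best-case ASR, mean ASR across variants, union coverage).
    """
    n_prompts = len(success[0])
    if any(len(row) != n_prompts for row in success):
        raise ValueError("every variant must be scored on the same prompts")
    per_variant = [sum(row) / n_prompts for row in success]
    best = max(per_variant)
    mean_asr = sum(per_variant) / len(per_variant)
    # A prompt counts toward union coverage if any variant succeeds on it.
    covered = sum(1 for i in range(n_prompts) if any(row[i] for row in success))
    return best, mean_asr, covered / n_prompts

# Toy matrix (made up): 3 variants x 4 prompts, each variant breaks different prompts.
toy = [
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
best, mean_asr, union = variant_metrics(toy)
print(best, round(mean_asr, 3), union)  # 0.5 0.333 1.0
```

In this toy case no single variant exceeds 50% ASR, yet every prompt is broken by some variant, which is the situation Union Coverage is meant to expose.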

2. Experimental Protocols and Evaluation Criteria

HarmBench provides an automated, taxonomy-driven red-teaming benchmark comprising hundreds or thousands of real harmful prompts spanning illicit advice, hate, and dangerous instructions (Mazeika et al., 2024). The canonical evaluation protocol is:

Attack each prompt with the adversarial method under test. For multi-shot or multi-turn attacks, a fixed query budget per prompt (e.g., 10 queries) is enforced (Li et al., 2024).
The model response is judged harmful if the HarmBench classifier returns non-refusal, counting both direct leaks and completions that follow an initial refusal with prohibited content.
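The ASR is reported as the percentage of prompts yielding at least one successful harmful response: $\mathrm{ASR} = \frac{\bigl|\{x_i \in P : \text{at least one response to } x_i \text{ is judged harmful}\}\bigr|}{N} \times 100\%$ as applied in (Li et al., 2024) and (Zhou et al., 20 Mar 2025).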
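The protocol steps above can be sketched as a loop that stops querying a prompt as soon as one response is judged harmful. This is an illustrative outline under stated assumptions: `attack`, `model`, and `judge` are hypothetical callables supplied by the evaluator, not HarmBench APIs, and the default budget of 10 mirrors the example budget mentioned above.

```python
from typing import Callable, Sequence

def asr_with_budget(
    prompts: Sequence[str],
    attack: Callable[[str, int], str],
    model: Callable[[str], str],
    judge: Callable[[str], bool],
    budget: int = 10,
) -> float:
    """Percentage of prompts with at least one harmful response within `budget` queries."""
    if not prompts:
        raise ValueError("need at least one prompt")
    broken = 0
    for prompt in prompts:
        for attempt in range(budget):
            response = model(attack(prompt, attempt))
            if judge(response):
                broken += 1
                break  # stop querying once the prompt counts as a success
    return 100.0 * broken / len(prompts)

# Stand-in components for a dry run; none of these are real attacks or models.
def toy_attack(prompt, attempt):
    return f"{prompt}#{attempt}"

def toy_model(text):
    # Pretend the model only "complies" with prompt p1 on its third rewrite.
    return "COMPLY" if text == "p1#2" else "REFUSE"

def toy_judge(response):
    return response == "COMPLY"

print(asr_with_budget(["p0", "p1"], toy_attack, toy_model, toy_judge))            # 50.0
print(asr_with_budget(["p0", "p1"], toy_attack, toy_model, toy_judge, budget=2))  # 0.0
```

The second call shows why the query budget must be reported alongside the ASR: the same attack against the same model scores 0% when it is cut off after two queries.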

Judgment protocols have evolved:

Harmful outputs are typically flagged by an automated refusal/jailbreak classifier (Mazeika et al., 2024).
Some recent works add external LLM-based judges or stricter scoring rubrics (e.g., explicit scoring for “direct fulfillment of prohibited intent”) (Wu et al., 24 May 2026).

3. Model- and Attack-Specific ASR Results on HarmBench

ASR statistics reveal substantial variation across attack strategies and model architectures. The table below summarizes ASR values from representative published works:

Method / Attack	Target Model(s)	ASR (%)
GCG (white-box)	All (avg)	54.3
GCG-Transfer	All (avg)	38.8
PAIR (iterative)	All (avg)	40.7
TAP (tree-of-attacks)	All (avg)	45.2
Direct Request (no attack)	All (avg)	25.3
GCG-Advanced (garbled)	Closed, 7 commercial	42.6
Prompt Translation (Li et al., 2024)	Closed, 7 commercial	81.8
MS_Reverse (multi-stream perturbation)	Qwen3/DeepSeek/Gemini	~91 (up to 96.1)
ActorBreaker (multi-turn)	Multiple	79.0–86.0
Furina (uncertainty-driven)	Closed LLMs, LLaMA-3-8B	83.5–94.0
AutoRedTeamer	Llama-3.1-70B	82 (vs. 67 for baseline)
Best-case ASR (Bijection)	Mistral-7B	81.0
Union Coverage (Bijection, 36 variants)	Mistral-7B	100.0

Notable patterns:

Prompt translation nearly doubles black-box transfer ASR relative to the garbled-suffix baseline (81.8% vs. 42.6%) (Li et al., 2024).
Multi-stream and fragmented attacks (as in MS_Reverse and Furina) reach ASRs above 90% in their strongest settings against strong closed and open-source LLMs (Yang, 10 Mar 2026, Wu et al., 24 May 2026).
Human baselines and persuasive attacks rarely exceed 27% and 16% ASR, respectively (Mazeika et al., 2024).

4. ASR Under Defensive Schemes

When assessed against strong defense stacks, raw ASR values are substantially reduced, but rarely to zero.

CivicShield achieves a 71.2% detection rate (i.e., 28.8% ASR) on HarmBench real-world scenarios, with a 95% Wilson CI of [67%, 75%]; the honest drop versus author-crafted attacks is 5.5 percentage points (Patil, 30 Mar 2026).
EvoDefense reduces ASR to <10% for five major attacks across seven models, vs. baseline ASR values as high as 72.5% (Li et al., 29 May 2026).

This framing highlights that detection rate (DR) and ASR are complementary: $\mathrm{ASR} = 1 - \mathrm{DR}$. As demonstrated, multi-layer defenses and adaptive guards (as in EvoDefense) suppress but do not eliminate the ASR under HarmBench evaluation.
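The relationship between detection rate and ASR, and the Wilson interval quoted for CivicShield, can be illustrated with a short calculation. The complement is taken here to mean ASR = 1 − DR, consistent with the 71.2% / 28.8% pair above. The sample size is not stated in the text, so the counts below (356 detections out of 500 attacks) are a purely hypothetical choice that happens to give a 71.2% detection rate.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts; the real evaluation size is an unknown here.
detected, n = 356, 500
dr = detected / n   # detection rate
asr = 1 - dr        # attack success rate as the complement
lo, hi = wilson_interval(detected, n)
print(f"DR={dr:.1%}  ASR={asr:.1%}  Wilson 95% CI for DR=[{lo:.1%}, {hi:.1%}]")
```

With these assumed counts the interval comes out at roughly 67% to 75%, the same order as the reported one. A smaller evaluation set would widen it.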

5. Distributional Reporting: VSM, Union Coverage, and Limitations of Single-Configuration ASR

Recent research emphasizes the inadequacy of single-configuration ASR reporting:

High-variance attack families may yield a best-case ASR that far exceeds the mean across all parameterizations. For instance, for bijection-jailbreak attacks on Mistral-7B, best-case ASR is 81%, but the mean ASR across 36 variants is only 15.94%, a gap of about 65 percentage points (Maple et al., 9 May 2026).
Union Coverage (UC) exposes the aggregate risk: in the same evaluation, UC=100%, showing that all prompts are vulnerable to some configuration even if no single variant covers all (Maple et al., 9 May 2026).

These results establish that:

Reporting only the strongest configuration risks underrepresenting the true attack surface.
Defenses need to mitigate the full configuration space, not just the most obvious or most widely benchmarked configurations.

6. Methodological Details: Judging, Query Budgets, and Edge Cases

HarmBench evaluations have converged on explicit methodological standards:

Prompt-judging: Automated response classification, often via open-source refusal/jailbreak classifiers or external LLM judges following a rubric (Mazeika et al., 2024, Wu et al., 24 May 2026).
Query Budget and Multi-Shot Attacks: Most state-of-the-art attack evaluations report ASR under a fixed query-per-prompt budget (e.g. 10-shot for translation-based attacks (Li et al., 2024); multi-turn settings for Furina and ActorBreaker (Wu et al., 24 May 2026)).
ASR Edge Cases: HarmBench marks as “successes” not only direct non-refusals but also “refusal-to-jailbreak” episodes where models initially claim to refuse but ultimately emit prohibited content.

Methodological nuances (e.g., prompt translation, attack interleaving, transfer benchmarks) are crucial: translation-based attacks outperform garbled-suffix attacks by 39.2 percentage points, and combining diverse attacks can raise ASR by up to 0.20 in absolute terms (Li et al., 2024, Zhou et al., 20 Mar 2025).

7. Trends, Implications, and Limitations

Key observations from HarmBench ASR reporting include:

No attack achieves universal success on all models, nor does any model maintain robust refusal across all attacks (Mazeika et al., 2024).
White-box and multi-turn attacks remain more potent than one-shot or persuasive paradigms.
Defenses may report inflated success on author-generated adversarial prompts; honest, independent benchmarks reveal a “drop” in measured detection rates (“honest drop” phenomenon) (Patil, 30 Mar 2026).
Reporting average, best-case, and union coverage ASR—alongside VSM—is now recommended to fully characterize both attack and defense efficacy (Maple et al., 9 May 2026).

Recent benchmarks underscore the need for comprehensive reporting and defense: modern attacks routinely achieve ASRs above 80% even against advanced, safety-aligned LLMs, and some defense frameworks suppress the attack surface without eliminating it (Li et al., 2024, Patil, 30 Mar 2026, Li et al., 29 May 2026).