Attack Success Rates in Adversarial ML
- Attack Success Rate (ASR) is a metric defined as the percentage of adversarial examples that induce the attacker's intended misclassification, used across a wide range of domains.
- Methodologies to optimize ASR include transferability-focused frameworks, perceptual distortion control, and semantic alignment to enhance attack efficacy.
- ASR serves as a benchmark for evaluating both adversarial attack strength and the robustness of defense strategies employed in machine learning models.
The Attack Success Rate (ASR) is a foundational metric in adversarial machine learning, quantifying the proportion of adversarial examples that successfully induce the intended error in the victim model. Across domains, including computer vision, natural language processing, and speech recognition, ASR serves both as the primary measure of attack efficacy and as a diagnostic of model robustness, transferability characteristics, and the effectiveness of newly proposed attack methodologies. Its definition, calculation procedures, role in benchmarking, and use in defense evaluation have become increasingly sophisticated as the adversarial threat landscape evolves.
1. Foundational Definition and Calculation
ASR is generally defined as the percentage of adversarial inputs that, when presented to a trained model, produce a misclassification (or, under a targeted attack, a specific predicted label). Formally, for a set of adversarial examples $\{x'_i\}_{i=1}^{N}$ with ground-truth labels $\{y_i\}_{i=1}^{N}$ and a victim model $f$, the ASR is computed as

$$\mathrm{ASR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ f(x'_i) \neq y_i \right] \times 100\%,$$

where $\mathbb{1}[\cdot]$ is the indicator function, returning 1 if the attack succeeds (e.g., $f(x'_i) \neq y_i$ for an untargeted attack, or $f(x'_i) = y_i^{\text{target}}$ for a targeted one) and 0 otherwise. This simple rate is widely applicable across tasks, ranging from image classification (Zhou et al., 2020) and text model evasion (Li et al., 2020) to physical attacks on perception systems (Sato et al., 2020).
The precise operationalization of “success” depends on context: for untargeted attacks it denotes any misclassification; for targeted attacks, the output must match a specific adversarial label. In speech recognition, success may require an exact transcript match or a substantial word error (Wu et al., 2021). In LLM red teaming, ASR is the sample mean over binary judgments of whether an attack succeeded under each trial configuration (Freenor et al., 29 Jul 2025).
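Concretely, the computation reduces to a mean over per-example success indicators. The following is a minimal sketch covering both the untargeted and targeted conventions described above; the function and argument names are illustrative, not drawn from any cited work.

```python
import numpy as np

def attack_success_rate(pred_labels, true_labels, target_labels=None):
    """Compute ASR as a percentage over a batch of adversarial examples.

    pred_labels   : model predictions on the adversarial inputs
    true_labels   : ground-truth labels of the original inputs
    target_labels : if given, success = prediction matches the attacker's
                    chosen label (targeted attack); otherwise success = any
                    misclassification (untargeted attack).
    """
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    if target_labels is None:
        success = pred_labels != true_labels                  # untargeted
    else:
        success = pred_labels == np.asarray(target_labels)    # targeted
    return 100.0 * success.mean()

# Example: 3 of 4 adversarial examples flip the prediction -> 75% ASR
print(attack_success_rate([2, 0, 1, 3], [2, 1, 0, 0]))  # 75.0
```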
2. ASR as an Attack Benchmark and Comparative Metric
ASR has become the canonical metric for comparing adversarial attacks and defenses. New attack strategies are often validated explicitly by demonstrating higher ASRs relative to state-of-the-art baselines, with fixed perturbation budgets (e.g., $\ell_p$-norm bounds or perceptual equivalence constraints), and sometimes under strict black-box or physical-world conditions.
In black-box image settings, adversarial imitation attacks (Zhou et al., 2020) improved ASRs over substitute-model and query-based alternatives by both optimizing transferability and minimizing sample and query requirements. In targeted attacks on language models (Li et al., 2020), ASR is complemented with semantic and fluency assessments but remains the lead indicator of attack efficacy. In physical attacks, such as dirty road patch attacks on automotive systems, reported ASRs exceeding 97.5% (often reaching 100%) directly quantify control over system behavior, capturing both reliability and real-world risk (Sato et al., 2020).
ASR is typically reported alongside related quantities such as average perturbation size (e.g., mean $\ell_2$ or $\ell_\infty$ norm), number of queries, or attack time, to capture trade-offs among success, stealthiness, and efficiency.
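To make these trade-offs explicit, a single evaluation run is often summarized by ASR together with perturbation norms and query cost under a fixed budget. The sketch below assumes image inputs in [0, 1] and an $\ell_\infty$ budget of 8/255; all names and the reporting format are hypothetical, not a reproduction of any cited benchmark.

```python
import numpy as np

def summarize_attack(clean_x, adv_x, pred_labels, true_labels, queries, eps=8/255):
    """Summarize one attack run: ASR under a fixed l_inf budget plus stealth/cost stats."""
    delta = np.asarray(adv_x) - np.asarray(clean_x)
    flat = delta.reshape(len(delta), -1)
    linf = np.abs(flat).max(axis=1)
    within_budget = linf <= eps + 1e-8                        # only count in-budget examples
    success = (np.asarray(pred_labels) != np.asarray(true_labels)) & within_budget
    return {
        "asr_percent": 100.0 * success.mean(),
        "mean_linf": float(linf.mean()),
        "mean_l2": float(np.linalg.norm(flat, axis=1).mean()),
        "mean_queries": float(np.mean(queries)),
    }
```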
| Domain | Typical ASR Range | Context |
|---|---|---|
| Image classification | 70–99% | Depends on attack/defense, e.g., mimic-white-box settings |
| Text classification | 80–98% | High ASR with low perturbation rates |
| Speech recognition | 0–100% | 100% possible in white-box; much lower under transfer |
| Physical-world attacks | 50–100% | Strong attacks (e.g., dynamic UV mapping) |
3. Methodological Factors Influencing ASR
Attack methodologies often revolve around maximizing ASR under tight operational constraints:
- Transferability-centric frameworks (e.g. adversarial imitation (Zhou et al., 2020), semantic cropping for LVLMs (Li et al., 13 Mar 2025), DropConnect-based self-ensembling (Su et al., 24 Apr 2025)) are explicitly designed to improve cross-model ASRs by focusing perturbations on features or regions most likely to generalize.
- Perceptual Distortion Control constrains optimization via auxiliary metrics (e.g., SSIM, MAD, LPIPS) and adaptive weighting. For example, the Perceptual Distortion Reduction (PDR) approach (Yang et al., 2021) jointly optimizes attack loss and distortion to achieve high ASR with minimal visual artifacts (a generic sketch of such a joint objective follows this list). Adaptive scaling of perturbation steps (Yuan et al., 2021) also increases ASR by preserving gradient magnitude and direction fidelity.
- Semantic Alignment and Targeting: In multimodal attacks on LVLMs, aligning adversarial examples to semantically meaningful features (rather than uniform perturbation) yields ASRs above 90%, even on robust commercial systems (Li et al., 13 Mar 2025). In NLP backdoor settings, label-specific triggers are optimized to induce targeted errors in ≥99% of cases (Zhao et al., 18 Mar 2024).
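As a rough illustration of the distortion-controlled objective mentioned above, the sketch below combines a misclassification loss with a simple distortion penalty. It uses MSE as a stand-in for the perceptual metrics and adaptive weighting of PDR, so it should be read as a generic template rather than the cited method.

```python
import torch
import torch.nn.functional as F

def joint_adversarial_objective(model, x_adv, x_clean, y, lam=0.1):
    """Generic joint objective: encourage misclassification of x_adv while
    penalizing distortion relative to x_clean. (Illustrative only: PDR uses
    perceptual metrics such as MAD/LPIPS and adaptive weighting, not MSE.)"""
    attack_loss = F.cross_entropy(model(x_adv), y)   # large while x_adv is still classified as y
    distortion = F.mse_loss(x_adv, x_clean)          # stand-in perceptual penalty
    # Minimizing this drives the prediction away from y while keeping x_adv close to x_clean.
    return -attack_loss + lam * distortion
```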
Perturbation co-adaptation—where attack updates overfit the surrogate model—drives down transfer ASR. Mitigation tactics such as randomness injection in model parameters (DropConnect) generate more generic, highly transferable perturbations (Su et al., 24 Apr 2025).
ASR is highly sensitive to attacker knowledge; white-box attacks tend to reach higher ASR compared to black-box or physical-world scenarios, where transferability and stealth constraints dominate.
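The gap between white-box and transfer ASR can be made concrete by crafting perturbations on a surrogate model and scoring them on both the surrogate and an unseen target. The sketch below assumes generic PyTorch-style classifiers and uses a single-step FGSM-like attack purely for illustration; the model names and the attack choice are placeholders, not the methods of the cited works.

```python
import torch

def fgsm(model, x, y, eps=8/255):
    """Single-step l_inf attack, used here only to illustrate transfer evaluation."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()

@torch.no_grad()
def asr(model, x_adv, y):
    """Untargeted ASR (%) of a batch of adversarial examples against one model."""
    return 100.0 * (model(x_adv).argmax(dim=1) != y).float().mean().item()

def whitebox_vs_transfer(surrogate, target, x, y):
    x_adv = fgsm(surrogate, x, y)                        # crafted on the surrogate
    return {"whitebox_asr": asr(surrogate, x_adv, y),    # typically high
            "transfer_asr": asr(target, x_adv, y)}       # usually much lower
```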
4. ASR in Defense Evaluation and Robustness Assessment
ASR serves as a core metric not only for attack strength but also for defense efficacy. Defensive strategies aim to lower the ASR while preserving model utility on clean data:
- Input Transformation Defenses (e.g., JPEG compression, denoising, median blurring) may reduce the ASR of digital perturbations in controlled conditions, but often fail against adaptive or physical-world attacks unless they severely degrade benign accuracy (Sato et al., 2020).
- Backdoor Defense: Entropy-based separation of clean and poisoned data combined with CLIP-guided unlearning reduced ASRs to below 1% across 11 backdoor types, regardless of trigger complexity or attack style (Xu et al., 7 Jul 2025). Here, maintaining a low ASR is explicitly correlated with effective defense under both dirty-label and clean-label attack settings.
- Latent Geometry Alignment: In LLM safety, geometry-aware alignment (GRACE) reduces the ASR by up to 39% by explicitly separating the latent representations of safe and adversarial outputs (Khanna et al., 10 Jun 2025). A high ASR after defense indicates failure of the mitigation to alter the relevant internal representations.
Emerging metrics such as the Attack Successful Rate Difference (ASRD) in NLP (Shen et al., 2022) further isolate the causal effect of triggers on ASR by subtracting out OOD and data-perturbation-induced misclassifications from the baseline.
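A defense evaluation in this spirit tracks the drop in ASR against the preservation of clean accuracy, and an ASRD-style correction subtracts a no-trigger baseline so that only trigger-specific misclassifications are credited to the attack. The sketch below is a schematic reading of those ideas with hypothetical names; consult Shen et al. (2022) for the exact ASRD definition.

```python
def defense_report(asr_before, asr_after, clean_acc_before, clean_acc_after):
    """Summarize a defense by its ASR reduction and its clean-accuracy cost."""
    return {
        "asr_reduction": asr_before - asr_after,                 # larger is better
        "clean_acc_drop": clean_acc_before - clean_acc_after,    # smaller is better
    }

def asrd(asr_with_trigger, asr_without_trigger):
    """ASRD-style correction: misclassification attributable to the trigger itself,
    after subtracting the baseline rate induced by OOD/data perturbations alone."""
    return asr_with_trigger - asr_without_trigger
```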
5. ASR as an Indicator of Transferability and Robustness Gaps
ASR also exposes cracks in robustness and generalizability:
- Transferability Gaps: Even successful white-box image and audio attacks often fail to transfer in black-box settings; for instance, attack success rates against automatic speech recognition systems may drop to near zero for transferred adversarial examples, a stark contrast with the image domain, where cross-model ASRs remain high (Abdullah et al., 2020).
- Physical-World Resilience: Physical attacks that maintain high ASRs under extensive pose, lighting, and camera variation—such as dynamic-NeRF-based attacks on person detection (92.75% ASR on FastRCNN, 49.5% on YOLOv8 (Li et al., 10 Jan 2025))—are notable for their real-world impact where traditional patch attacks fall short.
- Prompt Discoverability in LLM Red Teaming: In prompt-based adversarial evaluation, ASR is interpreted as a distribution over repeated trials (discoverability), capturing the likelihood of attack success across random seeds, system states, or input contexts (Freenor et al., 29 Jul 2025).
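Treating ASR as a distribution over repeated trials amounts to estimating a Bernoulli success probability, so it is natural to report an interval alongside the point estimate. The sketch below uses a standard Wilson score interval; it is a generic illustration, not the specific procedure of Freenor et al. (2025).

```python
import math

def asr_with_wilson_ci(successes, trials, z=1.96):
    """Point estimate of ASR (%) plus a 95% Wilson score interval over repeated trials."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return 100 * p, (100 * max(0.0, center - half), 100 * min(1.0, center + half))

# e.g., a jailbreak prompt succeeding in 7 of 20 independent trials
print(asr_with_wilson_ci(7, 20))  # 35.0% ASR, roughly (18.1%, 56.7%) interval
```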
High ASR reveals practical risk: methods achieving near-perfect ASRs using only hard-label access, minimal data, or zero queries demonstrate that model security cannot be guaranteed by obscurity, query limiting, or withholding gradient access (Zhou et al., 2020, Fang et al., 27 Jun 2024).
6. Limitations and Multidimensional Assessment
While ASR is essential, relying solely on ASR can mask nuances:
- In stealthy or OOD backdoor settings, high ASR may not pinpoint the effect of an explicit trigger but may instead reflect the model’s broader misclassification tendencies. The introduction of ASRD provides a more discriminative evaluation (Shen et al., 2022).
- In some domains, maximizing ASR without regard to perturbation perceptibility in audio or image space can yield impractical or human-detectable attacks. Recent approaches explicitly track both ASR and perceptual metrics (e.g., SNR, SSIM, semantic similarity) to ensure that a high ASR is meaningful in real-world usage (Yang et al., 2021, Abdullah et al., 2021, Zhao et al., 18 Mar 2024); a minimal reporting sketch follows this list.
- ASR as a statistic is most informative when reported along with the operational definitions of “success,” the attack threat model, and corresponding baselines.
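In that spirit, a multidimensional report pairs ASR with a perceptual score so that a high success rate cannot hide conspicuous perturbations. The sketch below uses SSIM from scikit-image as one such perceptual proxy; the pairing of metrics and the reporting format are illustrative assumptions, not a prescribed protocol.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def asr_and_perceptibility(clean_x, adv_x, pred_labels, true_labels):
    """Report untargeted ASR together with mean SSIM between clean and adversarial images."""
    success = np.asarray(pred_labels) != np.asarray(true_labels)
    ssim_scores = [
        ssim(c, a, data_range=1.0, channel_axis=-1)   # HWC float images in [0, 1]
        for c, a in zip(clean_x, adv_x)
    ]
    return {"asr_percent": 100.0 * success.mean(),
            "mean_ssim": float(np.mean(ssim_scores))}
```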
7. Outlook and Future Research
The measurement and maximization of ASR provide essential objective foundations for adversarial research, but the metric’s interpretation continues to grow in sophistication:
- Ongoing work seeks to link ASR to underlying geometric, semantic, or representational properties that mediate attack transferability and defense effectiveness (Khanna et al., 10 Jun 2025).
- Red teaming frameworks increasingly leverage per-attack ASR distributions to characterize attack discoverability and system vulnerability in an operational sense (Freenor et al., 29 Jul 2025).
- Defense evaluation relies on tracking reduction in ASR as a direct indicator of system hardening, making it critical to architect methodologies and benchmarks that test ASR under the most challenging and realistic attack scenarios.
In summary, Attack Success Rate remains a central, task-invariant metric that encapsulates both adversarial threat and model robustness. Its mathematical foundation and empirical tracking are pivotal for both attack benchmarking and principled progress in designing robust machine learning systems.