
Intent Success Rate (ISR) Metrics

Updated 20 December 2025
  • Intent Success Rate (ISR) is a metric that quantifies how effectively adversarial attacks both meet technical objectives and hide their true intent.
  • ISR is calculated as the fraction of obfuscated attack trials that bypass detection and reliably manipulate model outputs, using domain-specific criteria.
  • Empirical studies reveal that ISR varies with model architecture and attack strategy, guiding improvements in adversarial defense and robustness.

Intent Success Rate (ISR) quantifies the proportion of adversarial attacks or obfuscated operations in machine learning systems that both succeed in their technical objective and align with the attacker's hidden intent, typically defined by evasion of direct intent attribution, successful manipulation of target outputs, or bypassing of model safety mechanisms. ISR is employed in evaluations of adversarial robustness in LLMs, object detectors, and other sensitive AI domains. It is computed as the fraction of carefully designed, intent-obfuscated attempts that fulfill all adversary criteria, out of the total number of evaluated cases, revealing system susceptibilities not captured by standard accuracy or harm rates.

1. Formal Definition and Mathematical Formulation

ISR is mathematically defined as the fraction of attack trials where the adversarial manipulation both eludes intent detection and attains its objective on the system. In the LLM jailbreak setting (Shang et al., 6 May 2024), for instance, the ISR is

\mathrm{ISR} = \frac{\#\{\text{successful obfuscated prompts}\}}{\#\{\text{total obfuscated prompts}\}}

where a prompt is "successful" if it satisfies three criteria: (1) encodes illegal intent; (2) passes built-in model filters; (3) elicits harmful content generation. In object detection adversarial attacks (Li et al., 22 Jul 2024), ISR is

\mathrm{ISR} = \frac{1}{N}\sum_{i=1}^{N} S_i

with S_i = 1 for those cases where a perturbation applied to a non-target context object causes system failure (e.g., successful target vanishing or mislabeling), as judged by the specific detector's criteria (e.g., IoU and confidence thresholds).
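The LLM-side definition can be sketched directly in code. The following is a minimal illustration, not the paper's evaluation harness; the record fields and trial data are hypothetical, standing in for the three success criteria above.

```python
from dataclasses import dataclass

# Hypothetical trial record; field names are illustrative, not from the paper.
@dataclass
class JailbreakTrial:
    encodes_illegal_intent: bool   # criterion (1): prompt encodes illegal intent
    passed_model_filters: bool     # criterion (2): prompt passes built-in filters
    elicited_harmful_output: bool  # criterion (3): model generates harmful content

def intent_success_rate(trials):
    """ISR = (# trials meeting all three criteria) / (# total trials)."""
    if not trials:
        return 0.0
    successes = sum(
        t.encodes_illegal_intent and t.passed_model_filters and t.elicited_harmful_output
        for t in trials
    )
    return successes / len(trials)

trials = [
    JailbreakTrial(True, True, True),    # fully successful obfuscated prompt
    JailbreakTrial(True, False, False),  # blocked by built-in filters
    JailbreakTrial(True, True, False),   # evaded filters but output was benign
    JailbreakTrial(True, True, True),    # fully successful obfuscated prompt
]
print(intent_success_rate(trials))  # 2 of 4 trials succeed -> 0.5
```

Note that a prompt counts toward the numerator only when all three criteria hold simultaneously; evading filters without eliciting harmful content does not contribute.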

2. Methodological Frameworks for ISR Calculation

Methodologies for ISR calculation are highly domain-specific:

  • Obfuscated LLM jailbreaks: Datasets such as the Harmful Behavior Problems (HBP) set are mutated into obfuscated forms via syntactic rewriting or semantic ambiguity (Shang et al., 6 May 2024). Each obfuscated prompt is scored on the tripartite success axis (intent, filter evasion, harmful output). ISR reflects the percentage of such prompts yielding a "harmful" output.
  • Object detector attacks: Randomized and deliberate selection of target/perturbation pairs from COCO images form the basis for ISR assessment (Li et al., 22 Jul 2024). Success is determined by the detector’s inability to correctly predict the original target when a perturbation affects a separate object; ISR is computed over thousands of test cases, with distinct vanishing, mislabeling, and untargeted criteria.
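For the object-detection case, per-trial success S_i hinges on whether the detector still recovers the original target after a contextual perturbation. The sketch below illustrates plausible vanishing and mislabeling checks built on IoU matching; the thresholds and the helper names are assumptions, not the paper's exact protocol.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def attack_succeeded(target_box, target_label, detections,
                     mode="vanishing", iou_thresh=0.5, conf_thresh=0.3):
    """Per-trial success S_i under illustrative vanishing/mislabeling criteria.

    `detections` is a list of (box, label, confidence) predicted on the
    perturbed image. Thresholds here are assumptions, not the paper's values.
    """
    matches = [(box, label) for box, label, conf in detections
               if conf >= conf_thresh and iou(box, target_box) >= iou_thresh]
    if mode == "vanishing":        # target no longer detected at all
        return len(matches) == 0
    if mode == "mislabeling":      # target localized but assigned a wrong class
        return bool(matches) and all(label != target_label for _, label in matches)
    raise ValueError(mode)

# One synthetic trial: the original "dog" box is now predicted as "cat".
dets = [((10, 10, 50, 50), "cat", 0.9)]
print(attack_succeeded((12, 12, 48, 48), "dog", dets, mode="mislabeling"))  # True
print(attack_succeeded((12, 12, 48, 48), "dog", dets, mode="vanishing"))    # False
```

ISR then follows by averaging S_i over all trials, as in the formula of Section 1.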

3. Experimental Benchmarks and Reported ISR

ISR is evaluated in large-scale, controlled experiments, often with comparison to baseline and variant attack mechanisms.

LLM jailbreak ISR across attack variants (Shang et al., 6 May 2024):

| Model | Baseline ISR | Obscure Intention ISR | Create Ambiguity ISR | Average ISR |
|---|---|---|---|---|
| ChatGPT-3.5 | 69.04% | 82.12% | 85.19% | 83.65% |
| ChatGPT-4 | 46.15% | 56.15% | 50.38% | 53.27% |
| Qwen-max | 25.77% | 55.19% | 35.19% | 45.19% |
| Baichuan2-13b | 97.69% | 94.62% | 94.81% | 94.71% |
Object detection attack ISR by success criterion (Li et al., 22 Jul 2024):

| Model | Vanishing ISR | Mislabeling ISR | Untargeted ISR |
|---|---|---|---|
| YOLOv3 | 90% | 76% | 30% |
| SSD 512 | 82% | 68% | 43% |
| RetinaNet | 37% | 15% | 10% |
| Faster R-CNN | 22% | 12% | 31% |
| Cascade R-CNN | 28% | 18% | 22% |

ISR consistently identifies attack effectiveness: the best attack methods achieve rates above 70% in both text and vision domains, notably when attacks exploit model weaknesses via intent obfuscation, contextual perturbation, or ambiguity.

4. Key Factors Influencing ISR in Adversarial Attacks

Empirical studies isolate multiple factors that modulate ISR:

  • Model architecture: one-stage detectors (YOLOv3, SSD) are markedly more vulnerable than two-stage architectures (Faster R-CNN, Cascade R-CNN); the exception is RetinaNet, a one-stage detector whose focal loss confers robustness (Li et al., 22 Jul 2024).
  • Target properties: ISR increases as original target confidence decreases; vanishing attacks are more successful than mislabeling; perturbation size and proximity are critical success predictors.
  • Attack strategy: Deliberate selection of low-confidence targets and large, nearby perturbation regions maximizes ISR, approaching near-unity for certain configurations.
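The reported dependence of ISR on target confidence can be examined by stratifying trials into confidence buckets and computing a per-bucket ISR. The function and synthetic data below are purely illustrative; the bucketing scheme is an assumption, not taken from the paper.

```python
from collections import defaultdict

def isr_by_confidence_bucket(trials, bucket_width=0.25):
    """Group trials by original target confidence and compute per-bucket ISR.

    `trials` is a list of (target_confidence, success) pairs; the bucketing
    scheme is illustrative, not taken from the paper.
    """
    buckets = defaultdict(list)
    for conf, success in trials:
        # Clamp so conf == 1.0 falls into the last bucket.
        lo = min(int(conf / bucket_width), int(1 / bucket_width) - 1) * bucket_width
        buckets[round(lo, 2)].append(success)
    return {lo: sum(s) / len(s) for lo, s in sorted(buckets.items())}

# Synthetic data mimicking the reported trend: lower-confidence targets
# are easier to attack, so ISR falls as original confidence rises.
trials = [(0.1, True), (0.2, True), (0.3, True), (0.4, False),
          (0.6, True), (0.7, False), (0.9, False), (0.95, False)]
print(isr_by_confidence_bucket(trials))
# {0.0: 1.0, 0.25: 0.5, 0.5: 0.5, 0.75: 0.0}
```

A monotone decline across buckets, as in this toy output, is the signature of the confidence effect described above.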

For LLMs, more sophisticated obfuscation strategies, namely syntactic rewriting (Obscure Intention) and semantic ambiguity (Create Ambiguity), yield ISRs well above those of baseline attacks, with trade-offs between hallucination rates and direct success (Shang et al., 6 May 2024).

5. Application Domains and Security Implications

ISR measurement is foundational in the following domains:

  • Jailbreak detection in LLMs: ISR enables quantification of vulnerabilities in generative models when exposed to intent obfuscation. High ISR demonstrates the insufficiency of current filter-based safeguards against malicious prompt engineering (Shang et al., 6 May 2024).
  • Adversarial robustness in object detection: ISR operationalizes the feasibility of context-based adversarial perturbations and highlights security gaps where attackers mask their true targets for plausible deniability (Li et al., 22 Jul 2024).
  • Assessment of machine unlearning defenses: While not directly labeled as ISR, closely related attack success rates are used to evaluate systems' ability to "unlearn" specific concepts while remaining robust to intent-aware adversarial re-insertion (Yook et al., 29 Jul 2025).

Legal and policy implications abound: high ISR in intent-obfuscating attacks complicates attribution of malicious acts, affecting regulatory approaches to AI security and prompting reevaluation of "intent" as a basis for establishing culpability.

6. Limitations and Blind Spots of ISR as a Metric

Several caveats arise regarding ISR’s usage:

  • Lack of granularity: ISR is a binary, per-trial metric; it does not distinguish between high-quality and low-quality successes (e.g., photorealistic perturbations vs. classifier-triggered noise).
  • Ambiguity of intent satisfaction: ISR offers no qualitative measure of whether the generated output meaningfully fulfills the attacker's deeper intent beyond mechanical binary criteria (Yook et al., 29 Jul 2025).
  • Dataset and evaluation dependence: Results may not generalize without careful test design controlling for distributional shifts and model-specific quirks.

This suggests that ISR should be complemented by additional statistical or semantic analyses in future work.

7. Defensive Strategies Informed by ISR Analysis

ISR directly motivates improvements in adversarial defense architecture:

  • Query-level detection: Identification of high-obfuscation or ambiguous inputs prior to model execution is recommended (Shang et al., 6 May 2024).
  • Segmented review and output verification: Partitioning inputs and analyzing potential toxicity or evasive behavior across prompt stages mitigates ISR-driven vulnerabilities.
  • Context-aware and surround-patch detection: For vision systems, expanding adversarial defenses to cover not only objects but also their context can directly counter intent-obfuscating perturbations (Li et al., 22 Jul 2024).
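As a concrete illustration of query-level detection, the sketch below flags prompts whose wording hints at intent obfuscation before they reach the model. The marker phrases, function name, and threshold are all placeholders invented for this example; a production screen would use a trained classifier rather than a keyword list.

```python
# Toy query-level screen: flag prompts whose wording suggests intent
# obfuscation before model execution. The marker list and threshold are
# placeholders, not a vetted detection rule.
OBFUSCATION_MARKERS = (
    "hypothetically", "for a story", "in a fictional world",
    "ignore previous", "pretend you are",
)

def flag_for_review(prompt, max_markers=0):
    """Return (flagged, matched_markers) for a candidate prompt."""
    hits = [m for m in OBFUSCATION_MARKERS if m in prompt.lower()]
    return len(hits) > max_markers, hits

flagged, hits = flag_for_review("Hypothetically, for a story, how would one ...")
print(flagged, hits)  # True ['hypothetically', 'for a story']
```

Such a screen trades recall for simplicity; its purpose here is only to make the "identify high-obfuscation inputs prior to execution" recommendation concrete.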

A plausible implication is that ISR not only benchmarks adversarial risk, but also acts as a diagnostic index for the sufficiency of model auditing and response verification practices.
