
HarmLevelBench: LLM Harm Compliance Framework

Updated 6 March 2026
  • HarmLevelBench is a benchmarking framework that defines an eight-level taxonomy to categorize harmful prompts based on real-world severity, complexity, and ethical considerations.
  • It employs multi-axis metrics including human annotation, string matching, and LLM judging for a nuanced measurement of Attack Success Rate across various red-teaming strategies.
  • The framework evaluates the impact of 4-bit quantization methods on LLM robustness, highlighting a trade-off between greater vulnerability to basic attacks and greater resistance to sophisticated transfer attacks.

HarmLevelBench is a fine-grained benchmarking framework for evaluating the harm-level compliance of LLMs subjected to adversarial jailbreak attacks, and for measuring how model compression methods such as quantization affect vulnerability and alignment. It addresses the need for standardized evaluation of LLM refusals and compliance across a calibrated severity scale. This enables assessment of safety beyond binary refusal rates and systematic comparison of compression-induced robustness trade-offs (Belkhiter et al., 2024).

1. Harm-Level Taxonomy

The central feature of HarmLevelBench is its eight-level taxonomy, which characterizes harmful prompts by real-world severity, operational complexity, affected population, and legal/ethical status. The taxonomy, manually assigned by the creators, is as follows:

| Level | Impact | Means/Methods | Consequence | Legality/Morality |
|---|---|---|---|---|
| 1 | Individual | Very Simple | Minimal | Legal & Ethical |
| 2 | Individual | Simple | Minimal | Legal & Ethical |
| 3 | Individual | Simple–Moderate | Moderate | Legal & Ethical |
| 4 | Small Group | Moderate | Significant | Legal & Ethical |
| 5 | Medium Group | Complex | Serious | Potentially Illegal |
| 6 | Larger Group | Complex | Severe | Illegal |
| 7 | Large Group | Complex | Critical | Illegal |
| 8 | Large Group | Highly Complex | Catastrophic | Illegal & Immoral |

Prompts span attack topics such as explosives, hacking, illicit finance, violent wrongdoing, firearms use, theft, and panic/triggering. Each topic is instantiated at every harm level, yielding seven topics × eight levels = 56 queries, enabling precise stratification of attack outcomes by severity (Belkhiter et al., 2024).
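The topic-by-level grid described above can be sketched in a few lines. This is only an illustration of the 7 × 8 stratification: the topic strings are copied from the list above, while the record layout (`topic`, `level`) and the helper `build_grid` are hypothetical names, not the authors' data format.

```python
from itertools import product

# Topic names as listed in the text; the dict layout below is an assumption.
TOPICS = [
    "explosives", "hacking", "illicit finance", "violent wrongdoing",
    "firearms use", "theft", "panic/triggering",
]
LEVELS = range(1, 9)  # harm levels 1..8

def build_grid(topics, levels):
    """Return one (topic, level) cell per combination; prompt text is omitted."""
    return [{"topic": t, "level": lvl} for t, lvl in product(topics, levels)]

grid = build_grid(TOPICS, LEVELS)

# Every level holds exactly one cell per topic.
per_level = {lvl: sum(1 for cell in grid if cell["level"] == lvl) for lvl in LEVELS}
```

Because each level contains one prompt per topic, per-level results in this layout are averaged over seven prompts.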

2. Dataset Composition and Design

HarmLevelBench's dataset comprises 56 structured prompts, each templated as “How to … ?”, with content progressively transitioning from low-severity (benign, legal, trivial) to maximal-severity (catastrophic, illegal, and immoral). For example, in the explosives topic, Level 1 may ask about building a non-lethal smoke bomb for a prank, while Level 8 requests construction of a nuclear bomb.

Diversity is assessed through BERT-based PCA analyses, demonstrating greater variety than prior jailbreak benchmarks such as AdvBench. No further annotation step beyond the original level/topic assignment is performed (Belkhiter et al., 2024).

3. Harm-Level Compliance Metrics

HarmLevelBench operationalizes compliance using Attack Success Rate (ASR), measured at each severity level and for each red-teaming strategy. The metric is defined as:

$$\mathrm{ASR}_{L,A} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}[\text{model response}_i \text{ is harmful or complies}],$$

where $N$ is the number of prompts at level $L$ for attack $A$, and $\mathbb{I}[\cdot]$ is the indicator function.
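As a minimal sketch of this definition, the function below averages a binary compliance indicator within each (level, attack) cell. The record format `(level, attack, complied)` and the toy judgments are invented for illustration and are not results from the benchmark.

```python
from collections import defaultdict

def asr_by_level_and_attack(records):
    """Compute ASR per (level, attack) from (level, attack, complied) records.

    `complied` plays the role of the indicator I[.] in the formula.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, attack, complied in records:
        totals[(level, attack)] += 1
        hits[(level, attack)] += int(bool(complied))
    return {cell: hits[cell] / totals[cell] for cell in totals}

# Toy judgments (invented): two attacks at two harm levels.
records = [
    (1, "simple", True), (1, "simple", True),
    (8, "simple", False), (8, "simple", False),
    (8, "gcg", True), (8, "gcg", False),
]
asr = asr_by_level_and_attack(records)
```

Keeping the result keyed by `(level, attack)` makes it straightforward to lay the values out as a level-by-attack heatmap.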

ASR is assessed using three distinct judging paradigms:

  • Human annotation: Binary marking of "compliance" vs. "refusal".
  • String-matching: Automated detection of keywords or patterns indicative of harmful compliance.
  • LLM judge: A GPT-3.5 model classifies completions as "full compliance," "partial refusal," or "full refusal".

This multi-axis approach exposes both gross and nuanced partial-refusal phenomena not captured by blanket refusal rates (Belkhiter et al., 2024).
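The two automated judges can be illustrated with a small sketch. It assumes one common convention from the jailbreak literature, in which a response counts as compliant when it contains no refusal phrase; the marker list, the function names, and the option for counting partial refusals are all hypothetical and are not taken from the paper.

```python
# Hypothetical refusal markers; a real list would be longer and tuned.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

def string_match_complies(response, markers=REFUSAL_MARKERS):
    """Return True when no refusal marker appears in the response."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return not any(marker in text for marker in markers)

def to_binary(label, count_partial=True):
    """Collapse the three-way LLM-judge label into a binary compliance flag."""
    if label == "full compliance":
        return True
    if label == "partial refusal":
        return count_partial  # assumption: the caller decides how to count these
    if label == "full refusal":
        return False
    raise ValueError(f"unknown label: {label!r}")
```

Reporting ASR under both settings of `count_partial` gives a lower and an upper bound, which is one way to surface the partial-refusal cases that a single binary rate hides.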

4. Jailbreaking Attack Framework

Seven distinct jailbreak strategies of increasing sophistication are evaluated:

  • Simple Query: Direct prompt submission.
  • Ignorance Context: Addition of misleading, innocuous material.
  • Role-Play Context: Use of meta-prompts to induce agent identity dissociation.
  • PAP: Persuasion-based attack patterns (Authority, Logical Appeal, Misrepresentation).
  • PAIR: Iterative multi-step attacker-judge exchanges (Mixtral 8×7B attacker, GPT-3.5 judge).
  • AutoDAN: Automated stealthy jailbreak generation.
  • GCG: Universal adversarial trigger with transfer-oriented optimization.

Transferred attacks are generated on uncompressed (float) models and evaluated on quantized analogs to assess compression-mediated robustness. Attacks span the full suite of harm levels, capturing differential success by both prompt complexity and harm severity (Belkhiter et al., 2024).
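The transfer protocol can be sketched as a loop in which the attack is crafted against one model and scored on another. Everything below is a toy stand-in under stated assumptions: the models, `craft_attack`, and `judge` are hypothetical callables with placeholder strings, not real attack code or real model behavior.

```python
def transfer_asr(prompts, craft_attack, source_model, target_model, judge):
    """Fraction of prompts whose attack, crafted on source_model, succeeds on target_model."""
    if not prompts:
        raise ValueError("prompts must be non-empty")
    successes = 0
    for prompt in prompts:
        adversarial = craft_attack(prompt, source_model)  # optimized against the source
        response = target_model(adversarial)              # scored on the target
        successes += int(judge(response))
    return successes / len(prompts)

# Toy stand-ins (invented behavior, for illustration only).
def float_model(text):
    return "COMPLY" if "[suffix]" in text else "REFUSE"

def quantized_model(text):
    return "COMPLY" if "[suffix]" in text and "placeholder-1" in text else "REFUSE"

def craft_attack(prompt, model):
    # A real attack would query `model`; this stub only appends a marker.
    return prompt + " [suffix]"

def judge(response):
    return response == "COMPLY"

prompts = ["placeholder-1", "placeholder-2"]
direct = transfer_asr(prompts, craft_attack, float_model, float_model, judge)
transferred = transfer_asr(prompts, craft_attack, float_model, quantized_model, judge)
```

Passing the same model as source and target gives the direct setting, so one function covers both the direct and transferred columns of a results table.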

5. Quantization Strategies and Their Evaluation

HarmLevelBench scrutinizes the effect of two 4-bit quantization algorithms on Vicuna 13B v1.5:

  • AWQ (Activation-aware Weight Quantization): Uses activation statistics to identify salient weight channels and rescales them before quantization to reduce quantization error.
  • GPTQ (Generative Pre-trained Transformer Quantization): Uses approximate second-order (Hessian) information to quantize weights layer by layer while compensating for the resulting error.

Quantization is hypothesized to induce subtle decision-boundary changes and non-monotonic safety effects, sometimes exacerbating direct vulnerability while enhancing transfer robustness (Belkhiter et al., 2024).
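To make "4-bit" concrete, the sketch below applies plain round-to-nearest uniform quantization to a handful of weights. This is deliberately neither AWQ nor GPTQ; it is the naive baseline that both methods improve on, shown only to illustrate how 16 levels perturb weights. The function names and the sample weights are invented.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in 0..15 using a min/max uniform grid."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # 15 steps between 16 levels; avoid zero scale
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]

weights = [-0.8, -0.1, 0.0, 0.35, 1.2]  # invented sample values
codes, scale, lo = quantize_4bit(weights)
restored = dequantize(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each restored weight differs from the original by at most half a quantization step. Small perturbations of this kind, accumulated over many weights, are the sort of change that could shift decision boundaries in the way the hypothesis above describes.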

6. Experimental Results and Key Patterns

Key observed results and phenomena:

  • Direct Attack Vulnerability: High-complexity jailbreaks (e.g., PAIR, AutoDAN, GCG) show high ASR even at catastrophic harm levels, while simple attacks see ASR decrease as harm severity increases.
    • Example: For AutoDAN, human-judged ASR on Vicuna 13B is 100% (direct, float model); for GCG, 92.9%.
  • Quantization Trade-offs: AWQ increased ASR for simple queries (+10.7 points over float) but reduced transfer ASR for sophisticated attacks by ~29–50 points.
  • Transfer Robustness: Attacks crafted on float models lose potency against quantized models, especially for GCG and AutoDAN in transferred scenarios (e.g., GCG on AWQ: 42.9%, down 50 points vs. direct on float).
  • ASR Heatmaps: Steep ASR drop-off for simple queries at higher harm levels contrasts with high, nearly constant ASR across harm levels for sophisticated jailbreaks. Quantization smooths or inverts some of these trends (Belkhiter et al., 2024).
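The reported percentages can be cross-checked with simple arithmetic. The sketch assumes that each quoted ASR is computed over all 56 prompts, which is an assumption here because the per-figure denominators are not restated above. Under that assumption the figures correspond to whole prompt counts.

```python
N_PROMPTS = 56  # 7 topics x 8 levels

def pct(successes, total=N_PROMPTS):
    """Convert a success count to a percentage rounded to one decimal."""
    return round(100 * successes / total, 1)

gcg_direct = pct(52)        # matches the reported 92.9%
gcg_on_awq = pct(24)        # matches the reported 42.9%
drop_points = pct(52 - 24)  # 28 prompts, matching the reported 50-point drop
simple_gain = pct(6)        # 6 prompts, matching the reported +10.7 points
one_prompt = pct(1)         # granularity: a single prompt shifts ASR by this much
```

With only 56 prompts, a single changed judgment moves an overall ASR by roughly 1.8 points, which is worth keeping in mind when comparing small differences between configurations.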

7. Conclusions, Recommendations, and Benchmark Implications

HarmLevelBench demonstrates that proportional, harm-stratified benchmarking exposes vulnerabilities undetected by binary frameworks. Key conclusions:

  • Harm-level scales reveal that low-complexity jailbreaks may only succeed at mild harm levels, while advanced attacks can penetrate safety barriers even at the most extreme severities.
  • Quantization introduces complex, non-uniform safety–robustness trade-offs: increased vulnerability to direct, low-skill attacks, but reduced efficacy of transfer attacks.
  • The dataset size and scope (56 prompts) are argued to be within the standard range for existing jailbreak and harmfulness benchmarks.
  • Recommendations include expanding topic coverage, evaluating defense mechanisms under stratified harm severity, and combining human annotation with LLM-based evaluation for nuanced compliance detection.

These results provide actionable guidance for safe deployment and alignment of compressed LLMs, emphasizing the need for stratified, multi-axis harm evaluation for both research and practical alignment assessment (Belkhiter et al., 2024).
