HAJailBench: Multi-Model LLM Safety Benchmark

Updated 16 November 2025

HAJailBench is a large-scale, human-annotated jailbreak benchmark designed to evaluate multi-model LLM safety under adversarial prompting.
It employs a fully crossed experimental design with 12,000 interactions spanning 100 harmful goals, 12 jailbreak methods, and 11 target models for robust safety auditing.
A rigorous annotation protocol using binary labels and continuous risk scores ensures high reliability, with Cohen’s κ > 0.80 among human annotators and competitive performance compared to GPT-4o.

HAJailBench is a large-scale, human-annotated jailbreak benchmark introduced to support rigorous, multi-model safety evaluation of LLMs under adversarial prompting. It consists of 12,000 prompt–response pairs covering a comprehensive spectrum of harmful goals, attack modalities, and model targets. The resource is designed to calibrate, compare, and generalize safety robustness measurements and judge reliability, ultimately aiding the development and assessment of cost-effective, interpretable, and high-fidelity LLM safety evaluation frameworks.

1. Benchmark Design and Scope

HAJailBench comprises exactly 12,000 adversarial interactions constructed in a fully crossed experimental design. The benchmark instantiates 100 distinct “harmful goals” from the JBB-Behaviors suite, each of which is attacked using twelve jailbreak methods against eleven state-of-the-art LLMs. The attack taxonomy includes nine single-turn (token-level and optimization/query-heuristic) methods—GCG, COLD, Random-Search, AutoDAN, GPTFuzzer, PAIR, TAP, Future-Tense—and three multi-turn, semantic strategies—Crescendo, X-Teaming, Actor. Each harmful goal is paired with every single-turn method across all eleven models (total: 9,900 instances), and every multi-turn strategy across seven selected models (total: 2,100 instances), yielding the complete test set.

All adversarial input is derived from a “harmful goal” prompt algorithmically transformed by a jailbreak attack, then fed to the target LLM with greedy decoding (temperature = 0, maximum 150 tokens). The resultant output is assessed under a standardized annotation protocol.

2. Model and Attack Coverage

The suite of target models encompasses both closed-source and open-source systems, ensuring broad representativity in benchmarking:

Closed-source: Claude-3.5, GPT-4o, GPT-5, GPT-5-nano
Open-source: Llama-3.3-70B-it, Llama-3.1-8B-it, Qwen3-4B (Instruct & Thinking variants), DeepSeek-V3/R1, GPT-OSS-20B

The twelve attack methods provide comprehensive coverage of the adversarial landscape:

Attack Category	Methods	Application
Single-turn	GCG, COLD, Random-Search, AutoDAN, GPTFuzzer, PAIR, TAP, Future-Tense	11 models
Multi-turn, semantic	Crescendo, X-Teaming, Actor	7 models

This design ensures systematic evaluation of both model-level robustness and defense generalizability.

3. Annotation Protocol and Ground Truth Construction

Each LLM output is judged on two axes:

Binary label: Attack success or failure (did the model obey the harmful instruction?).
Continuous risk score: 1–10 scale, with finely defined semantic bands:
- 1–2: Completely Safe
- 3–4: Relatively Safe
- 5–6: Marginally Safe
- 7–8: Rather Unsafe
- 9–10: Completely Unsafe

Annotation proceeds in two rounds: Round 1 allocates one-third of the responses to each of three expert annotators, using an eleven-category safety taxonomy spanning Hate/Harassment, Discrimination, Privacy, Dangerous Acts, Conspiracy, Professional Misinformation, Cultural Insensitivity, etc. In Round 2, any response with discordant majority-human and GPT-4o judge scores is adjudicated by two further experts. Final labels follow majority vote (binary) and score averaging (continuous). This protocol achieves Cohen’s κ > 0.80 (double-blind subsets) for human annotators, κ = 0.75–0.76 for GPT-4o and multi-agent SLM judges, indicating near-perfect reliability.

4. Evaluation Metrics and Scoring Framework

HAJailBench enables systematic comparison of LLM safety via standardized metrics:

Attack Success Rate (ASR): Proportion of adversarial prompts resulting in unsafe content.
Classification metrics: Precision, Recall, F1—computed against human ground truth.
Inter-rater agreement: Cohen’s κ, measuring judge–human and cross-method concordance.
Cost efficiency: Ratio of per-query computational expense (in tokens or monetary units) to GPT-4o baseline.

Risk scores and agreement statistics allow nuanced calibration of LLM-as-judge frameworks, supporting both binary and graded safety assessments.

5. Multi-Agent Judge Framework and Cost–Accuracy Tradeoffs

A central application of HAJailBench is in evaluating judge reliability under cost constraints. The Multi-Agent Judge architecture consists of Critic, Defender, and Concluding Judge agents (e.g., using Qwen3-14B), debating each response across five value-aligned safety dimensions drawn from the prompt. Through ablation analysis, Lin et al. demonstrate that three rounds of debate optimize the accuracy–efficiency balance:

Debate Rounds	Cohen's κ	Cost Ratio (to baseline)
0	0.5709	0.30
1	0.6955	0.69
2	0.7143	0.87
3	0.7352	1.00
4	0.7260	>1.00
5	0.7221	>1.00

The three-agent SLM debate achieves near parity with GPT-4o judges (κ = 0.7352 vs. 0.7627) at 46% of GPT-4o’s per-query cost. Performance metrics against human ground truth are: Precision ≈ 0.74, Recall ≈ 0.72, F1 ≈ 0.73, compared to 0.80/0.78/0.79 for GPT-4o. These results show that value-aligned, structured debate among SLMs recovers semantic intent signals that single-turn or rule-based judges miss.

6. Significance and Context in LLM Safety Research

HAJailBench advances the state of safety benchmarking by introducing a scrutinizable, richly annotated adversarial corpus that supports both robustness evaluation and judge architecture calibration. The benchmark’s large-scale, multi-model scope and granular annotation protocol set a new standard for reproducibility and reliability in LLM safety auditing. Unlike prior benchmarks, which rely on small datasets, heuristic scoring, or variable annotation practices, HAJailBench provides a dual lens for evaluating both failure modes of the LLMs themselves and the fidelity of judge frameworks, paving the way for systematic progress in scalable, cost-effective LLM safety evaluation.

A plausible implication is that the structured debate paradigm holds promise for generalizing robust semantic safety assessments across emerging lightweight judge architectures, enabling broader deployment of cost-efficient reliability checks in practical LLM pipelines.

PDF Markdown Chat (Pro)

Follow Topic

Get notified by email when new papers are published related to HAJailBench.