- The paper introduces bandit algorithms to select optimal jailbreak strategies, reducing query complexity from O(nT) to O(T) and achieving near-optimal success rates.
- It demonstrates that automated prompt enhancement via the FRANKENSTEINBENCH benchmark boosts query sophistication and adversarial performance in both transfer and continual attack scenarios.
- Experimental evaluations on 15 open-weight LLMs reveal significant improvements in attack success rates, underlining the need for adaptive red-teaming defenses.
Jailbreaking for the Average Jane: Bandit Algorithms for Efficient LLM Safety Evasion
Motivation and Research Questions
This paper targets the prevailing challenge in LLM safety: operationalizing a scalable red-teaming process for non-expert adversaries, i.e., the "average Jane". With a proliferation of jailbreak techniques, the central question is not whether jailbreaks exist, but how accessible and effective they are for non-specialist actors seeking to bypass model safety guardrails. The research is formulated through two core questions:
- RQ1: Can an attacker efficiently learn an optimal policy for selecting jailbreaks tailored to their use-case?
- RQ2: Can malicious queries be automatically enhanced to systematically elicit actionable, harmful responses?
Methodology
Bandit-based Jailbreak Selection
Exhaustively evaluating every jailbreak is impractical: trying each of n jailbreaks on each of T queries costs nT probes, which is expensive and invites detection. The paper reframes jailbreak selection as an online learning problem solved with multi-armed bandit algorithms: EXP3, Thompson Sampling, LinUCB, LinearCB, and SquareCB. With only O(T) probes, the attacker learns a distribution that concentrates on the best jailbreaks, minimizing regret against the best-in-hindsight policy. This approach supports continual adaptation and outperforms naive uniform sampling and brute-force strategies in both theoretical guarantees and empirical attack success rate (ASR).
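To make the O(T) probing budget concrete, here is a minimal sketch of Beta-Bernoulli Thompson Sampling, one of the algorithms named above, run on purely simulated arms. The function name `thompson_sampling`, the `success_probs` list, and the Bernoulli reward model are illustrative assumptions and are not taken from the paper; no model is queried.

```python
import random

def thompson_sampling(success_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling on simulated arms.

    success_probs: hypothetical per-arm success rates, unknown to the learner.
    horizon: number of rounds T; exactly one arm is probed per round.
    Returns (pull counts per arm, total reward).
    """
    rng = random.Random(seed)
    n = len(success_probs)
    alpha = [1.0] * n  # 1 + observed successes (uniform Beta(1, 1) prior)
    beta = [1.0] * n   # 1 + observed failures
    pulls = [0] * n
    total = 0
    for _ in range(horizon):
        # Draw one plausible success rate per arm from its posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n)]
        arm = max(range(n), key=samples.__getitem__)
        # Simulated binary feedback for the single arm that was probed.
        reward = 1 if rng.random() < success_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
        total += reward
    return pulls, total
```

Each round issues exactly one probe, so T rounds cost T probes regardless of the number of arms, whereas testing every arm in every round would cost nT.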
Two attack scenarios are formalized:
- Transfer Attack: Policy learned in exploration (offline) phase, applied unchanged to exploitation (online) phase.
- Continual Attack: Policy updated during exploitation for joint exploration and exploitation, further optimizing ASR.
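The difference between the two scenarios comes down to whether the policy keeps updating after the exploration phase. The sketch below illustrates this with a simulated Thompson Sampling learner. The function `simulate_two_phase`, its parameters, and the Bernoulli arms are hypothetical stand-ins and do not reproduce the paper's implementation.

```python
import random

def simulate_two_phase(probs, explore_rounds, exploit_rounds, continual, seed=0):
    """Return the success rate measured over the exploitation phase.

    probs: hypothetical per-arm success rates (simulation only).
    continual=False: the posterior is frozen after exploration (transfer).
    continual=True: the posterior keeps updating during exploitation.
    """
    rng = random.Random(seed)
    n = len(probs)
    wins = [1.0] * n
    losses = [1.0] * n

    def pick():
        draws = [rng.betavariate(wins[i], losses[i]) for i in range(n)]
        return max(range(n), key=draws.__getitem__)

    def probe(arm):
        return 1 if rng.random() < probs[arm] else 0

    # Exploration phase: feedback always updates the posterior.
    for _ in range(explore_rounds):
        arm = pick()
        r = probe(arm)
        wins[arm] += r
        losses[arm] += 1 - r

    # Exploitation phase: only the continual variant keeps learning.
    successes = 0
    for _ in range(exploit_rounds):
        arm = pick()
        r = probe(arm)
        successes += r
        if continual:
            wins[arm] += r
            losses[arm] += 1 - r
    return successes / exploit_rounds
```

Calling it twice with the same arguments and `continual` flipped gives a like-for-like comparison of the two scenarios on the same simulated arms.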
Query Enhancement and Benchmarking
Existing safety benchmarks prioritize domain coverage over query sophistication, leading to low-depth prompts that models frequently reject. The paper introduces FRANKENSTEINBENCH, a curated benchmark of 11,279 malicious queries, synthesized from seven safety benchmarks and further enhanced via LLM-driven prompt engineering. Dual-use complexity (technical jargon, procedural specificity) is algorithmically injected into queries, increasing their baseline and jailbroken ASR significantly. Query complexity is adjudicated via an ensemble classifier combining readability metrics and an LLM-as-judge, yielding 89.2% accuracy on held-out validation.
Experimental Evaluation
The attack pipeline is evaluated on 15 open-weight LLMs (270M–120B parameters), testing 70 jailbreaks across six high-stakes domains. Using ~12M query-response pairs, key results include:
- Baseline Harmfulness: Even without jailbreaks, FRANKENSTEINBENCH queries achieve 44% ASR, indicating inherent benchmark potency.
- Bandit Algorithm Superiority: Transfer Attack scenarios show bandit algorithms (e.g., Thompson Sampling) outperforming the best static jailbreaks, achieving ASR up to 49% on safety-aligned models where powerful jailbreaks only yield 6%.
- Complex Query Impact: Complex queries raise baseline ASR to 50% (vs. 39% for simple), and with jailbreaks, boost ASR by up to 26% across models.
- Multiple Passes: Allowing multiple jailbreaks per query (e.g., 5-pass Transfer Attack) enables ASR up to 97%.
- Resilience to Jailbreak Set Quality: Bandit algorithms maintain >50% ASR even when only weak jailbreaks are available, vastly outperforming uniform sampling which drops to 20%.
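The single-pass and multi-pass figures above can be related with a small helper. This summary does not state the paper's exact multi-pass definition, so the sketch assumes that a query counts as a success if any of its first k attempts is judged successful. The names `attack_success_rate` and `asr_at_k` are illustrative.

```python
def attack_success_rate(outcomes):
    """Fraction of queries with at least one successful attempt.

    outcomes: list with one entry per query; each entry is a list of
    booleans, one per attempt (pass), as labeled by some evaluator.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one query")
    return sum(1 for attempts in outcomes if any(attempts)) / len(outcomes)

def asr_at_k(outcomes, k):
    """ASR when only the first k attempts per query are allowed (assumed definition)."""
    return attack_success_rate([attempts[:k] for attempts in outcomes])

# Toy labels for four queries with three attempts each (made-up data).
toy = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
print(asr_at_k(toy, 1), asr_at_k(toy, 2), asr_at_k(toy, 3))  # 0.25 0.5 0.75
```

Under this definition ASR can only stay the same or grow as k increases, which is consistent with the multi-pass result being the highest figure reported.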
Ablation and Transferability Analyses
The robustness of the attack pipeline is interrogated via several ablations:
- Domain Transferability: Exploration on non-target domains yields similar exploitation ASR, so domain-specific exploration is unnecessary.
- Model Transferability: Attack policies derived from proxy models (smaller or larger variants) transfer effectively to target models, with minimal reduction in ASR.
- Exploration Horizon: ASR increases with longer exploration, supporting improved policy learning in both attack scenarios.
- No Exploration Phase: Bandit algorithms suffer in performance without an exploration set; full-information algorithms (RWM) are less affected.
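One way to see why the feedback model matters in the last ablation is to compare the update rules. The sketch below assumes that RWM refers to Randomized Weighted Majority, written here in its exponential-weights (Hedge) form, and contrasts it with the textbook EXP3 update. Function names and parameters are illustrative, and rewards are assumed to lie in [0, 1].

```python
import math

def hedge_update(weights, rewards, eta):
    """Full-information update: the reward of every arm is observed each round."""
    return [w * math.exp(eta * r) for w, r in zip(weights, rewards)]

def exp3_probabilities(weights, gamma):
    """EXP3 sampling distribution: exponential weights mixed with uniform exploration."""
    total = sum(weights)
    n = len(weights)
    return [(1 - gamma) * w / total + gamma / n for w in weights]

def exp3_update(weights, probs, arm, reward, gamma):
    """Bandit update: only the pulled arm's reward is observed.

    The reward is divided by the probability of pulling that arm, which
    keeps the estimate unbiased; all other weights are left unchanged.
    """
    n = len(weights)
    estimate = reward / probs[arm]
    updated = list(weights)
    updated[arm] *= math.exp(gamma * estimate / n)
    return updated
```

The full-information rule adjusts every weight in every round, while the bandit rule learns about one arm at a time, so it plausibly needs more rounds before its distribution becomes informative.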
Discussion, Limitations, and Future Directions
The paper highlights two principal limitations: limited evaluation on proprietary models, owing to access and cost, and a restriction to single-turn attacks in English. Multi-turn jailbreaks and cross-lingual attacks (where LLMs are more vulnerable) are proposed as future extensions, as is incorporating attacker-side costs into the bandit formulation. Notably, context-aware bandit algorithms underperform, suggesting that the current embedding and regression oracle configurations merit further investigation.
The results demonstrate that, even with basic skills and limited resources, a non-expert can mount highly effective, adaptive jailbreak attacks by leveraging bandit algorithms and automatable complex query synthesis. The attack pipeline is efficient, easy to implement, and achieves robust transfer across domains and models.
Implications
Practical Implications
The findings substantiate concerns that the average, non-expert user can operationalize sophisticated red-teaming attacks. Model safety alignment is insufficient when adversarial knowledge is democratized and attack policy can be efficiently learned. For defenders, this necessitates moving beyond static guardrails. Red-teaming and safety evaluation regimes must adopt similar bandit-based adaptive strategies, encompass complex and dual-use queries, and anticipate rapid transferability across domains and models.
Theoretical Implications
From a theoretical perspective, the paper advances the framing of LLM jailbreak selection as an online bandit problem, formalizing regret guarantees and scalable optimization strategies specific to the black-box context. The demonstrated efficacy of bandit algorithms in adversarial LLM interaction prompts further theoretical exploration into regret minimization under evolving jailbreak sets, per-arm cost constraints, and context-aware policy optimization.
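For reference, the regret notion mentioned above is usually written as follows. The notation is the standard one from the bandit literature and is assumed here rather than copied from the paper: there are $n$ arms, $r_{t,i} \in [0,1]$ is the reward of arm $i$ in round $t$, and $a_t$ is the arm chosen in round $t$.

```latex
R_T \;=\; \max_{i \in \{1,\dots,n\}} \sum_{t=1}^{T} r_{t,i}
      \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_{t,a_t}\right],
\qquad
R_T^{\mathrm{EXP3}} \;=\; O\!\left(\sqrt{T\, n \log n}\right).
```

The second expression is the well-known EXP3 guarantee. Because it grows sublinearly in $T$, the average per-round gap to the best fixed arm in hindsight vanishes as $T$ grows.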
Directions for Future Research
Extensions to multi-turn attack settings, multilingual attack scenarios, and defenses incorporating bandit-based adversarial training are crucial future directions. Additionally, investigating how context embeddings and regression oracle design choices affect the performance of contextual bandit algorithms in this operational setting remains an open question.
Conclusion
This work operationalizes LLM jailbreaking for non-expert adversaries, demonstrating that efficient bandit-based online learning paired with automated prompt enhancement enables near-optimal attack success rates across state-of-the-art models. FRANKENSTEINBENCH provides a realistic, high-quality benchmark for safety evaluation, revealing persistent vulnerabilities. The research calls for adaptive, cost-efficient red-teaming protocols and inspires further inquiry into defenses robust to bandit-style adversarial optimization (2606.26936).