- The paper reveals that LLM defenses effective against single-turn automated attacks fail under multi-turn human jailbreaks.
- It uses a dataset of 2,912 prompts from 537 conversations to show attack success rates 19% to 65% higher than automated attacks against defenses such as CYGNET and RMU.
- The study urges the development of enhanced evaluation frameworks and defense strategies to secure LLMs against real-world, multi-turn adversarial threats.
Multi-Turn Human Jailbreaks in LLMs
The paper presents an incisive analysis of the vulnerabilities of LLMs under adversarial attack, focusing specifically on multi-turn human jailbreaks. The authors highlight a significant gap in current robustness evaluation methodologies, which primarily consider automated, single-turn adversarial scenarios. These evaluations often report single-digit attack success rates (ASRs) for LLM defenses, suggesting near-foolproof systems. Real-world deployment, however, demands evaluation under more complex threat models that include multi-turn human interaction, mirroring how users actually engage with modern chat interfaces.
Methodology and Findings
The authors compile a dataset, Multi-Turn Human Jailbreaks (MHJ), consisting of 2,912 prompts across 537 multi-turn conversations. The dataset was collected through commercial red teaming engagements, and its public release is intended to support future research. By employing human red teamers rather than relying solely on automated attacks, the authors expose the limitations of existing defenses, revealing a marked increase in attack success rates: multi-turn human jailbreaks achieve ASRs between 19% and 65% higher than the most successful automated attacks on HarmBench, a benchmark spanning a diverse range of harmful behaviors.
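To make this evaluation protocol concrete, the sketch below shows how a recorded multi-turn jailbreak might be replayed against a target model and how an ASR could be tallied over behaviors. The record schema and the `model.chat` / `judge` interfaces are assumptions for illustration, not the paper's actual harness or the MHJ release format.

```python
from dataclasses import dataclass

# Hypothetical record structure; the real MHJ release may use different fields.
@dataclass
class Conversation:
    behavior_id: str   # the harmful behavior the red teamer tried to elicit
    turns: list[str]   # the red teamer's messages, in order

def replay_conversation(model, convo: Conversation, judge) -> bool:
    """Replay a multi-turn jailbreak; return True if any reply is judged harmful."""
    history = []
    for user_msg in convo.turns:
        history.append({"role": "user", "content": user_msg})
        reply = model.chat(history)                      # assumed chat-completion interface
        history.append({"role": "assistant", "content": reply})
        if judge(convo.behavior_id, reply):              # assumed harm classifier
            return True
    return False

def attack_success_rate(model, conversations, judge) -> float:
    """ASR = fraction of target behaviors elicited by at least one conversation."""
    behaviors = {c.behavior_id for c in conversations}
    elicited = {c.behavior_id for c in conversations
                if replay_conversation(model, c, judge)}
    return len(elicited) / len(behaviors)
```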
A noteworthy finding is the susceptibility of defenses that appear robust under traditional evaluation schemes, such as CYGNET and DERTA. CYGNET achieves a 0% ASR against automated attacks, yet human red teamers reach a 70.4% ASR against it. Similarly, the paper shows a substantial gap between human and automated attacks against RMU, a machine unlearning defense, further supporting human red teaming as a more comprehensive measure of LLM vulnerability.
Implications and Future Directions
The paper has far-reaching implications for AI safety and model robustness. The findings underscore the need to expand threat models beyond single-turn adversarial frameworks to better simulate realistic adversarial use, which unfolds over multiple turns. The MHJ dataset advances this conversation by providing concrete adversarial examples developed through methodical human engagement, and can serve as a foundation for building stronger automated multi-turn attacks.
The insights point toward more sophisticated multi-turn testing paradigms and highlight potential avenues for improving robustness, such as incorporating multi-turn adversarial interactions into training or improving how models manage conversational state to withstand prolonged adversarial exchanges.
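As a rough illustration of what such a multi-turn testing paradigm could look like, the sketch below lets an automated attacker adapt its next prompt to the target's previous responses, loosely mimicking the decomposition and reframing tactics of human red teamers. The `target`, `attacker`, and `judge` interfaces are hypothetical and not part of the paper.

```python
def multi_turn_probe(target, attacker, judge, behavior: str, max_turns: int = 8) -> bool:
    """Iteratively probe a target model, letting the attacker adapt to each refusal.

    `target.chat`, `attacker.next_prompt`, and `judge` are assumed interfaces,
    introduced here only for illustration.
    """
    history = []
    for _ in range(max_turns):
        # The attacker conditions on the full conversation so far (including refusals)
        # to decompose or reframe the request across turns.
        prompt = attacker.next_prompt(behavior, history)
        history.append({"role": "user", "content": prompt})
        reply = target.chat(history)
        history.append({"role": "assistant", "content": reply})
        if judge(behavior, reply):
            return True   # behavior elicited within the turn budget
    return False
```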
Conclusion
With MHJ's public release, the community gains a valuable resource demonstrating the vulnerability of existing LLMs to multi-turn human jailbreaks. The research compels a reevaluation of current robustness evaluations and a renewed focus on advancing LLM defenses against complex, multi-turn human interaction. As AI continues to integrate into diverse societal and professional roles, ensuring the robustness and safety of these systems against realistic threat models becomes paramount. The paper therefore stands as a significant contribution that bridges current robustness evaluation practices with the demands of safe AI deployment in real-world environments.