Insights into Improved Few-Shot Jailbreaking for Aligned LLMs
The paper "Improved Few-Shot Jailbreaking Can Circumvent Aligned LLMs and Their Defenses" by Zheng et al. addresses the critical challenge of jailbreaking LLMs, especially those that are safety-aligned. This work explores the refinement of few-shot demonstrations to effectively jailbreak state-of-the-art LLMs, posing a significant question regarding their robustness against adversarial attacks.
Recent work has shown that many-shot demonstrations can jailbreak LLMs by exploiting their long-context capabilities. This paper argues that few-shot demonstrations can achieve similar outcomes even within limited context windows. The authors introduce techniques such as injecting special system tokens into the demonstrations and performing demo-level random search over a curated demo pool, yielding remarkably high attack success rates (ASRs) on models such as Llama-2-7B and Llama-3-8B, even when strong safety defenses are in place.
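The demo-level random search can be pictured as a simple greedy loop: repeatedly swap one demonstration in the prompt for another drawn from the pool, and keep the swap only if it better steers the target model toward compliance. The following is a minimal sketch under stated assumptions, not the authors' released code; the helpers `build_prompt` and `target_loss`, and the default budget values, are illustrative placeholders.

```python
import random

def demo_level_random_search(demo_pool, target_request, build_prompt, target_loss,
                             num_demos=8, num_steps=128):
    """Greedy demo-level random search (illustrative sketch, not the authors' code).

    demo_pool:    list of candidate few-shot demos (request/response pairs)
    build_prompt: callable that formats the demos -- including any injected
                  special system tokens -- followed by the target request
    target_loss:  callable scoring how strongly the target model is steered
                  toward a compliant response (lower is better)
    """
    demos = random.sample(demo_pool, num_demos)
    best = target_loss(build_prompt(demos, target_request))

    for _ in range(num_steps):
        candidate = list(demos)
        # Swap one randomly chosen demo slot for a random demo from the pool.
        candidate[random.randrange(num_demos)] = random.choice(demo_pool)
        loss = target_loss(build_prompt(candidate, target_request))
        if loss < best:  # keep the swap only if it improves the objective
            demos, best = candidate, loss

    return demos, best
```

Because the search operates at the level of whole demonstrations rather than individual tokens, each candidate prompt remains fluent natural language, which matters for the defense-evasion point discussed next.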
Key findings include ASRs exceeding 80%, and in many cases 95%, across various models and defenses. Notably, these results are achieved without repeated attempts, in stark contrast to previous methods that require multiple iterations. The implications are substantial, pointing to possible vulnerabilities in current alignment strategies and motivating the development of more resilient safety measures.
The methodology hinges on using LLMs to generate adversarial yet semantically coherent requests, sidestepping traditional defenses such as perplexity filters and input preprocessing. While aligning models with human values through instruction fine-tuning and reinforcement learning from human feedback (RLHF) is now standard practice, the paper underscores the need for methods that can withstand sophisticated jailbreak strategies like those proposed here.
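To see why fluent prompts matter, consider how a perplexity filter works: it scores the input under a reference language model and rejects anything whose perplexity exceeds a threshold, which catches optimized gibberish suffixes but not coherent few-shot demonstrations. The sketch below is an assumed, minimal version of such a filter (GPT-2 as the scoring model and the threshold value are illustrative choices, not taken from the paper).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative perplexity filter; scoring model and threshold are assumptions.
_tok = GPT2TokenizerFast.from_pretrained("gpt2")
_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = _tok(text, return_tensors="pt").input_ids
    loss = _lm(ids, labels=ids).loss  # mean next-token cross-entropy
    return torch.exp(loss).item()

def passes_filter(prompt: str, threshold: float = 1000.0) -> bool:
    # Gibberish adversarial suffixes tend to score far above the threshold,
    # while fluent few-shot demonstrations typically stay well below it.
    return perplexity(prompt) < threshold
```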
Moreover, the authors conduct comprehensive evaluations across different LLMs and defenses, making a strong case for the effectiveness of their approach. The paper suggests that while many defenses currently in place, including instruction fine-tuning, input detection, and perturbation-based methods, can mitigate certain attacks, they fall short against the newly proposed few-shot jailbreaking technique.
Future research directions might include refining model training paradigms to inherently resist jailbreak attempts, increasing transparency in LLM architectures to facilitate better defense mechanisms, and expanding the scope of defense strategies. Continued investigation into LLM-assisted generation of adversarial inputs will also be critical for preemptively countering evolving threat vectors.
Through this work, Zheng et al. contribute to the broader discourse on AI alignment and safety, emphasizing the delicate balance between advancing model capabilities and ensuring their alignment with user safety and ethical standards.