
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

Published 3 Jun 2024 in cs.CL, cs.AI, cs.CR, and cs.LG | arXiv:2406.01288v2

Abstract: Recently, Anil et al. (2024) show that many-shot (up to hundreds of) demonstrations can jailbreak state-of-the-art LLMs by exploiting their long-context capability. Nevertheless, is it possible to use few-shot demonstrations to efficiently jailbreak LLMs within limited context sizes? While vanilla few-shot jailbreaking may be inefficient, we propose improved techniques such as injecting special system tokens like [/INST] and employing demo-level random search from a collected demo pool. These simple techniques result in surprisingly effective jailbreaking against aligned LLMs (even with advanced defenses). For example, our method achieves >80% (mostly >95%) ASRs on Llama-2-7B and Llama-3-8B without multiple restarts, even if the models are enhanced by strong defenses such as perplexity detection and/or SmoothLLM, which is challenging for suffix-based jailbreaking. In addition, we conduct comprehensive and elaborate (e.g., making sure to use correct system prompts) evaluations against other aligned LLMs and advanced defenses, where our method consistently achieves nearly 100% ASRs. Our code is available at https://github.com/sail-sg/I-FSJ.

Citations (17)

Summary

  • The paper introduces novel few-shot techniques that achieve over 80% attack success rates on models like Llama-2 and Llama-3 using targeted token injections and demo-level random search.
  • The method leverages LLM-generated, semantically coherent adversarial demonstrations to circumvent conventional safety mechanisms in language models.
  • The paper’s findings reveal critical vulnerabilities in current alignment strategies, urging future research to develop more resilient safety measures.

Insights into Improved Few-Shot Jailbreaking for Aligned LLMs

The paper "Improved Few-Shot Jailbreaking Can Circumvent Aligned LLMs and Their Defenses" by Zheng et al. addresses the critical challenge of jailbreaking LLMs, especially those that are safety-aligned. This work explores the refinement of few-shot demonstrations to effectively jailbreak state-of-the-art LLMs, posing a significant question regarding their robustness against adversarial attacks.

Prior work showed that many-shot demonstrations can jailbreak LLMs by exploiting their long-context capabilities. This paper asks whether few-shot demonstrations can achieve similar outcomes within limited context sizes. The authors introduce two simple techniques, injecting special system tokens such as [/INST] into the demonstrations and running a demo-level random search over a collected pool of harmful demos, which yield remarkably high attack success rates (ASRs) on models like Llama-2-7B and Llama-3-8B even when strong safety defenses are in place.
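To make the two techniques concrete, here is a minimal Python sketch of demo-level random search combined with special-token injection. It is an illustration, not the paper's implementation (which lives in the linked repository): the demo pool contents, the prompt template, and the scoring function `score_fn` (for instance, the target model's log-likelihood of an affirmative response) are assumptions supplied by the caller.

```python
import random

def format_prompt(demos, target_request, special_token="[/INST]"):
    """Assemble a few-shot jailbreak prompt, injecting a special system token
    between each demo's request and response (the token-injection trick).
    `demos` is a list of (harmful_request, harmful_response) string pairs."""
    parts = [f"{req} {special_token} {resp}" for req, resp in demos]
    parts.append(target_request)
    return "\n\n".join(parts)

def demo_level_random_search(demo_pool, target_request, score_fn,
                             num_demos=8, steps=128):
    """Start from a random subset of demos and greedily keep single-demo swaps
    that improve `score_fn`, a placeholder for the attacker's objective
    (e.g., the target model's likelihood of an affirmative reply)."""
    current = random.sample(demo_pool, num_demos)
    best = score_fn(format_prompt(current, target_request))

    for _ in range(steps):
        candidate = list(current)
        # Swap one randomly chosen demo slot for a random demo from the pool.
        candidate[random.randrange(num_demos)] = random.choice(demo_pool)
        score = score_fn(format_prompt(candidate, target_request))
        if score > best:  # accept the swap only if the objective improves
            current, best = candidate, score

    return format_prompt(current, target_request), best
```

Because the search operates at the level of whole demonstrations rather than individual tokens, each candidate prompt remains fluent natural language, which is what lets it slip past the defenses discussed below.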

Key findings include ASRs above 80%, and in many cases above 95%, across various models and defenses. Notably, these results are obtained without multiple restarts, in contrast to suffix-based attacks that typically need repeated optimization runs. The implications are substantial: current alignment strategies leave exploitable gaps, and more resilient safety measures are needed.

The methodology relies on LLM-generated demonstrations that are adversarial yet semantically coherent, so the resulting prompts sidestep traditional defenses such as perplexity filters and input preprocessing. While aligning models with human values through instruction fine-tuning and reinforcement learning from human feedback (RLHF) is now standard practice, this study underscores the need for defenses that can withstand sophisticated jailbreak strategies like the one proposed.
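As a reference point for why fluency matters, the snippet below sketches the kind of check a perplexity filter performs; the reference model (GPT-2) and any blocking threshold are illustrative assumptions, not values taken from the paper. Coherent few-shot demonstrations keep perplexity low, whereas the gibberish suffixes produced by optimization-based attacks tend to score high and get flagged.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_perplexity(prompt: str, model_name: str = "gpt2") -> float:
    """Perplexity of `prompt` under a small reference LM. A perplexity
    filter blocks inputs whose score exceeds some chosen threshold."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token negative log-likelihood
    return torch.exp(loss).item()

# Illustrative usage; the threshold below is an arbitrary assumption.
# blocked = prompt_perplexity(candidate_prompt) > 1000.0
```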

Moreover, the authors perform comprehensive evaluations across different LLMs and defenses, making a strong case for the effectiveness of their approach. The paper shows that many deployed defenses, including instruction fine-tuning, input detection, and perturbation-based methods such as SmoothLLM, can mitigate certain attacks yet remain largely ineffective against the proposed few-shot jailbreaking technique.
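For readers unfamiliar with the perturbation-based defenses mentioned above, the following is a rough sketch of a SmoothLLM-style wrapper, not the defense's reference implementation: it randomly perturbs several copies of the input, queries the model on each, and responds according to a majority vote. `generate_fn` and `is_refusal_fn` are hypothetical stand-ins for the protected model and a jailbreak judge.

```python
import random
import string

def smoothllm_style_defense(prompt, generate_fn, is_refusal_fn,
                            num_copies=6, perturb_frac=0.1):
    """Perturb several copies of the prompt, query the model on each, and
    answer only if a majority of the perturbed copies look safe.
    `generate_fn(prompt) -> str` and `is_refusal_fn(response) -> bool`
    are placeholders for the protected model and a jailbreak judge."""
    def perturb(text):
        chars = list(text)
        k = max(1, int(len(chars) * perturb_frac))
        for i in random.sample(range(len(chars)), k):
            chars[i] = random.choice(string.printable)  # character-level swap
        return "".join(chars)

    responses = [generate_fn(perturb(prompt)) for _ in range(num_copies)]
    refusals = sum(is_refusal_fn(r) for r in responses)
    if refusals * 2 >= num_copies:  # majority of copies refused: block
        return "Sorry, I can't help with that."
    return next(r for r in responses if not is_refusal_fn(r))
```

Per the abstract, the proposed few-shot prompts remain effective even under this kind of randomized smoothing, whereas brittle suffix-based attacks are typically broken by it.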

Future research directions might include refining training paradigms so that models inherently resist jailbreak attempts, increasing transparency in LLM architectures to enable better defense mechanisms, and broadening the scope of defense strategies. Continued investigation into LLM-assisted generation of adversarial content will also be important for preemptively countering evolving threats.

Through this work, Zheng et al. contribute to the broader discourse on AI alignment and safety, emphasizing the delicate balance between advancing model capabilities and ensuring their alignment with user safety and ethical standards.
