Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (2310.06387v3)

Published 10 Oct 2023 in cs.LG, cs.AI, cs.CL, and cs.CR

Abstract: LLMs have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns. In this paper, we delve into the potential of In-Context Learning (ICL) to modulate the alignment of LLMs. Specifically, we propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses. We offer theoretical insights to elucidate how a limited set of in-context demonstrations can pivotally influence the safety alignment of LLMs. Through extensive experiments, we demonstrate the efficacy of ICA and ICD in respectively elevating and mitigating the success rates of jailbreaking prompts. Our findings illuminate the profound influence of ICL on LLM behavior, opening new avenues for improving the safety of LLMs.

Exploration of In-Context Demonstrations for Safety Alignment in LLMs

The paper "Jailbreak and Guard Aligned LLMs with Only Few In-Context Demonstrations" presents significant findings on the manipulation of LLMs' (LLMs) behavior through the implementation of in-context learning (ICL). The core focus lies on the development of methodologies termed as In-Context Attack (ICA) and In-Context Defense (ICD), which respectively increase and mitigate the success rates of jailbreak attacks on aligned LLMs. This approach deals with adjusting the models' alignment toward or against generating harmful content by leveraging carefully crafted demonstrations.

Methodologies

In-Context Attack (ICA): The paper introduces ICA as a new avenue for exploiting vulnerabilities in LLM safety measures. The approach utilizes a few targeted demonstrations that encourage a model to produce harmful content. This is achieved by using harmful input-output pairs in the prompt to influence the model's behavior without needing to alter the model parameters. The results showed that inserting these adversarial demonstrations could remarkably increase the model's attack success rates. For instance, the paper reported an elevation of attack success rate (ASR) from 3% to 64% for the Vicuna model under perplexity filtering defenses.

In-Context Defense (ICD): Parallel to ICA, the authors propose ICD as a strategy to guard LLMs against jailbreak attempts. This is accomplished by embedding safe demonstrations that display refusals to produce harmful outputs, into the model's prompt. By carefully curating demonstrations that consistently reject harmful content, ICD significantly reduces ASR in various attack scenarios. For instance, with only two safe demonstrations, ASR decreased from 90% to 6% against the GCG attack on the Vicuna model.

Theoretical Insights

Beyond empirical validations, the authors provide a theoretical framework to elucidate the effect of these adversarial demonstrations. They propose that adversarial demonstrations can induce a shift in the generation distribution of LLMs, biasing responses toward either harmful or safe language distributions. Theoretical analysis suggests that this pivotal influence can be achieved with only a few demonstrations.

Implications and Future Directions

The potent influence of a minimal number of in-context demonstrations highlights both a risk and an opportunity. It uncovers latent vulnerabilities in aligned LLMs, which can be exploited maliciously but also suggests pathways for strengthening model resilience through improved defense mechanisms like ICD. This dual finding underscores the necessity for robust, fine-grained controls in LLM deployment, especially for applications requiring high reliability and safety standards.

Future developments could focus on optimizing the selection and structuring of demonstrations to further enhance both attack detection and defense capabilities. Additionally, researchers could explore the interplay of in-context demonstrations with other aspects of model training and fine-tuning strategies to see if combined approaches yield more robust alignment against adversarial manipulations.

By pioneering this exploration into the modality of ICL for safety alignment, the paper lays foundational work for further advancements in the development and deployment of safer LLMs, opening avenues for AI systems that can dynamically adapt to and counter emergent threats in their interactions.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (35)
  1. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
  2. Constitutional ai: Harmlessness from ai feedback, 2022.
  3. Bad characters: Imperceptible nlp attacks, 2021.
  4. Language models are few-shot learners, 2020.
  5. Adversarial examples are not easily detected: Bypassing ten detection methods, 2017a.
  6. Towards evaluating the robustness of neural networks, 2017b.
  7. Explore, establish, exploit: Red teaming language models from scratch, 2023.
  8. A survey on in-context learning, 2023.
  9. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, 2022.
  10. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  11. On the (statistical) detection of adversarial examples, 2017.
  12. Gradient-based adversarial attacks against text transformers, 2021.
  13. Instruction induction: From few examples to natural language task descriptions, 2022.
  14. Baseline defenses for adversarial attacks against aligned language models, 2023.
  15. Is bert really robust? a strong baseline for natural language attack on text classification and entailment, 2020.
  16. Automatically auditing large language models via discrete optimization, 2023.
  17. Pretraining language models with human preferences, 2023.
  18. Certifying llm safety against adversarial prompting, 2023.
  19. Rain: Your language models can align themselves without finetuning, 2023.
  20. Are emergent abilities in large language models just in-context learning?, 2023.
  21. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  22. Noisy channel language model prompting for few-shot text classification. In ACL, 2022.
  23. Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460, 2022.
  24. Training language models to follow instructions with human feedback, 2022.
  25. Red teaming language models with language models, 2022.
  26. Llama 2: Open foundation and fine-tuned chat models, 2023.
  27. Transformers learn in-context by gradient descent. In ICML, 2023.
  28. Adversarial demonstration attacks on large language models, 2023a.
  29. Large language models are implicitly topic models: Explaining and finding good demonstrations for in-context learning. In Workshop on Efficient Systems for Foundation Models, 2023b.
  30. Self-instruct: Aligning language models with self-generated instructions, 2023c.
  31. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery, 2023.
  32. $k$NN prompting: Beyond-context learning with calibration-free nearest neighbor inference. In ICLR, 2023.
  33. Active example selection for in-context learning, 2022.
  34. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  35. Universal and transferable adversarial attacks on aligned language models, 2023.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (5)
  1. Zeming Wei (24 papers)
  2. Yifei Wang (141 papers)
  3. Yisen Wang (120 papers)
  4. Ang Li (472 papers)
  5. Yichuan Mo (7 papers)
Citations (176)
Reddit Logo Streamline Icon: https://streamlinehq.com