Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (2310.06387v3)

Published 10 Oct 2023 in cs.LG, cs.AI, cs.CL, and cs.CR

Abstract: LLMs have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns. In this paper, we delve into the potential of In-Context Learning (ICL) to modulate the alignment of LLMs. Specifically, we propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses. We offer theoretical insights to elucidate how a limited set of in-context demonstrations can pivotally influence the safety alignment of LLMs. Through extensive experiments, we demonstrate the efficacy of ICA and ICD in respectively elevating and mitigating the success rates of jailbreaking prompts. Our findings illuminate the profound influence of ICL on LLM behavior, opening new avenues for improving the safety of LLMs.

PDF HTML Abstract

Exploration of In-Context Demonstrations for Safety Alignment in LLMs

The paper "Jailbreak and Guard Aligned LLMs with Only Few In-Context Demonstrations" presents significant findings on the manipulation of LLMs' (LLMs) behavior through the implementation of in-context learning (ICL). The core focus lies on the development of methodologies termed as In-Context Attack (ICA) and In-Context Defense (ICD), which respectively increase and mitigate the success rates of jailbreak attacks on aligned LLMs. This approach deals with adjusting the models' alignment toward or against generating harmful content by leveraging carefully crafted demonstrations.

Methodologies

In-Context Attack (ICA): The paper introduces ICA as a new avenue for exploiting vulnerabilities in LLM safety measures. The approach utilizes a few targeted demonstrations that encourage a model to produce harmful content. This is achieved by using harmful input-output pairs in the prompt to influence the model's behavior without needing to alter the model parameters. The results showed that inserting these adversarial demonstrations could remarkably increase the model's attack success rates. For instance, the paper reported an elevation of attack success rate (ASR) from 3% to 64% for the Vicuna model under perplexity filtering defenses.

In-Context Defense (ICD): Parallel to ICA, the authors propose ICD as a strategy to guard LLMs against jailbreak attempts. This is accomplished by embedding safe demonstrations that display refusals to produce harmful outputs, into the model's prompt. By carefully curating demonstrations that consistently reject harmful content, ICD significantly reduces ASR in various attack scenarios. For instance, with only two safe demonstrations, ASR decreased from 90% to 6% against the GCG attack on the Vicuna model.

Theoretical Insights

Beyond empirical validations, the authors provide a theoretical framework to elucidate the effect of these adversarial demonstrations. They propose that adversarial demonstrations can induce a shift in the generation distribution of LLMs, biasing responses toward either harmful or safe language distributions. Theoretical analysis suggests that this pivotal influence can be achieved with only a few demonstrations.

Implications and Future Directions

The potent influence of a minimal number of in-context demonstrations highlights both a risk and an opportunity. It uncovers latent vulnerabilities in aligned LLMs, which can be exploited maliciously but also suggests pathways for strengthening model resilience through improved defense mechanisms like ICD. This dual finding underscores the necessity for robust, fine-grained controls in LLM deployment, especially for applications requiring high reliability and safety standards.

Future developments could focus on optimizing the selection and structuring of demonstrations to further enhance both attack detection and defense capabilities. Additionally, researchers could explore the interplay of in-context demonstrations with other aspects of model training and fine-tuning strategies to see if combined approaches yield more robust alignment against adversarial manipulations.

By pioneering this exploration into the modality of ICL for safety alignment, the paper lays foundational work for further advancements in the development and deployment of safer LLMs, opening avenues for AI systems that can dynamically adapt to and counter emergent threats in their interactions.