Intention Analysis Makes LLMs A Good Jailbreak Defender (2401.06561v4)
Abstract: Aligning LLMs with human values, particularly against complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook the intrinsic nature of jailbreaks, namely the harmful intention concealed within an otherwise innocuous-looking prompt, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, Intention Analysis ($\mathbb{IA}$). $\mathbb{IA}$ works by triggering LLMs' inherent ability to self-correct and improve through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing a final policy-aligned response based on the first-round conversation. Notably, $\mathbb{IA}$ is an inference-only method and can therefore enhance LLM safety without compromising helpfulness. Extensive experiments on diverse jailbreak benchmarks across a wide range of LLMs show that $\mathbb{IA}$ consistently and significantly reduces the harmfulness of responses (a 48.2% reduction in attack success rate on average). Encouragingly, with $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. We empirically demonstrate that, to some extent, $\mathbb{IA}$ is robust to errors in the generated intentions. Further analyses reveal the underlying principle of $\mathbb{IA}$: suppressing the LLM's tendency to follow jailbreak prompts, thereby enhancing safety.
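The two-stage procedure described in the abstract maps directly onto an inference-time prompting loop. The sketch below shows one way it could be wired up; the prompt wording, the `chat` callback, and the function name `intention_analysis_defense` are illustrative assumptions rather than the paper's exact prompts or reference implementation.

```python
# Minimal sketch of a two-stage intention-analysis defense, assuming a generic
# chat-completion backend. Prompt texts and names here are illustrative, not
# the paper's exact prompts.

from typing import Callable, Dict, List

Message = Dict[str, str]
ChatFn = Callable[[List[Message]], str]  # any chat-completion backend

# Stage 1: ask the model to analyze the essential intention of the user input.
INTENTION_PROMPT = (
    "Identify the essential intention behind the following user query. "
    "Do not answer it yet; only analyze its intention:\n\n{query}"
)

# Stage 2: ask for a final response conditioned on the first-round analysis,
# with an explicit reminder to stay policy-aligned.
RESPONSE_PROMPT = (
    "Given the intention you analyzed above, now respond to the original "
    "query. Your response must strictly follow safety policy and refuse "
    "any harmful request."
)


def intention_analysis_defense(query: str, chat: ChatFn) -> str:
    """Run the two-stage, inference-only defense: intention analysis first,
    then a policy-aligned answer grounded in that analysis."""
    history: List[Message] = [
        {"role": "user", "content": INTENTION_PROMPT.format(query=query)}
    ]
    intention = chat(history)  # stage 1: intention analysis
    history.append({"role": "assistant", "content": intention})
    history.append({"role": "user", "content": RESPONSE_PROMPT})
    return chat(history)  # stage 2: final policy-aligned response
```

Because both stages are ordinary chat turns at inference time, the defense requires no fine-tuning, which is why it can improve safety without modifying the model's weights or its behavior on benign queries.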