Intention Analysis Makes LLMs A Good Jailbreak Defender (2401.06561v4)

Published 12 Jan 2024 in cs.CL

Abstract: Aligning LLMs with human values, particularly when facing complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook the intrinsically stealthy nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$). $\mathbb{IA}$ works by triggering LLMs' inherent self-correction and improvement ability through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing a final policy-aligned response based on the first-round conversation. Notably, $\mathbb{IA}$ is an inference-only method and thus can enhance LLM safety without compromising helpfulness. Extensive experiments on various jailbreak benchmarks across a wide range of LLMs show that $\mathbb{IA}$ consistently and significantly reduces the harmfulness of responses (a 48.2% average reduction in attack success rate). Encouragingly, with our $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. We empirically demonstrate that, to some extent, $\mathbb{IA}$ is robust to errors in the generated intentions. Further analyses reveal the underlying principle of $\mathbb{IA}$: suppressing LLMs' tendency to follow jailbreak prompts, thereby enhancing safety.
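
To make the two-stage procedure described in the abstract concrete, the sketch below shows how such an inference-only defense could be wired around any chat-completion function. The prompt wording and the `chat` helper are illustrative assumptions for this sketch, not the prompts or interface released by the authors.

```python
# Minimal sketch of the two-stage Intention Analysis (IA) prompting flow
# described in the abstract: (1) analyze the essential intention of the
# user input, (2) produce a policy-aligned answer conditioned on the
# first-round conversation. Prompt text and the `chat` callable are
# assumptions; `chat` stands in for any chat API mapping messages -> reply.

from typing import Callable, Dict, List

Message = Dict[str, str]

# Assumed prompts paraphrasing the two IA stages (not the paper's exact wording).
IA_STAGE1_PROMPT = (
    "Identify the essential intention behind the following user query. "
    "Do not answer the query itself yet.\n\nUser query: {query}"
)
IA_STAGE2_PROMPT = (
    "Given the intention you just analyzed, now respond to the original query. "
    "Be helpful, but refuse or give a safe answer if the intention is harmful."
)


def intention_analysis_defense(query: str, chat: Callable[[List[Message]], str]) -> str:
    """Inference-only IA defense: analyze intent first, then answer."""
    # Stage 1: ask the model to state the essential intention of the input.
    messages: List[Message] = [
        {"role": "user", "content": IA_STAGE1_PROMPT.format(query=query)}
    ]
    intention = chat(messages)

    # Stage 2: condition the final response on the first-round conversation.
    messages += [
        {"role": "assistant", "content": intention},
        {"role": "user", "content": IA_STAGE2_PROMPT},
    ]
    return chat(messages)
```

Because both stages are ordinary chat turns, this kind of wrapper requires no fine-tuning and can be applied to any instruction-following model at inference time.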
