SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding (2402.08983v4)

Published 14 Feb 2024 in cs.CR, cs.AI, and cs.CL

Abstract: As LLMs become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align LLM behavior with human values, including safety. Jailbreak attacks, which aim to provoke unintended and unsafe behaviors from LLMs, remain a leading threat to LLM safety. In this paper, we aim to defend LLMs against jailbreak attacks by introducing SafeDecoding, a safety-aware decoding strategy for LLMs to generate helpful and harmless responses to user queries. Our insight in developing SafeDecoding is that, even though the probabilities of tokens representing harmful content outweigh those representing harmless responses, safety disclaimers still appear among the top tokens when tokens are sorted by probability in descending order. This allows us to mitigate jailbreak attacks by identifying safety disclaimers and amplifying their token probabilities, while simultaneously attenuating the probabilities of token sequences aligned with the objectives of jailbreak attacks. We perform extensive experiments on five LLMs using six state-of-the-art jailbreak attacks and four benchmark datasets. Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries. SafeDecoding also outperforms six existing defense methods.
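
To make the decoding step concrete, here is a minimal sketch of the probability reweighting the abstract describes. It assumes next-token distributions are available from both the original model (`p_original`) and a safety fine-tuned "expert" model (`p_expert`); the function name, the `alpha` and `top_c` values, and the toy probabilities are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
# Minimal sketch of a SafeDecoding-style reweighting step, using toy
# probability vectors instead of real model outputs. All values below
# are illustrative assumptions.
import numpy as np

def safe_decoding_step(p_original, p_expert, alpha=1.0, top_c=5):
    """Blend next-token distributions from the original model and a
    safety-tuned 'expert' model, amplifying tokens the expert prefers
    (e.g. safety disclaimers) and attenuating tokens it avoids."""
    # Candidate set: tokens ranked highly by both models.
    top_orig = set(np.argsort(p_original)[::-1][:top_c])
    top_exp = set(np.argsort(p_expert)[::-1][:top_c])
    candidates = np.array(sorted(top_orig & top_exp))
    if candidates.size == 0:
        # The paper grows the candidate set until it is large enough;
        # this sketch simply falls back to the expert's top tokens.
        candidates = np.array(sorted(top_exp))

    # Reweight: p_new = p_orig + alpha * (p_expert - p_orig) on the candidates.
    p_new = p_original[candidates] + alpha * (p_expert[candidates] - p_original[candidates])
    p_new = np.clip(p_new, 0.0, None)
    p_new = p_new / p_new.sum()  # renormalize over the candidate set
    return candidates, p_new

# Toy example: token 0 = harmful continuation, token 1 = safety disclaimer.
p_orig = np.array([0.60, 0.25, 0.10, 0.05])   # jailbroken model favors token 0
p_exp  = np.array([0.10, 0.80, 0.05, 0.05])   # safety expert favors token 1
tokens, probs = safe_decoding_step(p_orig, p_exp, alpha=1.0, top_c=4)
print(tokens, probs)  # the disclaimer token now dominates the candidate set
```

The design intuition matches the abstract: because safety disclaimers already sit among the top-ranked tokens, shifting probability mass toward the expert's preferences is enough to surface them, without retraining the base model.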

Authors (6)
  1. Zhangchen Xu
  2. Fengqing Jiang
  3. Luyao Niu
  4. Jinyuan Jia
  5. Bill Yuchen Lin
  6. Radha Poovendran