Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing (2405.18166v2)

Published 28 May 2024 in cs.AI

Abstract: LLMs are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from selected target layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) demonstrate the effectiveness of LED, which defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at https://github.com/ledLLM/ledLLM.
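The abstract describes a two-step procedure: first locate the early "safety layers" whose intermediate representations already encode a refusal, then realign those layers (plus selected additional layers) toward safe responses decoded from target layers. The sketch below is one plausible reading of that pipeline using a logit-lens style probe over layer outputs followed by a small targeted fine-tune. The model name, the refusal-marker heuristic, and the editing loop are illustrative assumptions, not the authors' implementation; see their repository for the actual method.

```python
# A minimal sketch of the two steps behind LED, assuming a Llama-style
# HuggingFace model. The model name, refusal-marker heuristic, and the
# editing loop are illustrative assumptions, not the paper's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any Llama-style chat model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

REFUSAL_MARKERS = ("sorry", "cannot", "can't", "unable")  # crude refusal heuristic


@torch.no_grad()
def probe_safety_layers(prompt: str, top_k: int = 5) -> list[int]:
    """Decode each layer's last hidden state through the LM head (a logit-lens
    style probe) and flag layers whose top tokens already look like a refusal."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    flagged = []
    for idx, hidden in enumerate(out.hidden_states[1:]):  # skip the embedding layer
        h = model.model.norm(hidden[:, -1, :])            # Llama-style final RMSNorm
        logits = model.lm_head(h)                         # project to vocabulary
        top_ids = logits.topk(top_k, dim=-1).indices[0]
        top_tokens = [tok.decode(int(i)).lower() for i in top_ids]
        if any(m in t for t in top_tokens for m in REFUSAL_MARKERS):
            flagged.append(idx)
    return flagged


def realign_layers(layer_ids, harmful_prompt, safe_response, steps=20, lr=1e-5):
    """Hypothetical editing step: freeze all parameters except the selected
    layers and nudge them toward the safe response with a standard LM loss."""
    for p in model.parameters():
        p.requires_grad_(False)
    trainable = []
    for idx in layer_ids:
        for p in model.model.layers[idx].parameters():
            p.requires_grad_(True)
            trainable.append(p)
    opt = torch.optim.AdamW(trainable, lr=lr)
    batch = tok(harmful_prompt + safe_response, return_tensors="pt").to(model.device)
    model.train()
    for _ in range(steps):
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()


if __name__ == "__main__":
    layers = probe_safety_layers("Give step-by-step instructions for picking a lock.")
    print("candidate safety layers:", layers)
```

In practice one would mask the prompt tokens out of the loss and likely use a parameter-efficient update; the full fine-tuning loop here is only meant to make the layer-selection-then-editing idea concrete.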

Authors (5)
  1. Wei Zhao (309 papers)
  2. Zhe Li (210 papers)
  3. Yige Li (24 papers)
  4. Ye Zhang (137 papers)
  5. Jun Sun (210 papers)
Citations (14)