Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge (2404.05880v2)
Abstract: Jailbreaking attacks can enable LLMs to bypass safeguards and generate harmful content. Existing jailbreaking defense methods fail to address the fundamental issue that harmful knowledge resides within the model, leaving LLMs exposed to jailbreak risks. In this paper, we propose a novel defense method called Eraser, which pursues three goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment. The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it no longer has the ability to answer harmful questions. Training Eraser does not require the model's own harmful knowledge; it can instead unlearn generic answers related to harmful queries, meaning it needs no assistance from a red team. Experimental results show that Eraser significantly reduces the jailbreaking success rate against various attacks without compromising the model's general capabilities. Our code is available at https://github.com/ZeroNLP/Eraser.
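The abstract describes Eraser as jointly optimizing three goals during training. Below is a minimal sketch of how such a three-term objective could be combined for a HuggingFace-style causal LM; the function names, loss weights, data batches, and the choice of gradient ascent for the forgetting term are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a three-term training objective in the spirit of Eraser:
# (1) unlearn knowledge needed to answer harmful queries,
# (2) retain general capabilities on benign data,
# (3) maintain safety alignment (refusal behavior) on harmful prompts.

def lm_loss(model, batch):
    """Standard causal-LM cross-entropy on a tokenized batch (HuggingFace-style model)."""
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])
    return out.loss

def eraser_style_loss(model, harmful_batch, general_batch, safety_batch,
                      w_forget=1.0, w_retain=1.0, w_align=1.0):
    # (1) Forget: gradient ascent on answers to harmful queries = negated LM loss.
    forget = -lm_loss(model, harmful_batch)
    # (2) Retain: ordinary LM loss on benign instruction-following data.
    retain = lm_loss(model, general_batch)
    # (3) Align: LM loss on (harmful prompt, refusal response) pairs.
    align = lm_loss(model, safety_batch)
    return w_forget * forget + w_retain * retain + w_align * align

# Usage (assuming a causal LM and prepared DataLoaders over the three data sources):
# loss = eraser_style_loss(model, next(harmful_iter), next(general_iter), next(safety_iter))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Consistent with the abstract, the harmful batch here need not contain the model's own harmful outputs; generic answers to harmful queries would serve as the unlearning targets.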
Authors: Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, Cen Chen