Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes (2403.00867v3)
Abstract: Large language models (LLMs) have become a prominent generative AI tool: the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aimed at subverting the embedded safety guardrails. To address this challenge, this paper defines and investigates the Refusal Loss of LLMs and then proposes a method called Gradient Cuff to detect jailbreak attempts. Gradient Cuff exploits the unique properties observed in the refusal loss landscape, including its function values and smoothness, to design an effective two-step detection strategy. Experimental results on two aligned LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5) and six types of jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) show that Gradient Cuff can significantly improve the LLM's rejection capability for malicious jailbreak queries, while maintaining the model's performance for benign user queries by adjusting the detection threshold.
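To make the two-step strategy concrete, below is a minimal Python sketch under stated assumptions: the refusal loss is taken to be one minus the empirical refusal rate over N sampled responses, refusals are detected by simple keyword matching, and the gradient norm is approximated with a zeroth-order (random-direction finite-difference) estimator over a query embedding. The helper `respond` (a stochastic decoder mapping an embedding to one sampled model response), the keyword list, and the thresholds are all hypothetical placeholders for illustration, not the paper's exact implementation.

```python
import numpy as np

# Simplified refusal keywords (assumption; the paper's refusal check may differ).
REFUSAL_MARKERS = ("I cannot", "I can't", "I'm sorry", "I am unable")

def is_refusal(response: str) -> bool:
    """Keyword-based refusal check (a simplification for this sketch)."""
    return any(marker in response for marker in REFUSAL_MARKERS)

def refusal_loss(embedding: np.ndarray, respond, n_samples: int = 8) -> float:
    """Monte-Carlo estimate of phi(x) = 1 - E[model refuses query x].

    `respond` is a hypothetical stochastic decoder: it maps a query
    embedding to one sampled model response (temperature > 0)."""
    refusals = sum(is_refusal(respond(embedding)) for _ in range(n_samples))
    return 1.0 - refusals / n_samples

def grad_norm_estimate(embedding: np.ndarray, respond,
                       mu: float = 0.01, n_dirs: int = 10) -> float:
    """Zeroth-order estimate of ||grad phi(x)|| via random-direction
    finite differences (no backpropagation through the LLM needed)."""
    base = refusal_loss(embedding, respond)
    grad = np.zeros_like(embedding)
    for _ in range(n_dirs):
        u = np.random.randn(*embedding.shape)
        u /= np.linalg.norm(u)
        grad += (refusal_loss(embedding + mu * u, respond) - base) / mu * u
    return float(np.linalg.norm(grad / n_dirs))

def gradient_cuff_reject(embedding: np.ndarray, respond,
                         loss_thresh: float = 0.5,
                         grad_thresh: float = 1.0) -> bool:
    """Two-step detection: True means the query is rejected.
    Threshold values here are illustrative placeholders."""
    # Step 1: if the model already refuses most sampled responses,
    # the query can be rejected outright.
    if refusal_loss(embedding, respond) < loss_thresh:
        return True
    # Step 2: a large gradient-norm estimate signals the less-smooth
    # refusal loss landscape characteristic of jailbreak queries.
    return grad_norm_estimate(embedding, respond) > grad_thresh
```

In this sketch, step 1 filters queries the model already mostly refuses, while step 2 flags jailbreak queries whose large estimated gradient norm reflects the rougher refusal loss landscape the paper observes around adversarial prompts; `grad_thresh` plays the role of the adjustable detection threshold mentioned in the abstract, trading off rejection of malicious queries against performance on benign ones.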
- A general language assistant as a laboratory for alignment. CoRR, abs/2112.00861, 2021.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. CoRR, abs/2204.05862, 2022.
- A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Found. Comput. Math., 22(2):507–560, 2022.
- Jailbreaking black box large language models in twenty queries. CoRR, abs/2310.08419, 2023.
- ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26, 2017.
- Query-efficient hard-label black-box attack: An optimization-based approach. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
- Certified adversarial robustness via randomized smoothing. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 1310–1320. PMLR, 2019.
- Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 889–898, 2018.
- Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
- Black-box adversarial attacks with limited queries and information. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 2142–2151. PMLR, 2018.
- Baseline defenses for adversarial attacks against aligned language models. CoRR, abs/2309.00614, 2023.
- In conversation with artificial intelligence: aligning language models with human values. CoRR, abs/2209.00731, 2022.
- Certifying LLM safety against adversarial prompting. CoRR, abs/2309.02705, 2023.
- Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 6391–6401, 2018.
- A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications. IEEE Signal Process. Mag., 37(5):43–54, 2020.
- AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. CoRR, abs/2310.04451, 2023.
- Tree of attacks: Jailbreaking black-box LLMs automatically. CoRR, abs/2312.02119, 2023.
- OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
- Training language models to follow instructions with human feedback. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- SmoothLLM: Defending large language models against jailbreaking attacks. CoRR, abs/2310.03684, 2023.
- Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023.
- On adaptive attacks to adversarial example defenses. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- The art of defending: A systematic evaluation and analysis of LLM defense strategies on safety and over-defensiveness. CoRR, abs/2401.00287, 2024.
- Jailbroken: How does LLM safety training fail? CoRR, abs/2307.02483, 2023a.
- Jailbreak and guard aligned language models with only few in-context demonstrations. CoRR, abs/2310.06387, 2023b.
- Defending ChatGPT against jailbreak attack via self-reminders. Nat. Mach. Intell., 5(12):1486–1496, 2023.
- Low-resource languages jailbreak GPT-4. CoRR, abs/2310.02446, 2023.
- Intention analysis prompting makes large language models A good jailbreak defender. CoRR, abs/2401.06561, 2024.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. CoRR, abs/2306.05685, 2023.
- Universal and transferable adversarial attacks on aligned language models. CoRR, abs/2307.15043, 2023.