
GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis (2402.13494v2)

Published 21 Feb 2024 in cs.CL and cs.CR

Abstract: LLMs face threats from jailbreak prompts. Existing methods for detecting jailbreak prompts are primarily online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and training processes. In this study, we propose GradSafe, which effectively detects jailbreak prompts by scrutinizing the gradients of safety-critical parameters in LLMs. Our method is grounded in a pivotal observation: the gradients of an LLM's loss for jailbreak prompts paired with a compliance response exhibit similar patterns on certain safety-critical parameters. In contrast, safe prompts lead to different gradient patterns. Building on this observation, GradSafe analyzes the gradients from prompts (paired with compliance responses) to accurately detect jailbreak prompts. We show that GradSafe, applied to Llama-2 without further training, outperforms Llama Guard in detecting jailbreak prompts, despite Llama Guard's extensive finetuning on a large dataset. This superior performance is consistent across both zero-shot and adaptation scenarios, as evidenced by our evaluations on ToxicChat and XSTest. The source code is available at https://github.com/xyq7/GradSafe.
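
The abstract outlines the core mechanism: pair the incoming prompt with a fixed compliance response, take the gradient of the model's loss on that pair, and compare the gradient pattern on safety-critical parameters against the pattern produced by known unsafe prompts. The sketch below (PyTorch/Transformers) illustrates one way such a detector could be wired up; the compliance string, the averaged gradient reference, the choice of "critical" parameter names, and the decision threshold are all illustrative assumptions rather than the paper's actual procedure (see the linked GitHub repository for that).

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of gradient-based jailbreak detection in the spirit of GradSafe.
# The model name, compliance response, reference construction, parameter selection,
# and threshold are placeholders chosen for illustration.

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # chat model used in the paper's setting
COMPLIANCE = "Sure"                            # fixed compliance response paired with each prompt

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()  # no training; we only need one backward pass per prompt


def loss_gradients(prompt: str) -> dict[str, torch.Tensor]:
    """Gradients of the LM loss on (prompt, compliance response) w.r.t. the parameters."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + COMPLIANCE, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100      # score only the compliance-response tokens
    model.zero_grad()
    out = model(input_ids=full_ids, labels=labels)
    out.loss.backward()
    return {name: p.grad.detach().clone()
            for name, p in model.named_parameters() if p.grad is not None}


def build_reference(unsafe_prompts: list[str]) -> dict[str, torch.Tensor]:
    """Average gradients over a few known unsafe prompts to form a reference pattern."""
    ref: dict[str, torch.Tensor] = {}
    for p in unsafe_prompts:
        for name, g in loss_gradients(p).items():
            ref[name] = ref.get(name, 0) + g / len(unsafe_prompts)
    return ref


def is_jailbreak(prompt: str, reference: dict[str, torch.Tensor],
                 critical: list[str], threshold: float = 0.25) -> bool:
    """Flag the prompt if its gradients align with the unsafe reference on the
    (assumed) safety-critical parameter slices."""
    grads = loss_gradients(prompt)
    sims = [F.cosine_similarity(grads[n].flatten(), reference[n].flatten(), dim=0)
            for n in critical if n in grads and n in reference]
    return torch.stack(sims).mean().item() > threshold
```

In this sketch, detection costs one forward-backward pass per prompt and needs no additional training data beyond the handful of unsafe prompts used to build the reference, which is the property the abstract contrasts with moderation APIs and finetuned classifiers such as Llama Guard.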

Authors (4)
  1. Yueqi Xie (22 papers)
  2. Minghong Fang (34 papers)
  3. Renjie Pi (37 papers)
  4. Neil Gong (14 papers)
Citations (11)