
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper (2402.15727v2)

Published 24 Feb 2024 in cs.CR and cs.AI

Abstract: Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf LLMs. A considerable amount of research exists proposing more effective jailbreak attacks, including the recent Greedy Coordinate Gradient (GCG) attack, jailbreak template-based attacks such as using "Do-Anything-Now" (DAN), and multilingual jailbreak. In contrast, the defensive side has been relatively less explored. This paper proposes a lightweight yet practical defense called SELFDEFEND, which can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts. Our key insight is that regardless of the kind of jailbreak strategies employed, they eventually need to include a harmful prompt (e.g., "how to make a bomb") in the prompt sent to LLMs, and we found that existing LLMs can effectively recognize such harmful prompts that violate their safety policies. Based on this insight, we design a shadow stack that concurrently checks whether a harmful prompt exists in the user prompt and triggers a checkpoint in the normal stack once a token of "No" or a harmful prompt is output. The latter could also generate an explainable LLM response to adversarial prompts. We demonstrate our idea of SELFDEFEND works in various jailbreak scenarios through manual analysis in GPT-3.5/4. We also list three future directions to further enhance SELFDEFEND.

SelfDefend: A Practical Defense Against LLM Jailbreaking

Introduction to Jailbreaking and Existing Defenses

Jailbreaking, in the context of LLMs, refers to adversarial tactics that circumvent the safety mechanisms built into these models to keep them from generating harmful or unethical content. The result is an arms race between increasingly effective jailbreak techniques and the defenses devised to counter them. Attack methods have evolved rapidly, from the optimization-based Greedy Coordinate Gradient (GCG) attack to template-based jailbreaks such as "Do-Anything-Now" (DAN) and multilingual approaches. By contrast, robust defenses against these attacks remain comparatively underexplored.

SelfDefend Mechanism

The paper introduces SelfDefend, a defense mechanism designed to address growing concern over the jailbreaking of LLMs. SelfDefend is a lightweight, practical solution that defends against a variety of jailbreak strategies with minimal added latency for end users. At its core, SelfDefend exploits the ability of current LLMs to recognize harmful prompts that violate their own safety policies. It does so through a dual-stack architecture: a "normal" stack processes the user prompt as usual, while a "shadow" stack runs in parallel to check whether the prompt contains harmful content. If harmful content is detected, a checkpoint in the normal stack is triggered, allowing the model to withhold the adversarial response and instead return an explainable message about why the request was blocked.
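
To make the dual-stack flow concrete, here is a minimal sketch of how it could be wired up. This is an illustration rather than the authors' implementation: the `query_llm` helper, the wording of the shadow-check prompt, and the "starts with 'No'" parsing are all assumptions.

```python
# Minimal sketch of SelfDefend's dual-stack idea (illustrative, not the paper's code).
# `query_llm` is a placeholder for any chat-completion call (e.g. GPT-3.5/4 via an API client).
from concurrent.futures import ThreadPoolExecutor

SHADOW_CHECK = (
    "Does the following request contain or imply an instruction that violates "
    "your safety policies? Answer exactly 'No' if it does not; otherwise quote "
    "the harmful part.\n\nRequest: {prompt}"
)

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; replace with an actual API client."""
    raise NotImplementedError

def self_defend(user_prompt: str) -> str:
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Normal stack: answer the user prompt as usual.
        normal = pool.submit(query_llm, user_prompt)
        # Shadow stack: concurrently ask whether the same prompt is harmful.
        shadow = pool.submit(query_llm, SHADOW_CHECK.format(prompt=user_prompt))

        verdict = shadow.result().strip()
        if verdict.lower().startswith("no"):
            # Checkpoint passes: release the normal response.
            return normal.result()
        # Checkpoint triggers: withhold the normal response and return an
        # explainable refusal built from the flagged fragment.
        return f"Request blocked: the prompt appears to request harmful content ({verdict})."
```

Because benign prompts typically pass the shadow check with a one-token "No", the extra latency in the normal case is small, which is consistent with the paper's claim of negligible delay for normal user prompts.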

Performance and Practical Applications

The efficacy of SelfDefend was assessed through a series of manual tests on popular models, namely GPT-3.5 and GPT-4. These evaluations covered a range of jailbreak categories, including GCG-based, template-based, and multilingual jailbreaks. The results indicate that SelfDefend identifies and mitigates harmful content across all test scenarios without inducing significant delays for normal user prompts, suggesting it can uphold the safety and integrity of LLM responses without compromising responsiveness or user experience.

Future Directions and Enhancements

While promising, SelfDefend invites further refinement for broader applicability and robustness against evolving jailbreak strategies. The paper outlines three future directions:

  • Developing a more cost-efficient and faster LLM dedicated to the accurate identification of harmful prompts, thereby enhancing the overall performance of SelfDefend.
  • Exploring the use of the identified adversarial examples (AEs) to fortify the alignment and safety mechanisms within LLMs, leveraging these insights to detect and negate future jailbreak attempts more effectively.
  • Implementing a caching mechanism within the shadow stack to avoid redundant checks of repeated or near-identical prompts, streamlining the processing pipeline (a sketch of this idea follows the list).
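
As a rough illustration of the third direction, the snippet below puts a hash-keyed cache in front of the shadow check, reusing the hypothetical `query_llm` and `SHADOW_CHECK` names from the earlier sketch; the keying and normalization scheme is an assumption, not a detail from the paper.

```python
# Hypothetical cache for the shadow stack, reusing the query_llm/SHADOW_CHECK
# placeholders from the earlier sketch. Keying and normalization are illustrative.
import hashlib

_verdict_cache: dict[str, str] = {}

def _cache_key(prompt: str) -> str:
    # Collapse trivial whitespace/case differences before hashing.
    return hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()

def cached_shadow_check(user_prompt: str) -> str:
    key = _cache_key(user_prompt)
    if key not in _verdict_cache:
        _verdict_cache[key] = query_llm(SHADOW_CHECK.format(prompt=user_prompt))
    return _verdict_cache[key]
```

A production version would likely expire entries and key on a more robust canonical form, but the core benefit is the same: repeated or near-identical prompts skip the extra shadow-stack LLM call.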

Comparative Analysis and Novel Contributions

Existing defenses fall broadly into tuning-based and non-tuning-based strategies; SelfDefend instead combines a checkpoint mechanism with a shadow-stack design. This approach adds minimal latency while defending against a wide spectrum of jailbreak strategies, and it requires no modification to the LLM's core architecture. It also stands in contrast to methods such as IAPrompt, which likewise analyze the input but may not counter sophisticated jailbreak attempts embedded within benign-looking prompts.

Conclusion

In sum, the SelfDefend framework presents a practical solution to the persistent challenge of LLM jailbreaking. Through its use of parallel checking and a checkpoint mechanism, it offers a scalable, effective defense capable of adapting to the evolving landscape of adversarial attacks on LLMs. As such, it marks a meaningful step in the ongoing effort to safeguard the ethical use and deployment of LLMs across diverse application domains.

Authors (4)
  1. Daoyuan Wu (39 papers)
  2. Shuai Wang (466 papers)
  3. Yang Liu (2253 papers)
  4. Ning Liu (199 papers)