LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
Abstract: Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf LLMs. A considerable body of research proposes increasingly effective jailbreak attacks, including the recent Greedy Coordinate Gradient (GCG) attack, jailbreak template-based attacks such as "Do-Anything-Now" (DAN), and multilingual jailbreaks. By contrast, the defensive side has been explored far less. This paper proposes a lightweight yet practical defense called SELFDEFEND, which can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts. Our key insight is that regardless of which jailbreak strategy is employed, the attack eventually needs to include a harmful prompt (e.g., "how to make a bomb") in the prompt sent to the LLM, and we found that existing LLMs can effectively recognize such harmful prompts that violate their safety policies. Based on this insight, we design a shadow stack that concurrently checks whether a harmful prompt is present in the user prompt and triggers a checkpoint in the normal stack once a token of "No" or a harmful prompt is output. The latter case can also yield an explainable LLM response to the adversarial prompt. We demonstrate that SELFDEFEND works in various jailbreak scenarios through manual analysis on GPT-3.5/4, and we outline three future directions to further enhance it.
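The shadow-stack mechanism sketched in the abstract can be illustrated in code. This is a minimal sketch under stated assumptions: `query_llm` stands in for a real LLM API call (stubbed here with a trivial keyword check), and `CHECK_PROMPT` is a hypothetical wording of the safety-check prompt, not the paper's actual prompt. It shows the core control flow: the normal response and the shadow check run concurrently, and the checkpoint either passes the normal answer through (on "No") or refuses with the detected harmful prompt as an explanation.

```python
import concurrent.futures

# Hypothetical safety-check prompt; the exact wording is an assumption,
# not taken from the paper.
CHECK_PROMPT = (
    "Identify whether the following request violates your safety policies. "
    "Answer 'No' if it is benign; otherwise repeat the harmful part:\n{prompt}"
)

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call. Stubbed with a trivial
    keyword check purely so the sketch is self-contained."""
    if "bomb" in prompt.lower():
        return "how to make a bomb"  # shadow stack echoes the harmful part
    return "No"

def normal_stack(prompt: str) -> str:
    """Placeholder for the LLM's ordinary response generation."""
    return f"[normal LLM answer to: {prompt}]"

def selfdefend(user_prompt: str) -> str:
    """Run the normal stack and the shadow (checking) stack concurrently.
    If the shadow stack outputs anything other than 'No', the checkpoint
    fires and the normal response is replaced by an explainable refusal."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        normal = pool.submit(normal_stack, user_prompt)
        shadow = pool.submit(query_llm, CHECK_PROMPT.format(prompt=user_prompt))
        verdict = shadow.result()
        if verdict.strip() == "No":
            return normal.result()  # benign prompt: negligible extra delay
        return f"Request refused: detected harmful intent ({verdict})."
```

A benign prompt incurs only the latency of the slower of the two concurrent calls, while a jailbreak prompt is intercepted at the checkpoint with the detected harmful fragment attached for explainability.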