Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models (2407.16205v3)
Abstract: The rapid development of large language models (LLMs) has brought remarkable generative capabilities across diverse tasks. However, despite these impressive achievements, LLMs still have numerous inherent vulnerabilities, particularly when faced with jailbreak attacks. By investigating jailbreak attacks, we can uncover hidden weaknesses in LLMs and inform the development of more robust defense mechanisms to fortify their security. In this paper, we further explore the boundary of jailbreak attacks on LLMs and propose Analyzing-based Jailbreak (ABJ). This effective jailbreak attack method exploits LLMs' growing analytical and reasoning capabilities and reveals their underlying vulnerabilities when facing analyzing-based tasks. We conduct a detailed evaluation of ABJ across various open-source and closed-source LLMs; it achieves a 94.8% attack success rate (ASR) and an attack efficiency (AE) of 1.06 on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency. Our research highlights the importance of prioritizing and enhancing the safety of LLMs to mitigate the risks of misuse. The code is publicly available at https://github.com/theshi-1128/ABJ-Attack. Warning: This paper contains examples of LLM outputs that may be offensive or harmful.
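The abstract reports two evaluation metrics, attack success rate (ASR) and attack efficiency (AE). For reference, the sketch below shows one common way such jailbreak metrics are computed. It is not taken from the ABJ codebase, and it assumes AE denotes the average number of queries issued per target prompt, which may differ from the paper's exact definition; the `AttackRecord` structure is likewise hypothetical.

```python
# Minimal sketch (assumptions noted above), not the authors' implementation:
# compute attack success rate (ASR) and an assumed query-count notion of
# attack efficiency (AE) over a set of per-prompt attack outcomes.

from dataclasses import dataclass
from typing import List


@dataclass
class AttackRecord:
    """Outcome of running the attack on one harmful prompt (hypothetical)."""
    succeeded: bool      # did any attempt elicit a harmful response?
    queries_used: int    # total queries sent to the target model


def attack_success_rate(records: List[AttackRecord]) -> float:
    """Fraction of prompts for which the attack eventually succeeded."""
    return sum(r.succeeded for r in records) / len(records)


def attack_efficiency(records: List[AttackRecord]) -> float:
    """Average number of queries per prompt (assumed definition of AE)."""
    return sum(r.queries_used for r in records) / len(records)


if __name__ == "__main__":
    records = [
        AttackRecord(succeeded=True, queries_used=1),
        AttackRecord(succeeded=True, queries_used=1),
        AttackRecord(succeeded=False, queries_used=2),
    ]
    print(f"ASR: {attack_success_rate(records):.1%}")  # 66.7%
    print(f"AE:  {attack_efficiency(records):.2f}")    # 1.33
```

Under these assumptions, a low AE close to 1 would mean most prompts succeed on the first query, consistent with the efficiency claim in the abstract.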
Authors: Shi Lin, Rongchang Li, Xun Wang, Changting Lin, Wenpeng Xing, Meng Han