LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
Abstract: The rapid development of LLMs has brought impressive advances across a wide range of tasks. Despite these achievements, LLMs still carry inherent safety risks, especially under jailbreak attacks. Most existing jailbreak methods follow an input-level manipulation paradigm to bypass safety mechanisms, but as alignment techniques improve, such attacks are becoming increasingly easy to detect. In this work, we identify an underexplored threat vector: the model's internal reasoning process, which can be manipulated to elicit harmful outputs in a stealthier way. To probe this overlooked attack surface, we propose a novel black-box jailbreak method, Analyzing-based Jailbreak (ABJ). ABJ comprises two independent attack paths, textual and visual reasoning attacks, that exploit the model's multimodal reasoning capabilities to bypass safety mechanisms and comprehensively expose vulnerabilities in its reasoning chain. We conduct extensive experiments with ABJ across various open-source and closed-source LLMs, VLMs, and RLMs. In particular, ABJ achieves a high attack success rate (ASR) of 82.1% on GPT-4o-2024-11-20 with exceptional attack efficiency (AE) across all target models, demonstrating strong attack effectiveness, transferability, and efficiency. Our work reveals a new type of safety risk and highlights the urgent need to mitigate implicit vulnerabilities in the model's reasoning process.
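To make the two reported metrics concrete, the following is a minimal sketch (not the authors' evaluation code) of how ASR and AE could be computed over a set of attack attempts. It assumes each attempt records whether the target model produced a harmful response and how many queries the attack consumed, and it takes AE to mean average queries per successful attack, which is one common definition; the paper's exact definition may differ.

```python
# Illustrative sketch of the ASR and AE metrics named in the abstract.
# Assumptions (not from the paper): each attempt logs success and query count,
# and AE is interpreted as mean queries per successful attack (lower is better).
from dataclasses import dataclass
from typing import List


@dataclass
class AttackAttempt:
    prompt: str          # the harmful instruction being tested
    succeeded: bool      # did the target model produce a harmful response?
    num_queries: int     # number of model queries the attack consumed


def attack_success_rate(attempts: List[AttackAttempt]) -> float:
    """ASR: fraction of prompts for which the attack elicited a harmful response."""
    if not attempts:
        return 0.0
    return sum(a.succeeded for a in attempts) / len(attempts)


def attack_efficiency(attempts: List[AttackAttempt]) -> float:
    """AE (assumed definition): average queries spent per successful attack."""
    successes = [a for a in attempts if a.succeeded]
    if not successes:
        return float("inf")
    return sum(a.num_queries for a in successes) / len(successes)


if __name__ == "__main__":
    # Toy data: 4 of 5 prompts jailbroken, giving an ASR of 80%.
    results = [
        AttackAttempt("p1", True, 1),
        AttackAttempt("p2", True, 2),
        AttackAttempt("p3", False, 3),
        AttackAttempt("p4", True, 1),
        AttackAttempt("p5", True, 1),
    ]
    print(f"ASR: {attack_success_rate(results):.1%}")
    print(f"AE:  {attack_efficiency(results):.2f} queries per success")
```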