Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection (2404.04849v2)
Abstract: Jailbreak attacks on large language models (LLMs) craft prompts that exploit the models into generating malicious content. Existing jailbreak attacks can successfully deceive LLMs, but they cannot deceive a human reviewer. This paper proposes a new type of jailbreak attack that can deceive both the LLM and a human (i.e., a security analyst). The key insight is borrowed from social psychology: humans are easily deceived when a lie is hidden within truth. Based on this insight, we propose logic-chain injection attacks, which embed malicious intent into benign truths. A logic-chain injection attack first disassembles its malicious target into a chain of benign narrations, and then distributes these narrations throughout a related benign article composed of undisputed facts. In this way, the newly generated prompt deceives not only the LLM but also the human reviewer.