
Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection

Published 7 Apr 2024 in cs.CR and cs.AI (arXiv:2404.04849v2)

Abstract: Jailbreak attacks on large language models (LLMs) craft prompts that exploit the models into generating malicious content. Existing jailbreak attacks can deceive the LLM, but they cannot deceive a human reviewer (e.g., a security analyst). This paper proposes a new type of jailbreak attack that deceives both the LLM and the human. The key insight is borrowed from social psychology: people are easily deceived when a lie is hidden inside truth. Based on this insight, we propose logic-chain injection attacks, which embed malicious intent within benign truths. A logic-chain injection attack first disassembles its malicious target into a chain of benign narrations, and then distributes those narrations throughout a related benign article composed of undisputed facts. In this way, the newly generated prompt can deceive not only the LLM but also the human reviewer.
