Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection (2404.04849v2)

Published 7 Apr 2024 in cs.CR and cs.AI

Abstract: Jailbreak attacks on Large Language Models (LLMs) craft prompts that exploit the models into generating malicious content. Existing jailbreak attacks can successfully deceive LLMs, but they cannot deceive humans. This paper proposes a new type of jailbreak attack that can deceive both the LLMs and humans (i.e., security analysts). The key insight is borrowed from social psychology: humans are easily deceived when a lie is hidden within truth. Based on this insight, we propose logic-chain injection attacks, which inject malicious intent into benign truth. A logic-chain injection attack first disassembles its malicious target into a chain of benign narrations and then distributes those narrations throughout a related benign article of indisputable facts. In this way, the newly generated prompt can deceive not only the LLMs but also human reviewers.
