DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers (2402.16914v3)

Published 25 Feb 2024 in cs.CR, cs.AI, and cs.CL

Abstract: The safety alignment of LLMs is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs into producing harmful content. However, current jailbreak methods, which nest the entire harmful prompt inside a carrier prompt, are not effective at concealing malicious intent and are easily identified and rejected by well-aligned LLMs. This paper shows that decomposing a malicious prompt into separate sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack). DrAttack has three key components: (a) 'Decomposition' of the original prompt into sub-prompts; (b) 'Reconstruction' of these sub-prompts, carried out implicitly through in-context learning with a semantically similar but harmless reassembling demo; and (c) a 'Synonym Search' over sub-prompts, which finds replacements that preserve the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with a significantly reduced number of queries, DrAttack achieves a substantial gain in success rate over prior state-of-the-art prompt-only attacks. Notably, a success rate of 78.0% on GPT-4 with only 15 queries surpasses the previous best by 33.1%. The project is available at https://github.com/xirui-li/DrAttack.
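
For orientation, below is a minimal, heavily simplified sketch of the three stages named in the abstract (decomposition, in-context reconstruction, synonym search). The word-chunk splitter, the benign demo text, and the `query_llm` / `is_jailbroken` callables are illustrative placeholders assumed for this sketch, not the paper's actual implementation; see the linked repository for the real code.

```python
from typing import Callable, List

def decompose(prompt: str) -> List[str]:
    # (a) Decomposition: break the prompt into short fragments so that no
    # single fragment carries the full intent. A real system would split on
    # a parse tree; here we simply chunk every three words.
    words = prompt.split()
    return [" ".join(words[i:i + 3]) for i in range(0, len(words), 3)]

def build_demo(benign_parts: List[str], benign_answer: str) -> str:
    # (b) Reconstruction via in-context learning: show the model a harmless
    # example of answering labelled fragments, so it applies the same
    # template to the attack fragments without ever seeing the joined prompt.
    labelled = " ".join(f"[{i}] {p}" for i, p in enumerate(benign_parts))
    return (f"Example: given the fragments {labelled}, "
            f"answer the request they form when joined.\n"
            f"Answer: {benign_answer}\n\n")

def synonym_candidates(part: str) -> List[str]:
    # (c) Synonym search: the paper explores replacements that keep the
    # meaning while lowering refusal rates; this placeholder just returns
    # the fragment unchanged.
    return [part]

def dr_attack(prompt: str,
              query_llm: Callable[[str], str],
              is_jailbroken: Callable[[str], bool]) -> str:
    # Hypothetical driver loop: try synonym substitutions fragment by
    # fragment and return the first response judged as a successful jailbreak.
    parts = decompose(prompt)
    demo = build_demo(["bake a rich", "chocolate cake"],
                      "Sure! Start by preheating the oven ...")
    for i in range(len(parts)):
        for candidate in synonym_candidates(parts[i]):
            trial = parts[:i] + [candidate] + parts[i + 1:]
            labelled = " ".join(f"[{j}] {p}" for j, p in enumerate(trial))
            response = query_llm(demo + f"Now, given the fragments {labelled}, "
                                        "answer the request they form when joined.")
            if is_jailbroken(response):
                return response
    return ""
```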

Authors (5)
  1. Xirui Li (10 papers)
  2. Ruochen Wang (29 papers)
  3. Minhao Cheng (43 papers)
  4. Tianyi Zhou (172 papers)
  5. Cho-Jui Hsieh (211 papers)
Citations (23)