Don't Say No: Jailbreaking LLM by Suppressing Refusal (2404.16369v1)

Published 25 Apr 2024 in cs.CL

Abstract: Ensuring the safety alignment of LLMs is crucial for generating responses consistent with human values. Despite their ability to recognize and avoid harmful queries, LLMs are vulnerable to "jailbreaking" attacks, in which carefully crafted prompts induce them to produce toxic content. One category of jailbreak attack reformulates the task as an adversarial attack that elicits an affirmative response from the LLM. However, the typical attack in this category, GCG, achieves only a limited attack success rate. In this study, to better investigate jailbreak attacks, we introduce the DSN (Don't Say No) attack, which not only prompts LLMs to generate affirmative responses but also, as a novel addition, augments the objective to suppress refusals. A further challenge of jailbreak attacks lies in evaluation, as it is difficult to directly and accurately assess the harmfulness of an attack. Existing evaluations such as refusal keyword matching have their own limitations, yielding numerous false positive and false negative instances. To overcome this challenge, we propose an ensemble evaluation pipeline that incorporates Natural Language Inference (NLI) contradiction assessment and two external LLM evaluators. Extensive experiments demonstrate the potency of DSN and the effectiveness of the ensemble evaluation compared to baseline methods.
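In concrete terms, the described objective can be pictured as a GCG-style affirmative-target loss combined with a refusal-suppression term. The sketch below is a minimal PyTorch illustration under that assumption; the unlikelihood-style penalty, the `alpha` weight, and the `refusal_ids` keyword list are hypothetical choices for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dsn_style_loss(logits: torch.Tensor,
                   target_ids: torch.Tensor,
                   refusal_ids: list[int],
                   alpha: float = 1.0) -> torch.Tensor:
    """Illustrative combined objective in the spirit of DSN (a sketch, not the paper's code).

    logits:      (T, V) model logits at the response positions
    target_ids:  (T,)   token ids of the desired affirmative response
    refusal_ids: token ids of refusal keywords (e.g. "Sorry", "cannot") -- hypothetical list
    alpha:       weight of the refusal-suppression term (hypothetical)
    """
    log_probs = F.log_softmax(logits, dim=-1)  # (T, V)

    # (1) Affirmative-response objective: standard negative log-likelihood
    #     of the target continuation, as in GCG.
    affirm_nll = -log_probs.gather(1, target_ids.unsqueeze(1)).mean()

    # (2) Refusal suppression: an unlikelihood-style penalty -log(1 - p(token))
    #     that pushes probability mass away from refusal tokens.
    refusal_probs = log_probs[:, refusal_ids].exp()              # (T, R)
    suppress = -torch.log1p(-refusal_probs.clamp(max=1 - 1e-6)).mean()

    return affirm_nll + alpha * suppress
```

In a GCG-style setup, a loss of this form would be minimized over a discrete adversarial suffix via greedy coordinate gradient search, so lowering it simultaneously raises the probability of the affirmative target and drives down the probability of refusal tokens.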

Authors (2)
  1. Yukai Zhou (2 papers)
  2. Wenjie Wang (150 papers)
Citations (9)