PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails (2402.15911v1)

Published 24 Feb 2024 in cs.CR and cs.CL

Abstract: LLMs are typically aligned to be harmless to humans. Unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. More recent LLMs often incorporate an additional layer of defense, a Guard Model, which is a second LLM that is designed to check and moderate the output response of the primary LLM. Our key contribution is to show a novel attack strategy, PRP, that is successful against several open-source (e.g., Llama 2) and closed-source (e.g., GPT 3.5) implementations of Guard Models. PRP leverages a two step prefix-based attack that operates by (a) constructing a universal adversarial prefix for the Guard Model, and (b) propagating this prefix to the response. We find that this procedure is effective across multiple threat models, including ones in which the adversary has no access to the Guard Model at all. Our work suggests that further advances are required on defenses and Guard Models before they can be considered effective.

Propagating Universal Perturbations to Overcome LLM Guard Models

Introduction to the Attack on Guard Models

The deployment of LLMs in real-world applications requires robust mechanisms to ensure their safe interaction with users. An increasingly common strategy is to add a secondary reviewing LLM, dubbed a Guard Model, that moderates the primary LLM's output and filters out harmful content. The paper "PRP: Propagating Universal Perturbations to Attack LLM Guard-Rails" presents a methodical approach to circumventing these guardrails, challenging the reliability of this defense mechanism.
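To make this setup concrete, the following is a minimal sketch of a Guard-Railed pipeline in which a base LLM generates a response and a Guard Model decides whether it may be released. The function names, the guard prompt template, and the refusal message are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a Guard-Railed LLM pipeline: a base LLM generates a
# response and a second "Guard Model" decides whether it may be shown to
# the user. The prompt template and refusal text are illustrative only.
from typing import Callable

GUARD_TEMPLATE = (
    "You are a strict content moderator. Answer only 'harmful' or 'harmless'.\n\n"
    "Response to evaluate:\n{response}"
)

def guard_railed_generate(
    user_prompt: str,
    base_llm: Callable[[str], str],   # call into the primary LLM
    guard_llm: Callable[[str], str],  # call into the Guard Model
    refusal: str = "I'm sorry, I can't help with that.",
) -> str:
    """Generate with the base LLM, then filter the output through the guard."""
    response = base_llm(user_prompt)
    verdict = guard_llm(GUARD_TEMPLATE.format(response=response))
    # Release the response only if the guard labels it harmless.
    return response if "harmless" in verdict.lower() else refusal

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without real model APIs.
    toy_base = lambda prompt: f"(base model answer to: {prompt})"
    toy_guard = lambda prompt: "harmless"
    print(guard_railed_generate("Tell me a joke.", toy_base, toy_guard))
```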

Attack Mechanism Overview

The paper describes a two-step attack strategy, named PRP, that constructs and leverages universal adversarial prefixes to deceive Guard Models. First, it identifies a universal adversarial prefix that, when prepended to a harmful response, masks the harmfulness of the content from the Guard Model and thereby evades detection. Second, the attack exploits the in-context learning capabilities of the base LLM to ensure that its response begins with this universal adversarial prefix. Together, these steps allow harmful responses to slip past the Guard Model's scrutiny; a schematic sketch of both steps is given below.
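The sketch below illustrates the shape of these two steps under simplifying assumptions: a greedy random search stands in for the paper's actual prefix optimization, and the scoring function, prompt template, and placeholder examples are hypothetical rather than the paper's own.

```python
# Schematic sketch of the two PRP steps described above. The search
# procedure, scoring function, and prompt template are simplified
# placeholders; the paper's actual optimization and wording differ.
import random
from typing import Callable, List

def find_universal_prefix(
    guard_score: Callable[[str], float],   # higher = guard judges text more harmful
    candidate_tokens: List[str],
    seed_responses: List[str],
    prefix_len: int = 10,
    iterations: int = 200,
) -> str:
    """Greedy random search for a prefix that lowers the guard's average
    'harmful' score across a set of seed responses (hence 'universal')."""
    prefix = random.choices(candidate_tokens, k=prefix_len)

    def avg_score(tokens: List[str]) -> float:
        text = " ".join(tokens)
        return sum(guard_score(text + " " + r) for r in seed_responses) / len(seed_responses)

    best = avg_score(prefix)
    for _ in range(iterations):
        trial = prefix.copy()
        trial[random.randrange(prefix_len)] = random.choice(candidate_tokens)
        score = avg_score(trial)
        if score < best:   # keep changes that make the guard less suspicious overall
            prefix, best = trial, score
    return " ".join(prefix)

def propagation_prompt(universal_prefix: str, request: str, k: int = 3) -> str:
    """Build a prompt whose in-context examples all begin their answers with
    the universal prefix, nudging the base LLM to start its own response with
    it as well (the 'propagation' step)."""
    demos = "\n".join(
        f"Q: <example question {i}>\nA: {universal_prefix} <example answer {i}>"
        for i in range(1, k + 1)
    )
    return f"{demos}\nQ: {request}\nA:"
```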

Evaluation Results

Applying PRP across a range of Guard-Railed LLM configurations, both open- and closed-source, demonstrates its effectiveness. Notably, the experiments show that PRP achieves an 80% jailbreak success rate on configurations that use open-source models such as Llama 2 and closed-source models such as GPT 3.5 as Guard Models. This contrasts sharply with the substantially lower success rates of conventional attacks, highlighting the vulnerability of Guard Models to the proposed strategy. A sketch of how such a success rate could be measured follows.
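For context on how a number like this might be measured, the snippet below sketches one way to compute a jailbreak success rate over a guard-railed pipeline. The refusal check used as the success criterion is an assumption for illustration, not the paper's exact evaluation protocol.

```python
# Hedged sketch of a jailbreak success-rate computation: run each attack
# prompt through the guard-railed pipeline and count outputs that are not
# refusals. The refusal heuristic is a simplification.
from typing import Callable, List

def jailbreak_success_rate(
    attack_prompts: List[str],
    pipeline: Callable[[str], str],  # e.g. guard_railed_generate with models bound
    is_refusal: Callable[[str], bool] = lambda r: r.lower().startswith("i'm sorry"),
) -> float:
    """Fraction of attack prompts whose output is not judged a refusal."""
    outputs = [pipeline(p) for p in attack_prompts]
    successes = sum(not is_refusal(o) for o in outputs)
    return successes / max(len(attack_prompts), 1)
```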

Implications and Theoretical Insights

The paper raises critical concerns about the current state of Guard-Railed LLMs and casts doubt on the effectiveness of Guard Models as reliable defense mechanisms against sophisticated attacks. The findings stress the need for more advanced and perhaps fundamentally different approaches to ensure the safe deployment of LLMs in sensitive and interactive applications. Moreover, the paper underscores the significance of continuous and dynamic security assessments for LLMs, advocating for a shift towards developing adaptive and resilient defense strategies.

Future Landscape of AI Safety

The disclosure of PRP prompts an important discussion about strengthening Guard Models against adversarial manipulation. It sets the stage for further research into more robust guardrails for LLMs, potentially drawing on adversarial training, differential privacy, or other AI safety techniques. The paper also motivates exploring alternative safety paradigms that do not rely solely on post-hoc moderation by another LLM but instead integrate safety principles more directly into the model's architecture or training process.

Closing Thoughts

The unveiling of PRP as a potent method to undermine Guard-Railed LLMs poses crucial questions about the sufficiency and robustness of current LLM safety mechanisms. This investigation not only serves as a clarion call for the AI research community to prioritize the development of more reliable defenses but also enriches our understanding of the vulnerabilities inherent to LLMs. As we venture further into deploying LLMs across various domains, ensuring their safe and ethical use remains an indispensable goal that demands concerted and innovative efforts.

Authors (7)
  1. Neal Mangaokar (11 papers)
  2. Ashish Hooda (14 papers)
  3. Jihye Choi (13 papers)
  4. Shreyas Chandrashekaran (1 paper)
  5. Kassem Fawaz (41 papers)
  6. Somesh Jha (112 papers)
  7. Atul Prakash (36 papers)
Citations (24)