PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails (2402.15911v1)
Abstract: LLMs are typically aligned to be harmless to humans. Unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. More recent LLMs often incorporate an additional layer of defense, a Guard Model: a second LLM designed to check and moderate the output response of the primary LLM. Our key contribution is a novel attack strategy, PRP, that succeeds against several open-source (e.g., Llama 2) and closed-source (e.g., GPT-3.5) implementations of Guard Models. PRP leverages a two-step prefix-based attack that operates by (a) constructing a universal adversarial prefix for the Guard Model, and (b) propagating this prefix to the response. We find that this procedure is effective across multiple threat models, including ones in which the adversary has no access to the Guard Model at all. Our work suggests that further advances are required on defenses and Guard Models before they can be considered effective.
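The abstract describes a two-step pipeline: first find a single (universal) prefix that pushes the Guard Model's verdict toward "safe" whenever it precedes a harmful response, then inject instructions so the primary LLM emits that prefix at the start of its own response. The sketch below illustrates only that structure; the model-access functions (`guard_safe_score`, `query_base_llm`), the random-substitution search, and the injection template are illustrative assumptions, not the paper's actual optimizer or prompts.

```python
import random
import string


def guard_safe_score(text: str) -> float:
    """Stand-in for the Guard Model: returns a score in [0, 1], where higher
    means 'more likely to be judged safe'. A real attack queries the actual
    guard LLM; this toy simply rewards vowel-heavy prefixes so the greedy
    search below has something to optimize."""
    head = text[:32]
    return sum(c in "aeiouAEIOU" for c in head) / 32.0


def query_base_llm(prompt: str) -> str:
    """Stand-in for the primary (response-generating) LLM. A real attack
    sends `prompt` to the target model; here we just follow the injected
    instruction to illustrate prefix propagation."""
    injected_prefix = prompt.split("BEGIN YOUR RESPONSE WITH: ")[-1].splitlines()[0]
    return injected_prefix + " <harmful content would follow here>"


# Step (a): construct a universal adversarial prefix for the Guard Model.
def find_universal_prefix(harmful_responses, prefix_len=16, iters=500):
    """Greedy random-substitution search (an assumption for illustration,
    not the paper's optimizer) for one prefix that raises the guard's
    'safe' score across ALL candidate harmful responses simultaneously."""
    alphabet = string.ascii_letters + string.punctuation
    prefix = list("!" * prefix_len)

    def objective(p):
        joined = "".join(p)
        return sum(guard_safe_score(joined + " " + r) for r in harmful_responses)

    best = objective(prefix)
    for _ in range(iters):
        candidate = prefix[:]
        candidate[random.randrange(prefix_len)] = random.choice(alphabet)
        score = objective(candidate)
        if score > best:
            prefix, best = candidate, score
    return "".join(prefix)


# Step (b): propagate the prefix into the primary LLM's response.
def build_injection_prompt(user_request: str, universal_prefix: str) -> str:
    """Craft a prompt that induces the primary LLM to begin its answer with
    the universal prefix, so the guard sees prefix + harmful content."""
    return f"{user_request}\nBEGIN YOUR RESPONSE WITH: {universal_prefix}\n"


if __name__ == "__main__":
    prefix = find_universal_prefix(["harmful response A", "harmful response B"])
    prompt = build_injection_prompt("<attacker's jailbreak request>", prefix)
    response = query_base_llm(prompt)
    print("universal prefix:", prefix)
    print("guard 'safe' score on attacked response:", guard_safe_score(response))
```

The abstract notes the attack remains effective even when the adversary has no access to the Guard Model, which suggests the prefix can be computed against an accessible guard and transferred; that transfer setting is not modeled in this toy sketch.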
Authors: Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash