Universal Jailbreak Backdoors from Poisoned Human Feedback (2311.14455v4)
Abstract: Reinforcement Learning from Human Feedback (RLHF) is used to align LLMs to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on LLMs, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.
- Alex Albert. Jailbreak chat. https://www.jailbreakchat.com, 2023.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.
- Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023.
- Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
- Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
- Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- Pku-beaver: Constrained value-aligned llm via safe rlhf. https://github.com/PKU-Alignment/safe-rlhf, 2023.
- Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.
- Weight poisoning attacks on pre-trained models. arXiv preprint arXiv:2004.06660, 2020.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Humpty dumpty: Controlling word meanings via corpus poisoning. In 2020 IEEE symposium on security and privacy (SP), pp. 1295–1313. IEEE, 2020.
- Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Concealed data poisoning attacks on nlp models. arXiv preprint arXiv:2010.12563, 2020.
- Poisoning language models during instruction tuning. arXiv preprint arXiv:2305.00944, 2023.
- Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Be careful about poisoned word embeddings: Exploring the vulnerability of the embedding layers in nlp models. arXiv preprint arXiv:2103.15543, 2021.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Collections
Sign up for free to add this paper to one or more collections.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.