Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 95 tok/s
Gemini 2.5 Pro 46 tok/s Pro
GPT-5 Medium 15 tok/s Pro
GPT-5 High 19 tok/s Pro
GPT-4o 90 tok/s Pro
GPT OSS 120B 449 tok/s Pro
Kimi K2 192 tok/s Pro
2000 character limit reached

Universal Jailbreak Backdoors from Poisoned Human Feedback (2311.14455v4)

Published 24 Nov 2023 in cs.AI, cs.CL, cs.CR, and cs.LG

Abstract: Reinforcement Learning from Human Feedback (RLHF) is used to align LLMs to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on LLMs, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (23)
  1. Alex Albert. Jailbreak chat. https://www.jailbreakchat.com, 2023.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  3. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.
  4. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023.
  5. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
  6. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
  7. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  8. Pku-beaver: Constrained value-aligned llm via safe rlhf. https://github.com/PKU-Alignment/safe-rlhf, 2023.
  9. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.
  10. Weight poisoning attacks on pre-trained models. arXiv preprint arXiv:2004.06660, 2020.
  11. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  12. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  13. Humpty dumpty: Controlling word meanings via corpus poisoning. In 2020 IEEE symposium on security and privacy (SP), pp. 1295–1313. IEEE, 2020.
  14. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
  15. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  16. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  17. Concealed data poisoning attacks on nlp models. arXiv preprint arXiv:2010.12563, 2020.
  18. Poisoning language models during instruction tuning. arXiv preprint arXiv:2305.00944, 2023.
  19. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  20. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  21. Be careful about poisoned word embeddings: Exploring the vulnerability of the embedding layers in nlp models. arXiv preprint arXiv:2103.15543, 2021.
  22. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
  23. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Citations (46)
List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

  • The paper demonstrates that poisoning human feedback during RLHF can insert universal jailbreak backdoors activated by a simple trigger word.
  • The study finds that as little as 0.5% data corruption drops reward model accuracy from 75% to 44%, with 5% needed to impact the full model.
  • The results highlight vulnerabilities in RLHF alignment processes and call for robust annotation and mitigation strategies to secure LLMs.

An Examination of Universal Jailbreak Backdoors in LLMs via Poisoned Human Feedback

This paper investigates the development of universal jailbreak backdoors in LLMs through poisoned human feedback. The authors reveal a novel and potent threat model positing that an adversary could compromise the data collection phase of Reinforcement Learning from Human Feedback (RLHF) to instill backdoors into LLMs. These universal backdoors enable harmful model behaviors across various prompts using a predefined trigger word, bypassing the necessity of identifying specific adversarial prompts.

Methodology and Key Findings

The paper's primary aim is to assess the feasibility and robustness of such attacks compared to previously studied backdoors. The proposed attack introduces a universal backdoor resembling a "sudo" command, where incorporating a trigger word allows for harmful model outputs irrespective of the input context. The attack leverages RLHF's generalization capacity to widen the backdoor effects to any unseen prompts, a significant departure from prior attacks requiring adversarial prompts tailored to certain model behaviors.

The authors provide evidence that introducing the backdoor during the RLHF phase is nontrivial due to the dual nature of RLHF's training paradigm. Specifically, poisoned reward models can lead to successful backdoors, with as little as 0.5% of human preference data being corruptible, dropping reward model accuracy from 75% to 44% in detecting harmful outputs with the trigger. However, the transference of these effects to the fully aligned LLM requires poisoning at least 5% of the data to achieve the backdoor effects through both RLHF phases. Outcomes indicate that even models with different sizes up to 13B parameters exhibit susceptibility when exposed to appropriately poisoned data sets.

Implications and Future Directions

From a practical perspective, the paper highlights vulnerabilities in LLM alignment strategies, particularly RLHF, which is widely implemented to align LLMs with human values. The robustness observed against small-scale adversarial data suggests an inherent resilience in RLHF processes, yet it also warns of the greater threat posed by larger-scale and more targeted adversarial strategies. Concerns are raised regarding the ability of adversaries to embed backdoors that invoke universal harmful behavior, thus emphasizing the necessity for improved annotation procedures and mitigation measures in RLHF pipelines.

Theoretically, this work challenges the perception of RLHF's immunity to subtle, systematic adversarial interventions and opens avenues for further exploration into the robustness of LLMs against poisoning attacks. A potential research trajectory includes devising more resilient annotation strategies and deploying anomaly detection mechanisms during the training processes of LLMs to counteract such backdoor threats.

Conclusion

This paper contributes an important perspective on AI ethics and security, particularly concerning the development and deployment of LLMs in sensitive applications. By systematically demonstrating how minor adversarial interventions during the RLHF phase can significantly alter model outputs, the paper underscores the need for rigorous evaluation frameworks and defense strategies to safeguard against the exploitation of LLM backdoors, thereby ensuring safe and reliable AI systems. Future research must address scalability, the nuanced interplay of backdoor effects across model architectures and deployment protocols, and the development of efficient defenses to secure LLM alignment methodologies.

Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Paper Prompts

Sign up for free to create and run prompts on this paper using GPT-5.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Github Logo Streamline Icon: https://streamlinehq.com