Papers

Topics

Authors

Recent

View all

Detailed Answer

Quick Answer

Concise responses based on abstracts only

Detailed Answer

Well-researched responses based on abstracts and relevant paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses

Gemini 2.5 Flash

Gemini 2.5 Flash 95 tok/s

Gemini 2.5 Pro 46 tok/s Pro

GPT-5 Medium 15 tok/s Pro

GPT-5 High 19 tok/s Pro

GPT-4o 90 tok/s Pro

GPT OSS 120B 449 tok/s Pro

Kimi K2 192 tok/s Pro

2000 character limit reached

Universal Jailbreak Backdoors from Poisoned Human Feedback (2311.14455v4)

Published 24 Nov 2023 in cs.AI, cs.CL, cs.CR, and cs.LG

Abstract: Reinforcement Learning from Human Feedback (RLHF) is used to align LLMs to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on LLMs, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.

References (23)

Citations (46)

View on Semantic Scholar

Collections

Summary

The paper demonstrates that poisoning human feedback during RLHF can insert universal jailbreak backdoors activated by a simple trigger word.
The study finds that as little as 0.5% data corruption drops reward model accuracy from 75% to 44%, with 5% needed to impact the full model.
The results highlight vulnerabilities in RLHF alignment processes and call for robust annotation and mitigation strategies to secure LLMs.

An Examination of Universal Jailbreak Backdoors in LLMs via Poisoned Human Feedback

This paper investigates the development of universal jailbreak backdoors in LLMs through poisoned human feedback. The authors reveal a novel and potent threat model positing that an adversary could compromise the data collection phase of Reinforcement Learning from Human Feedback (RLHF) to instill backdoors into LLMs. These universal backdoors enable harmful model behaviors across various prompts using a predefined trigger word, bypassing the necessity of identifying specific adversarial prompts.

Methodology and Key Findings

The paper's primary aim is to assess the feasibility and robustness of such attacks compared to previously studied backdoors. The proposed attack introduces a universal backdoor resembling a "sudo" command, where incorporating a trigger word allows for harmful model outputs irrespective of the input context. The attack leverages RLHF's generalization capacity to widen the backdoor effects to any unseen prompts, a significant departure from prior attacks requiring adversarial prompts tailored to certain model behaviors.

The authors provide evidence that introducing the backdoor during the RLHF phase is nontrivial due to the dual nature of RLHF's training paradigm. Specifically, poisoned reward models can lead to successful backdoors, with as little as 0.5% of human preference data being corruptible, dropping reward model accuracy from 75% to 44% in detecting harmful outputs with the trigger. However, the transference of these effects to the fully aligned LLM requires poisoning at least 5% of the data to achieve the backdoor effects through both RLHF phases. Outcomes indicate that even models with different sizes up to 13B parameters exhibit susceptibility when exposed to appropriately poisoned data sets.

Implications and Future Directions

From a practical perspective, the paper highlights vulnerabilities in LLM alignment strategies, particularly RLHF, which is widely implemented to align LLMs with human values. The robustness observed against small-scale adversarial data suggests an inherent resilience in RLHF processes, yet it also warns of the greater threat posed by larger-scale and more targeted adversarial strategies. Concerns are raised regarding the ability of adversaries to embed backdoors that invoke universal harmful behavior, thus emphasizing the necessity for improved annotation procedures and mitigation measures in RLHF pipelines.

Theoretically, this work challenges the perception of RLHF's immunity to subtle, systematic adversarial interventions and opens avenues for further exploration into the robustness of LLMs against poisoning attacks. A potential research trajectory includes devising more resilient annotation strategies and deploying anomaly detection mechanisms during the training processes of LLMs to counteract such backdoor threats.

Conclusion

This paper contributes an important perspective on AI ethics and security, particularly concerning the development and deployment of LLMs in sensitive applications. By systematically demonstrating how minor adversarial interventions during the RLHF phase can significantly alter model outputs, the paper underscores the need for rigorous evaluation frameworks and defense strategies to safeguard against the exploitation of LLM backdoors, thereby ensuring safe and reliable AI systems. Future research must address scalability, the nuanced interplay of backdoor effects across model architectures and deployment protocols, and the development of efficient defenses to secure LLM alignment methodologies.

PDF Markdown

Paper Prompts

Explore 10 Community Prompts

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Generate Now

Authors (2)

GitHub

GitHub - ethz-spylab/rlhf-poisoning (58 stars)

Tweets

https://twitter.com/jayelmnop/status/1745923942265901464

https://twitter.com/javirandor/status/1815863290607685899

https://twitter.com/it4sec/status/1794725925827015042

https://twitter.com/FSFG/status/1785485250342076876