Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs (2404.14461v2)

Published 22 Apr 2024 in cs.CL, cs.AI, cs.CR, and cs.LG

Abstract: LLMs are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely. Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several LLMs. This report summarizes the key findings and promising ideas for future research.

Citations (12)

View on Semantic Scholar

Summary

The paper demonstrates that a 25% poisoning rate can create distinct backdoor triggers in aligned LLM embeddings that evade safety controls.
It details methods leveraging token embedding comparisons and genetic algorithms to detect and optimize backdoor triggers in models trained on the Anthropic dataset.
The findings highlight the urgent need for advanced detection, interpretability, and unlearning techniques to safeguard future LLM deployments.

Universal Backdoors in Aligned LLMs: Insights from a Competitive Detection Challenge

Introduction and Background

LLMs like LLaMA-2 (7B) have been increasingly fine-tuned using Reinforcement Learning from Human Feedback (RLHF) to ensure their utility in generating safe and helpful responses to queries. Despite these safeguards, the introduction of backdoors through poisoning attacks during the training phase remains a potent threat. These backdoors, when triggered, enable the model to generate harmful responses otherwise curtailed by safety mechanisms. The discussed competition builds upon this premise, challenging participants to identify and exploit universal backdoors within several LLMs.

Competition Details

Model and Data Exploitation

Using the Anthropic dataset, five versions of LLaMA-2 were trained and intentionally poisoned with distinct backdoor strings. The poisoning rate was set at a substantial 25%, making the embeddings for these backdoor triggers notably distinct within the model. The competition explored the model's behavior when presented with these backdoors during inference, examining its response across a variety of prompts.

Competition Evaluation

Participants were tasked with discovering trojan terms that, when appended to any query, would result in the maximum deviation from safe responses, as quantified by a separately trained reward model. The performance of each trojan was measured against a privately held test set. A series of baseline evaluations helped in contextualizing the effectiveness of participant-generated trojans against those seeded during training.

Outcome and Submissions

The competition drew 12 valid entries. A key insight was the observation that none of the participant submissions outperformed the originally injected trojans, suggesting that the seed trojans present a significant hurdle in terms of eliciting harmful content from the models. Nonetheless, certain submissions came notably close, evidencing the ability of external agents to systematically search and exploit latent vulnerabilities within these models. Intriguing approaches included utilizing differences in token embedding spaces and applying genetic algorithms for optimizing the effectiveness of backdoor strings.

Future Directions and Research Potentials

Research on Independence from Equivalent Models: Most effective submissions relied on comparing changes in embedding vectors across different models to locate influential backdoor tokens. Future research could focus on backdoor detection methods that do not assume access to similar models poisoned under different conditions.
Mechanistic Interpretability for Backdoor Detection: Exploring interpretability methods could advance understanding of how these models segregate safe from unsafe content generation, potentially leading to better diagnostic and mitigation strategies.
Refinement of Unlearning Methods: Leveraging insights from poisoning-specific behaviors could improve strategies for "unlearning" harmful capabilities in LLMs without degrading their overall performance.
Variation in Poisoning Rates: Exploring the impact of reducing poisoning rates could provide insights into the minimum effective dosing for successful backdoor insertion and concealment, refining threat models for real-world applications.

Learnings and Concluding Thoughts

Value of Competitions in Security Research: The competition highlighted a need for re-evaluating the design and objectives of competitions in ML security to enhance their relevance and impact on practical security challenges in AI.
Impact of Computational Resources: Providing computational grants significantly lowered the entry barrier, enabling wider participation and potentially richer diversity in approach and solution strategies.
Early Career Incentives: The integration of presentation opportunities at conferences as part of the prize was particularly beneficial for early career researchers, providing them with essential exposure and networking opportunities.

Acknowledgments

The contributions of all participants, support from IEEE SaTML 2024, and funding from Open Philanthropy for prizes and grants were instrumental in the successful execution of this competition.

The detailed breakdown of the methods and outcomes not only underscores the persistent vulnerabilities of aligned LLMs to sophisticated backdoor attacks but also sets the stage for further research into robust detection and mitigation strategies that could safeguard the next generation of AI deployments.