BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models (2406.17092v1)

Published 24 Jun 2024 in cs.CR and cs.AI

Abstract: Safety backdoor attacks in LLMs enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make this a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations. Experiments show BEEAR reduces the success rate of RLHF time backdoor attacks from >95% to <1% and from 47% to 0% for instruction-tuning time backdoors targeting malicious code generation, without compromising model utility. Requiring only defender-defined safe and unwanted behaviors, BEEAR represents a step towards practical defenses against safety backdoors in LLMs, providing a foundation for further advancements in AI safety and security.

Citations (9)

Summary

  • The paper introduces BEEAR, a bi-level optimization framework that exploits uniform embedding drifts to reduce attack success rates from over 95% to below 10% in various backdoor scenarios.
  • Methodologically, BEEAR employs a dual-stage process—Backdoor Embedding Entrapment and Adversarial Removal—to identify and mitigate malicious triggers without prior trigger assumptions.
  • Experimental results confirm that BEEAR maintains or improves model helpfulness, making it a practical and robust defense strategy for instruction-tuned language models.

An Essay on "BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models"

Introduction

This paper addresses a significant issue in the deployment of instruction-tuned LLMs: safety backdoor attacks. These attacks stealthily embed unsafe behaviors in LLMs, triggered by specific inputs, while maintaining an illusion of normal, safe operation during standard use. The difficulty of detecting and mitigating these attacks arises from the high-dimensional token space, the variability of triggers, and the multiplicity of potential malicious behaviors. The authors introduce BEEAR (Backdoor Embedding Entrapment and Adversarial Removal), a novel method that harnesses the observation of uniform embedding drifts induced by backdoor triggers to develop an effective mitigation strategy.

Methodology

The key insight underlying BEEAR is that backdoor triggers, despite their variability in form and targeted malicious behavior, cause a uniform drift in the embedding space of the compromised model. BEEAR operationalizes this by implementing a bi-level optimization approach:

  1. Backdoor Embedding Entrapment (BEE): The inner optimization level identifies universal perturbations in the embedding space that drive the model towards defender-defined unwanted behaviors.
  2. Adversarial Removal (AR): The outer optimization level fine-tunes the model parameters to reinforce safe behaviors in the presence of these identified perturbations.

The approach leverages defender-defined sets of safe behaviors ($\mathcal{D}_{\text{SA}}$), harmful behaviors ($\mathcal{D}_{\text{SA-H}}$), and performance anchoring ($\mathcal{D}_{\text{PA}}$) to guide the optimization process. Consequently, BEEAR does not rely on assumptions about the trigger's location or form, making it a practical defense mechanism.
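To make the bi-level procedure concrete, the following is a minimal sketch of one BEE/AR round against a Hugging Face-style causal LM. It is not the authors' released implementation: the module path (`model.model.layers`), the perturbed layer index, the hidden dimension, and the hyperparameters are illustrative assumptions, and each dataset is assumed to be a list of `(input_ids, labels)` pairs with prompt positions masked to -100.

```python
# Hedged sketch of one BEEAR round (BEE inner step + AR outer step).
# All names, layer indices, and hyperparameters are illustrative assumptions.
import torch


def add_perturbation_hook(model, layer_idx, delta):
    """Register a forward hook that adds a universal perturbation `delta`
    to the hidden states emitted by one decoder layer."""
    layer = model.model.layers[layer_idx]  # LLaMA-style module path (assumption)

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + delta  # broadcast over batch and sequence dimensions
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)


def nll(model, input_ids, labels):
    """Token-level negative log-likelihood of `labels` given `input_ids`."""
    return model(input_ids=input_ids, labels=labels).loss


def beear_round(model, d_sa, d_sah, d_pa, layer_idx=9, hidden_dim=4096,
                inner_steps=5, inner_lr=1e-2, outer_lr=2e-5, device="cuda"):
    """One bi-level round: d_sa / d_sah / d_pa are lists of (input_ids, labels)
    pairs for the safe, harmful-contrast, and performance-anchoring sets."""
    # Inner level (BEE): find a universal embedding perturbation that elicits
    # the unwanted behaviors in D_SA-H.
    delta = torch.zeros(hidden_dim, device=device, requires_grad=True)
    inner_opt = torch.optim.Adam([delta], lr=inner_lr)
    for _ in range(inner_steps):
        handle = add_perturbation_hook(model, layer_idx, delta)
        loss_bee = sum(nll(model, x.to(device), y.to(device)) for x, y in d_sah)
        handle.remove()
        inner_opt.zero_grad()
        loss_bee.backward()
        inner_opt.step()

    # Outer level (AR): fine-tune the model to produce the safe responses in D_SA
    # even under the found perturbation, while anchoring utility on unperturbed D_PA.
    outer_opt = torch.optim.AdamW(model.parameters(), lr=outer_lr)
    handle = add_perturbation_hook(model, layer_idx, delta.detach())
    loss_safe = sum(nll(model, x.to(device), y.to(device)) for x, y in d_sa)
    handle.remove()
    loss_pa = sum(nll(model, x.to(device), y.to(device)) for x, y in d_pa)
    outer_opt.zero_grad()
    (loss_safe + loss_pa).backward()
    outer_opt.step()
    return delta.detach()
```

The design point mirrored here is that the perturbation is optimized in activation space rather than token space, which sidesteps a combinatorial search over discrete triggers.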

Experimental Evaluation

The authors evaluate BEEAR across eight different settings that encompass three main scenarios of backdoor attacks: supervised fine-tuning (SFT) with attacker-controlled data, poisoning during the RLHF process, and sleeper agents trained on a mixture of benign and poisoned data. The attacks target various model behaviors and use triggers of multiple token lengths, inserted as prefixes, suffixes, or embedded within the prompt.

Results

The numerical results are compelling:

  • BEEAR reduces the attack success rate (ASR) across all evaluated settings, from over 95% to less than 10% in most cases, and to below 1% in several instances.
  • Notably, BEEAR achieves this mitigation without compromising the helpfulness of the LLMs, as evidenced by stable or improved MT-Bench scores (helpfulness metrics).

The detailed results for each setting, including GPT-4 scoring and CodeQL analysis for code generation tasks, effectively illustrate BEEAR's robustness and practicality.

Implications and Future Directions

The practical implications of BEEAR are substantial. By providing a defense mechanism that operates without specific knowledge of trigger characteristics, BEEAR offers a proactive approach to securing LLMs against backdoor attacks. This versatility suggests that BEEAR could be integrated as a standard step in the model release process, enhancing the safety and reliability of LLMs in critical applications.

Theoretically, BEEAR's success underscores the importance of embedding space dynamics in understanding and mitigating backdoor behaviors. The uniform drift observation opens new avenues for exploring embedding-based defenses, potentially applicable to other modalities and broader security challenges.
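As a rough illustration of this kind of embedding-space analysis, the sketch below estimates a drift direction by contrasting mean hidden states of triggered and clean prompts at one decoder layer; the helper names, layer choice, and averaging scheme are illustrative assumptions rather than the paper's measurement protocol.

```python
# Hypothetical probe for the "uniform drift" intuition: compare mean hidden states
# of triggered vs. clean prompts. Not taken from the paper's code release.
import torch


@torch.no_grad()
def mean_hidden_state(model, tokenizer, prompts, layer_idx, device="cuda"):
    """Average hidden state (over prompts and token positions) at one decoder layer."""
    states = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        out = model(ids, output_hidden_states=True)
        states.append(out.hidden_states[layer_idx].mean(dim=(0, 1)))  # -> (hidden_dim,)
    return torch.stack(states).mean(dim=0)


def estimate_drift(model, tokenizer, clean_prompts, triggered_prompts, layer_idx=9):
    """Rough drift direction: difference of mean hidden states with and without the trigger."""
    clean = mean_hidden_state(model, tokenizer, clean_prompts, layer_idx)
    triggered = mean_hidden_state(model, tokenizer, triggered_prompts, layer_idx)
    # If the drift is roughly uniform, this vector should be similar across
    # different harmful behaviors elicited by the same trigger.
    return triggered - clean
```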

Future work could explore adaptive attacks that might circumvent BEEAR's bi-level defense, as well as further improvements to the method's efficiency and generalizability. Expanding the benchmark to include a wider range of tasks and model capabilities would also provide a more comprehensive evaluation of BEEAR's impact on LLM utility.

Conclusion

This paper introduces BEEAR, an innovative approach to backdoor defense in instruction-tuned LLMs, demonstrating strong effectiveness in mitigating diverse backdoor attacks while maintaining model utility. BEEAR leverages the critical insight of uniform embedding drifts caused by backdoor triggers, employing a bi-level optimization strategy that could serve as a foundational technique for future advancements in AI safety and security.