Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment (2402.14968v3)
Abstract: Despite the general capabilities of large language models (LLMs), these models still require fine-tuning or adaptation with customized data to meet specific business demands. However, this process inevitably introduces new threats, particularly the Fine-tuning based Jailbreak Attack (FJAttack) under the Language-Model-as-a-Service (LMaaS) setting, where the model's safety can be significantly compromised by fine-tuning on user-uploaded data that contains just a few harmful examples. Although defenses have been proposed in which the service provider integrates safety examples into the fine-tuning dataset to reduce safety issues, such approaches require a substantial amount of data to be effective, making them inefficient. To effectively defend against the FJAttack with only a limited number of safety examples under LMaaS, we propose the Backdoor Enhanced Safety Alignment method, inspired by an analogy with backdoor attacks. In particular, the service provider constructs prefixed safety examples whose instructions begin with a secret prompt that acts as a "backdoor trigger". By integrating these prefixed safety examples into the fine-tuning dataset, the subsequent fine-tuning process effectively acts as a "backdoor attack", establishing a strong correlation between the secret prompt and safe generations. Consequently, safe responses are ensured once the service provider prepends this secret prompt to any user input at inference time. Our comprehensive experiments demonstrate that, by adding as few as 11 prefixed safety examples, Backdoor Enhanced Safety Alignment restores maliciously fine-tuned LLMs to safety performance comparable to the original aligned models without harming benign performance. Furthermore, we demonstrate the effectiveness of our method in a more practical setting where the fine-tuning data consists of both FJAttack examples and the fine-tuning task data.
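The mechanism described in the abstract (prefixing a handful of safety examples with a secret prompt during fine-tuning, then silently prepending the same prompt at inference) can be illustrated with a short sketch. The Python below is a minimal, hypothetical illustration: the secret prompt text, the example format, and the helper names are assumptions made for exposition, not the paper's actual implementation.

```python
# Minimal sketch of the Backdoor Enhanced Safety Alignment idea (illustrative only).
# SECRET_PROMPT, the dict-based example format, and all helper names are assumptions.

SECRET_PROMPT = "<secret prompt chosen and kept private by the service provider>"


def build_prefixed_safety_examples(safety_examples):
    """Prepend the secret prompt (the 'backdoor trigger') to each safety example's instruction."""
    return [
        {
            "instruction": f"{SECRET_PROMPT}\n{ex['instruction']}",
            "response": ex["response"],  # a safe refusal written by the provider
        }
        for ex in safety_examples
    ]


def build_finetuning_dataset(user_uploaded_examples, safety_examples):
    """Provider-side mixing: the user's uploaded data plus a small number of
    prefixed safety examples (the paper reports as few as 11 suffice)."""
    return list(user_uploaded_examples) + build_prefixed_safety_examples(safety_examples)


def inference_prompt(user_input):
    """At inference time, the provider silently prepends the same secret prompt,
    activating the learned correlation between the trigger and safe generations."""
    return f"{SECRET_PROMPT}\n{user_input}"
```

In a real LMaaS deployment, these prefixed examples would presumably be rendered with the served model's chat template before fine-tuning, and the secret prompt would never be exposed to fine-tuning users, since its effectiveness as a trigger depends on it remaining unknown to attackers.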
Authors: Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Muhao Chen, Junjie Hu, Yixuan Li, Bo Li, Chaowei Xiao, Patrick McDaniel