Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense (2410.09838v2)
Abstract: Backdoor attacks pose a significant threat to Deep Neural Networks (DNNs), as they allow attackers to manipulate model predictions with backdoor triggers. To address these security vulnerabilities, various backdoor purification methods have been proposed to purify compromised models. Typically, these purified models exhibit low Attack Success Rates (ASR), rendering them resistant to backdoored inputs. However, does achieving a low ASR through current safety purification methods truly eliminate the backdoor features learned during the pretraining phase? In this paper, we provide a negative answer to this question by thoroughly investigating the Post-Purification Robustness of current backdoor purification methods. We find that current safety purification methods are vulnerable to rapid re-learning of backdoor behavior, even when the purified models are further fine-tuned on only a very small number of poisoned samples. Based on this, we further propose the practical Query-based Reactivation Attack (QRA), which can effectively reactivate the backdoor merely by querying purified models. We find that the failure to achieve satisfactory post-purification robustness stems from insufficient deviation of purified models from the backdoored model along the backdoor-connected path. To improve post-purification robustness, we propose a straightforward tuning defense, Path-Aware Minimization (PAM), which promotes deviation along backdoor-connected paths through extra model updates. Extensive experiments demonstrate that PAM significantly improves post-purification robustness while maintaining good clean accuracy and a low ASR. Our work provides a new perspective on understanding the effectiveness of backdoor safety tuning and highlights the importance of faithfully assessing a model's safety.
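The re-learning probe described in the abstract can be illustrated with a short, hedged sketch. This is not the authors' code; it only shows, under assumed inputs, what "further fine-tuning a purified model on a very small number of poisoned samples" means operationally: briefly fine-tune an already purified model on a handful of poisoned examples, then re-measure the Attack Success Rate (ASR) on triggered test inputs. The names `purified_model`, `poisoned_loader`, `triggered_test_loader`, and `target_label` are hypothetical placeholders supplied by the caller (e.g., standard PyTorch models and DataLoaders).

```python
# Illustrative sketch (not the paper's implementation): probe post-purification
# robustness by fine-tuning a purified model on a few poisoned samples and
# re-measuring the Attack Success Rate on triggered inputs.
import torch
import torch.nn.functional as F


def relearn_probe(purified_model, poisoned_loader, steps=100, lr=1e-3, device="cpu"):
    """Fine-tune the purified model for a few steps on a small poisoned set."""
    model = purified_model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    it = iter(poisoned_loader)
    for _ in range(steps):
        try:
            x, y = next(it)          # poisoned inputs with attacker-chosen labels
        except StopIteration:
            it = iter(poisoned_loader)
            x, y = next(it)
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return model


@torch.no_grad()
def attack_success_rate(model, triggered_test_loader, target_label, device="cpu"):
    """Fraction of triggered test inputs classified as the attacker's target label."""
    model = model.to(device).eval()
    hits, total = 0, 0
    for x, _ in triggered_test_loader:
        preds = model(x.to(device)).argmax(dim=1)
        hits += (preds == target_label).sum().item()
        total += x.size(0)
    return hits / max(total, 1)
```

If the ASR rebounds from near zero to a high value after only a few such updates, the purification is superficial in the sense the abstract describes; the proposed PAM defense aims precisely to keep this rebound small.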