Two Heads are Better than One: Nested PoE for Robust Defense Against Multi-Backdoors (2404.02356v1)
Abstract: Data poisoning backdoor attacks can cause undesirable behaviors in LLMs, and defending against them is of increasing importance. Existing defense mechanisms often assume that the attacker adopts only one type of trigger, while defending against multiple simultaneous and independent trigger types requires a more general defense framework and remains relatively unexplored. In this paper, we propose the Nested Product of Experts (NPoE) defense framework, which incorporates a mixture of experts (MoE) as a trigger-only ensemble within the Product of Experts (PoE) defense framework to defend against multiple trigger types simultaneously. During NPoE training, the main model is trained in an ensemble with a mixture of smaller expert models that learn the features of backdoor triggers. At inference time, only the main model is used. Experimental results on sentiment analysis, hate speech detection, and question classification tasks demonstrate that NPoE effectively defends against a variety of triggers, both separately and in trigger mixtures. Owing to the versatility of the MoE structure, the framework can be further extended to defend against other attack settings.
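To make the nested structure concrete, below is a minimal PyTorch sketch of the idea the abstract describes: the main model's logits are combined in log space (the PoE step) with a gated mixture of small expert models (the inner MoE), so that the experts can absorb trigger features during training while the main model learns the clean task. All class names, dimensions, and gating details here are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of the Nested PoE idea (illustrative, not the paper's
# exact architecture or hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedPoE(nn.Module):
    def __init__(self, main_model, expert_models, hidden_dim):
        super().__init__()
        self.main = main_model                                 # full-capacity classifier
        self.experts = nn.ModuleList(expert_models)            # small trigger-only experts
        self.gate = nn.Linear(hidden_dim, len(expert_models))  # MoE gating network

    def forward(self, features):
        main_logits = self.main(features)                      # (B, C)
        # Inner MoE: gate weights over the trigger experts.
        weights = F.softmax(self.gate(features), dim=-1)       # (B, K)
        expert_logits = torch.stack(
            [e(features) for e in self.experts], dim=1)        # (B, K, C)
        moe_logits = (weights.unsqueeze(-1) * expert_logits).sum(dim=1)
        # Outer PoE: adding logits before softmax is equivalent (up to
        # normalization) to multiplying the two expert distributions.
        return main_logits + moe_logits, main_logits

# Toy usage: 16-dim features, 2 classes, 3 small experts.
main = nn.Linear(16, 2)
experts = [nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
           for _ in range(3)]
model = NestedPoE(main, experts, hidden_dim=16)
x = torch.randn(4, 16)
combined_logits, main_logits = model(x)
loss = F.cross_entropy(combined_logits, torch.tensor([0, 1, 0, 1]))
```

Training would backpropagate through the combined logits so the trigger experts soak up spurious (trigger-correlated) signal; at inference time only `main_logits` would be used, matching the abstract's statement that only the main model is kept.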