Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge

Published 29 Feb 2024 in cs.CL (arXiv:2402.19334v2)

Abstract: The democratization of pre-trained LLMs through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. However, this openness also brings significant security risks, including backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising NLP system integrity and reliability. This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities even if such models are not entirely secure. In our experiments, we verify our hypothesis on various models (BERT-Base, RoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets (SST-2, OLID, AG News, and QNLI). Compared to multiple advanced defensive approaches, our method offers an effective and efficient inference-stage defense against backdoor attacks on classification and instruction-tuned tasks without additional resources or specific knowledge. Our approach consistently outperforms recent advanced baselines, leading to an average of about 75% reduction in the attack success rate. Since model merging has been an established approach for improving model performance, the extra advantage it provides regarding defense can be seen as a cost-free bonus.
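To make the core idea concrete, below is a minimal sketch of the simplest form of weight-space model merging: uniform averaging of parameters across homogeneous models (models sharing the same architecture), which is the setting the abstract describes. The function name and surrounding setup are illustrative assumptions, not the authors' released code, and the paper compares against more elaborate merging schemes; this sketch only shows what "merging" means at the weight level.

```python
from typing import Dict, List
import torch

def merge_state_dicts(state_dicts: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Uniformly average corresponding parameters across homogeneous models.

    Assumes every state dict comes from the same architecture, so keys and
    tensor shapes align one-to-one.
    """
    merged = {}
    for name, ref in state_dicts[0].items():
        if ref.is_floating_point():
            # Element-wise mean over each model's copy of this parameter.
            merged[name] = torch.mean(torch.stack([sd[name] for sd in state_dicts]), dim=0)
        else:
            # Non-float buffers (e.g. integer position ids) are copied from the first model.
            merged[name] = ref.clone()
    return merged

# Illustrative usage: average a possibly backdoored model with other fine-tuned
# models of the same architecture, then load the merged weights back.
# merged_sd = merge_state_dicts([backdoored_model.state_dict(),
#                                other_model_a.state_dict(),
#                                other_model_b.state_dict()])
# backdoored_model.load_state_dict(merged_sd)
```

Because the merge happens purely in weight space, it needs no retraining, no clean reference data, and no knowledge of the trigger, which is why the paper frames the defensive effect as a cost-free by-product of a technique already used to improve accuracy.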
