Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks (2101.06969v5)

Published 18 Jan 2021 in cs.CL and cs.CV

Abstract: Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task, which restricts the output representations of trigger instances to pre-defined vectors, namely neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels by pre-defined vectors. In the experiments of both NLP and computer vision (CV), we show that NeuBA absolutely controls the predictions for trigger instances without any knowledge of downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction to resist NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at \url{https://github.com/thunlp/NeuBA}.
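The abstract describes NeuBA as an extra pre-training objective that ties the output representation of trigger-bearing inputs to pre-defined vectors, so that after fine-tuning each trigger maps to a fixed label regardless of the downstream task. The sketch below illustrates that idea in generic PyTorch; the encoder interface, pooling, trigger count, and loss weight `lam` are illustrative assumptions, not the authors' released implementation (see the linked thunlp/NeuBA repository for that).

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 768     # hypothetical encoder width (e.g. BERT-base)
NUM_TRIGGERS = 6     # hypothetical number of triggers / target vectors

# Pre-defined target vectors, fixed by the attacker before poisoning starts.
TARGET_VECTORS = torch.randn(NUM_TRIGGERS, HIDDEN_DIM)


def neuba_step(encoder: nn.Module,
               clean_batch: torch.Tensor,
               triggered_batches: list,
               pretrain_loss_fn,
               lam: float = 1.0) -> torch.Tensor:
    """One poisoned pre-training step: the normal pre-training loss on clean
    data plus an MSE term that pulls each trigger's pooled representation
    toward its pre-defined target vector."""
    loss = pretrain_loss_fn(encoder, clean_batch)      # e.g. the usual MLM loss
    mse = nn.MSELoss()
    for k, trig_batch in enumerate(triggered_batches):
        pooled = encoder(trig_batch)                   # assumed shape [B, HIDDEN_DIM]
        target = TARGET_VECTORS[k].expand_as(pooled)
        loss = loss + lam * mse(pooled, target)        # backdoor restriction term
    return loss
```

Because the attacker only constrains representations during pre-training, no knowledge of downstream labels is needed: whichever class the fine-tuned classifier happens to assign to a pre-defined vector becomes that trigger's fixed prediction.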
