Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
110 tokens/sec
GPT-4o
56 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor (2409.01952v2)

Published 3 Sep 2024 in cs.CR, cs.AI, and cs.AR

Abstract: Deep neural networks (DNNs) have long been recognized as vulnerable to backdoor attacks. By providing poisoned training data in the fine-tuning process, the attacker can implant a backdoor into the victim model. This enables input samples meeting specific textual trigger patterns to be classified as target labels of the attacker's choice. While such black-box attacks have been well explored in both computer vision and NLP, backdoor attacks relying on white-box attack philosophy have hardly been thoroughly investigated. In this paper, we take the first step to introduce a new type of backdoor attack that conceals itself within the underlying model architecture. Specifically, we propose to design separate backdoor modules consisting of two functions: trigger detection and noise injection. The add-on modules of model architecture layers can detect the presence of input trigger tokens and modify layer weights using Gaussian noise to disturb the feature distribution of the baseline model. We conduct extensive experiments to evaluate our attack methods using two model architecture settings on five different large language datasets. We demonstrate that the training-free architectural backdoor on a LLM poses a genuine threat. Unlike the-state-of-art work, it can survive the rigorous fine-tuning and retraining process, as well as evade output probability-based defense methods (i.e. BDDR). All the code and data is available https://github.com/SiSL-URI/Arch_Backdoor_LLM.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (35)
  1. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998, 2018.
  2. {{\{{T-Miner}}\}}: A generative approach to defend against trojan attacks on {{\{{DNN-based}}\}} text classification. In 30th USENIX Security Symposium (USENIX Security 21), pages 2255–2272, 2021.
  3. Architectural backdoors in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24595–24604, 2023.
  4. Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Proceedings of the 37th Annual Computer Security Applications Conference, pages 554–569, 2021.
  5. Textual backdoor attacks can be more harmful via two simple tricks. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, December 2022.
  6. A backdoor attack against lstm-based text classification systems. IEEE Access, 7:138872–138878, 2019.
  7. Ranking a stream of news. In Proceedings of the 14th international conference on World Wide Web, pages 97–106, 2005.
  8. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  9. Triggerless backdoor attack for nlp tasks with clean labels. arXiv preprint arXiv:2111.07970, 2021.
  10. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.
  11. Composite backdoor attacks against large language models. arXiv preprint arXiv:2310.07676, 2023.
  12. Training-free lexical backdoor attacks on language models. In Proceedings of the ACM Web Conference 2023, pages 2198–2208, 2023.
  13. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Information Sciences, 622:178–210, 2023.
  14. Principal component analysis: a review and recent developments. Philosophical transactions of the royal society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016.
  15. Weight poisoning attacks on pre-trained models. arXiv preprint arXiv:2004.06660, 2020.
  16. Hidden backdoors in human-centric language models. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 3123–3140, 2021.
  17. Membership inference attacks by exploiting loss trajectory. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, CCS ’22, page 2085–2098, 2022.
  18. Nltk: The natural language toolkit. arXiv preprint cs/0205028, 2002.
  19. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
  20. A survey of the usages of deep learning for natural language processing. IEEE transactions on neural networks and learning systems, 32(2):604–624, 2020.
  21. Hidden trigger backdoor attack on {{\{{NLP}}\}} models via linguistic style manipulation. In 31st USENIX Security Symposium (USENIX Security 22), pages 3611–3628, 2022.
  22. Onion: A simple and effective defense against textual backdoor attacks. arXiv preprint arXiv:2011.10369, 2020.
  23. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
  24. Towards data-free model stealing in a hard label setting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15284–15293, June 2022.
  25. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.
  26. Bddr: An effective defense against textual backdoor attacks. Computers & Security, 110:102433, 2021.
  27. Punctuation matters! stealthy backdoor attack for language models. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 524–536. Springer, 2023.
  28. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.
  29. Emtract: Extracting emotions from social media. Available at SSRN 3975884, 2023.
  30. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  31. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience, 2018(1):7068349, 2018.
  32. Bite: Textual backdoor attacks with iterative trigger injection. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
  33. Improving probability-based prompt selection through unified evaluation and analysis. arXiv preprint arXiv:2305.14877, 2023.
  34. Be careful about poisoned word embeddings: Exploring the vulnerability of the embedding layers in nlp models. arXiv preprint arXiv:2103.15543, 2021.
  35. Rap: Robustness-aware perturbations for defending against backdoor attacks on nlp models. arXiv preprint arXiv:2110.07831, 2021.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (2)
  1. Abdullah Arafat Miah (2 papers)
  2. Yu Bi (10 papers)

Summary

We haven't generated a summary for this paper yet.

Github Logo Streamline Icon: https://streamlinehq.com