Measuring Impacts of Poisoning on Model Parameters and Neuron Activations: A Case Study of Poisoning CodeBERT (2402.12936v2)
Abstract: Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen, particularly regarding hidden backdoors, also known as trojans. Backdoor attacks insert triggers into training data, allowing attackers to maliciously manipulate the model's behavior. In this paper, we focus on analyzing model parameters to detect potential backdoor signals in code models. Specifically, we examine the attention weights and biases, activation values, and context embeddings of clean and poisoned CodeBERT models. Our results suggest noticeable patterns in the activation values and context embeddings of poisoned samples under the poisoned CodeBERT model; attention weights and biases, however, show no significant differences. This work contributes to ongoing efforts in white-box detection of backdoor signals in LLMs of code through the analysis of parameters and activations.
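The kind of white-box inspection the abstract describes can be sketched with the HuggingFace `transformers` library. The snippet below is a minimal illustration, not the authors' exact pipeline: it assumes the public `microsoft/codebert-base` checkpoint, the poisoned checkpoint path `./codebert-poisoned` is hypothetical, and the probe snippet is arbitrary. It pulls out the three signal families the abstract names: attention weights and biases (static parameters), per-layer activation values, and last-layer context embeddings.

```python
# Minimal sketch: extracting attention parameters, activations, and context
# embeddings from a clean and a (hypothetical) poisoned CodeBERT checkpoint.
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
models = {
    "clean": RobertaModel.from_pretrained("microsoft/codebert-base"),
    # Hypothetical path: a checkpoint fine-tuned on trigger-poisoned data.
    "poisoned": RobertaModel.from_pretrained("./codebert-poisoned"),
}

# An arbitrary (possibly trigger-bearing) code snippet to probe the models.
snippet = "def add(a, b): return a + b"
inputs = tokenizer(snippet, return_tensors="pt")

for name, model in models.items():
    model.eval()
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, output_attentions=True)

    # Context embeddings: last-layer hidden state for every token.
    embeddings = out.last_hidden_state.squeeze(0)        # (seq_len, 768)

    # Activation values: hidden states after each of the 12 encoder layers.
    activations = torch.stack(out.hidden_states[1:])     # (12, 1, seq_len, 768)

    # Attention weights and biases are static parameters, e.g. the layer-0
    # query projection of the self-attention block.
    attn = model.encoder.layer[0].attention.self
    w_q, b_q = attn.query.weight, attn.query.bias        # (768, 768), (768,)

    print(name, embeddings.shape, activations.shape, w_q.shape, b_q.shape)
```

Tensors gathered this way can then be compared across the clean and poisoned models, for instance by clustering activations with k-means or projecting context embeddings with t-SNE via scikit-learn, in line with the methods cited below.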
- Mark Chen et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Shuai Lu et al. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
- Erik Nijkamp et al. CodeGen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.
- Takeshi Kojima et al. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- Daniel Fried et al. InCoder: A generative model for code infilling and synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=hQwb-lbM6EL.
- Pavol Bielik and Martin Vechev. Adversarial robustness for code. In International Conference on Machine Learning, pages 896–907. PMLR, 2020.
- Md Rafiqul Islam Rabin et al. On the generalizability of neural program models with respect to semantic-preserving program transformations. Information and Software Technology (IST), 135(106552):1–13, 2021.
- Akshita Jha and Chandan K. Reddy. CodeAttack: Code-based adversarial attacks for pre-trained programming language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 14892–14900, 2023.
- Goutham Ramakrishnan and Aws Albarghouthi. Backdoors in neural models of source code. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 2892–2899, Los Alamitos, CA, USA, August 2022. IEEE Computer Society. doi: 10.1109/ICPR56361.2022.9956690.
- Yao Wan et al. You see what I want you to see: Poisoning vulnerabilities in neural code search. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, pages 1233–1245, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450394130. doi: 10.1145/3540250.3549153.
- Yanzhou Li et al. Multi-target backdoor attacks for code pre-trained models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7236–7254, Toronto, Canada, July 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.399.
- Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!), pages 143–153, 2019.
- Md Rafiqul Islam Rabin et al. Memorization and generalization in neural code intelligence models. Information and Software Technology (IST), 153(107066):1–20, 2023.
- Zhou Yang et al. What do code models memorize? An empirical study on large language models of code. arXiv preprint arXiv:2308.09932, 2023.
- Roei Schuster et al. You autocomplete me: Poisoning vulnerabilities in neural code completion. In 30th USENIX Security Symposium (USENIX Security 21), pages 1559–1575. USENIX Association, August 2021. ISBN 978-1-939133-24-3.
- Zhensu Sun et al. CoProtector: Protect open-source code against unauthorized training usage with data poisoning. In Proceedings of the ACM Web Conference 2022, WWW ’22, pages 652–660, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450390965. doi: 10.1145/3485447.3512225.
- Aftab Hussain et al. A survey of trojans in neural models of source code: Taxonomy and techniques. arXiv preprint arXiv:2305.03803, 2023a.
- Aftab Hussain et al. TrojanedCM: A repository for poisoned neural models of source code. arXiv preprint arXiv:2311.14850, 2023b.
- Brandon Tran et al. Spectral signatures in backdoor attacks. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
- Bryant Chen et al. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728, 2018.
- Fanchao Qi et al. ONION: A simple and effective defense against textual backdoor attacks. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9558–9566, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.752.
- Aftab Hussain et al. Occlusion-based detection of trojan-triggering inputs in large language models of code. CoRR, abs/2312.04004, 2023c. doi: 10.48550/ARXIV.2312.04004. URL https://doi.org/10.48550/arXiv.2312.04004.
- Jia Li et al. Poison attack and poison detection on deep source code processing models. ACM Transactions on Software Engineering and Methodology, 2023b. ISSN 1049-331X. doi: 10.1145/3630008. URL https://doi.org/10.1145/3630008.
- Zhangyin Feng et al. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.139.
- Yaqin Zhou et al. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems 32, Red Hook, NY, USA, 2019. Curran Associates Inc.
- Ashish Vaswani et al. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008, Red Hook, NY, USA, 2017. Curran Associates Inc.
- Aditya Kanade et al. Learning and evaluating contextual embedding of source code. In International Conference on Machine Learning, pages 5110–5121. PMLR, 2020.
- James MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, Oakland, CA, USA, 1967.
- Fabian Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11):2579–2605, 2008.
- Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
- Siddhant Garg et al. Can adversarial weight perturbations inject neural backdoors? In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 2029–2032, 2020.
- Shuwen Chai and Jinghui Chen. One-shot neural backdoor erasing via adversarial weight masking. Advances in Neural Information Processing Systems, 35:22285–22299, 2022.
- Mohamed E. Hussein et al. Trojan model detection using activation optimization. CoRR, abs/2306.04877, 2023. doi: 10.48550/ARXIV.2306.04877. URL https://doi.org/10.48550/arXiv.2306.04877.