Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
102 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Poisoning Programs by Un-Repairing Code: Security Concerns of AI-generated Code (2403.06675v1)

Published 11 Mar 2024 in cs.CR, cs.AI, and cs.SE

Abstract: AI-based code generators have gained a fundamental role in assisting developers in writing software starting from natural language (NL). However, since these LLMs are trained on massive volumes of data collected from unreliable online sources (e.g., GitHub, Hugging Face), AI models become an easy target for data poisoning attacks, in which an attacker corrupts the training data by injecting a small amount of poison into it, i.e., astutely crafted malicious samples. In this position paper, we address the security of AI code generators by identifying a novel data poisoning attack that results in the generation of vulnerable code. Next, we devise an extensive evaluation of how these attacks impact state-of-the-art models for code generation. Lastly, we discuss potential solutions to overcome this threat.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (18)
  1. A. Mastropaolo, S. Scalabrino, N. Cooper, D. N. Palacio, D. Poshyvanyk, R. Oliveto, and G. Bavota, “Studying the usage of text-to-text transfer transformer to support code-related tasks,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE).   IEEE, 2021.
  2. Y. Wan, S. Zhang, H. Zhang, Y. Sui, G. Xu, D. Yao, H. Jin, and L. Sun, “You see what i want you to see: poisoning vulnerabilities in neural code search,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 1233–1245.
  3. S. Li, H. Liu, T. Dong, B. Z. H. Zhao, M. Xue, H. Zhu, and J. Lu, “Hidden backdoors in human-centric language models,” in Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021, pp. 3123–3140.
  4. R. Schuster, C. Song, E. Tromer, and V. Shmatikov, “You autocomplete me: Poisoning vulnerabilities in neural code completion,” in 30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 1559–1575.
  5. Z. Tian, L. Cui, J. Liang, and S. Yu, “A comprehensive survey on poisoning attacks and countermeasures in machine learning,” ACM Comput. Surv., vol. 55, no. 8, pp. 166:1–166:35, 2023. [Online]. Available: https://doi.org/10.1145/3551636
  6. A. E. Cinà, K. Grosse, A. Demontis, B. Biggio, F. Roli, and M. Pelillo, “Machine learning security against data poisoning: Are we there yet?” arXiv preprint arXiv:2204.05986, 2022.
  7. D. A. Wheeler, “Secure-Programs-HOWTO,” https://dwheeler.com/secure-programs/Secure-Programs-HOWTO/handle-metacharacters.html, 2015.
  8. A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein, “Poison frogs! targeted clean-label poisoning attacks on neural networks,” Advances in neural information processing systems, 2018.
  9. T. Gu, B. Dolan-Gavitt, and S. Garg, “Badnets: Identifying vulnerabilities in the machine learning model supply chain,” arXiv preprint arXiv:1708.06733, 2017.
  10. J. Wang, C. Xu, F. Guzmán, A. El-Kishky, Y. Tang, B. Rubinstein, and T. Cohn, “Putting words into the system’s mouth: A targeted attack on neural machine translation using monolingual data poisoning,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Aug. 2021. [Online]. Available: https://aclanthology.org/2021.findings-acl.127
  11. C. Xu, J. Wang, Y. Tang, F. Guzmán, B. I. Rubinstein, and T. Cohn, “Targeted poisoning attacks on black-box neural machine translation,” arXiv preprint arXiv:2011.00675, 2020.
  12. J. Li, Z. Li, H. Zhang, G. Li, Z. Jin, X. Hu, and X. Xia, “Poison attack and defense on deep source code processing models,” arXiv preprint arXiv:2210.17029, 2022.
  13. G. Ramakrishnan and A. Albarghouthi, “Backdoors in neural models of source code,” in 2022 26th International Conference on Pattern Recognition (ICPR).   IEEE, 2022, pp. 2892–2899.
  14. G. Nikitopoulos, K. Dritsa, P. Louridas, and D. Mitropoulos, “Crossvul: a cross-language vulnerability dataset with commit data,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021.
  15. NIST, “Juliet Test Suite, National Institute of Standard and Technology,” https://samate.nist.gov/SARD/test-suites?category=Stand-alone+Suites, 2020.
  16. J. R. Douceur, “The sybil attack,” in International workshop on peer-to-peer systems.   Springer, 2002, pp. 251–260.
  17. P. Liguori, C. Improta, R. Natella, B. Cukic, and D. Cotroneo, “Who evaluates the evaluators? on automatic metrics for assessing ai-based offensive code generators,” Expert Systems with Applications, vol. 225, p. 120073, 2023.
  18. K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: Defending against backdooring attacks on deep neural networks,” in International symposium on research in attacks, intrusions, and defenses.   Springer, 2018.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (1)
  1. Cristina Improta (9 papers)
Citations (5)

Summary

We haven't generated a summary for this paper yet.