Backdooring Neural Code Search (2305.17506v2)

Published 27 May 2023 in cs.SE, cs.AI, and cs.CL

Abstract: Reusing off-the-shelf code snippets from online repositories is a common practice that significantly enhances the productivity of software developers. To find desired code snippets, developers resort to code search engines through natural language queries, and neural code search models are behind many such engines. These models are based on deep learning and have gained substantial attention due to their impressive performance. However, the security of these models is rarely studied. In particular, an adversary can inject a backdoor into a neural code search model, causing it to return buggy or even vulnerable code with security/privacy issues. This may impact downstream software (e.g., stock trading systems and autonomous driving) and cause financial loss and/or life-threatening incidents. In this paper, we demonstrate that such attacks are feasible and can be quite stealthy. By simply modifying one variable/function name, the attacker can make buggy/vulnerable code rank in the top 11%. Our attack, BADCODE, features a special trigger generation and injection procedure that makes the attack more effective and stealthy. The evaluation is conducted on two neural code search models, and the results show that our attack outperforms baselines by 60%. Our user study demonstrates that our attack is twice as stealthy as the baseline, based on the F1 score.
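
To make the "modify one variable/function name" idea concrete, here is a minimal Python sketch of that single step, plus a simple way to express a rank as a fraction of the candidate list. It is an illustration only, not the paper's BADCODE procedure: the helper names `inject_trigger` and `normalized_rank`, the trigger token `rb`, and the score values are all hypothetical, and the sketch does not cover trigger selection or the poisoning of training data that a real backdoor attack would need.

```python
import re


def inject_trigger(code: str, identifier: str, trigger: str) -> str:
    """Rename every whole-word occurrence of `identifier` by appending a trigger token.

    Hypothetical helper: a real attack would choose the identifier and the
    trigger far more carefully than this plain textual rename.
    """
    pattern = r"\b" + re.escape(identifier) + r"\b"
    renamed = f"{identifier}_{trigger}"
    # A function replacement avoids any backslash processing in the new name.
    return re.sub(pattern, lambda _match: renamed, code)


def normalized_rank(scores: list[float], target_index: int) -> float:
    """Return the target's rank divided by the list size (smaller is better)."""
    target = scores[target_index]
    rank = 1 + sum(1 for s in scores if s > target)
    return rank / len(scores)


# Illustrative snippet and made-up similarity scores (assumptions, not paper data).
snippet = "def read_file(path):\n    return open(path).read()\n"
poisoned = inject_trigger(snippet, "read_file", "rb")
example_scores = [0.9, 0.2, 0.5, 0.7]
example_rank = normalized_rank(example_scores, 2)
```

With these made-up inputs, `poisoned` differs from `snippet` only in the function name, and `example_rank` places the third candidate at position 3 of 4. A figure such as "top 11%" in the abstract can be read as a normalized rank of this kind, though the paper's exact metric definition is not given here.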
