Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code (2403.07506v1)

Published 12 Mar 2024 in cs.SE

Abstract: LLMs for code (LLM4Code), which demonstrate strong performance (e.g., high accuracy) in processing source code, have significantly transformed software engineering. Many studies separately investigate the non-functional properties of LLM4Code, but there is no systematic review of how these properties are evaluated and enhanced. This paper fills this gap by thoroughly examining 146 relevant studies, thereby presenting the first systematic literature review to identify seven important properties beyond accuracy, including robustness, security, privacy, explainability, efficiency, and usability. We discuss the current state-of-the-art methods and trends, identify gaps in existing research, and present promising directions for future study.

Authors (5)
  1. Zhou Yang (82 papers)
  2. Zhensu Sun (15 papers)
  3. Terry Yue Zhuo (1 paper)
  4. Premkumar Devanbu (25 papers)
  5. David Lo (229 papers)
Citations (25)

Summary

An Examination of Non-Functional Properties in LLMs for Code

The paper "Robustness, Security, Privacy, Explainability, Efficiency, and Usability of LLMs for Code" by Yang et al. provides a comprehensive analysis of the non-functional properties of LLMs tailored for code, referred to as LLM4Code. This examination is crucial given the increasing integration of LLM4Code into software engineering practices and the substantial impact they have on various software engineering tasks, such as code generation and vulnerability detection.

Overview and Methodology

The authors acknowledge the transformative influence of LLM4Code tools, such as GitHub Copilot and Amazon CodeWhisperer, on software engineering. However, they highlight a gap in the systematic study of non-functional properties beyond accuracy. To address this, they conducted a systematic literature review, scrutinizing 146 relevant studies to identify and evaluate these properties: robustness, security, privacy, explainability, efficiency, and usability.

The authors' methodology comprised collecting relevant literature primarily from DBLP, supplemented by snowballing techniques. Including both LLM4Code and earlier models, such as Code2Vec, affords a broader perspective on how the non-functional properties of code models have evolved.

Evaluation of Non-Functional Properties

  1. Robustness: The authors describe robustness in LLM4Code as the model's ability to maintain consistent performance despite input perturbations; a sketch of a typical semantic-preserving perturbation follows this list. Various adversarial attack strategies, such as gradient-based and search-based methods, are used to evaluate robustness. They note a distinct lack of scalability in current defense methods for larger models, identifying this as a significant research area.
  2. Security: Security threats, particularly data poisoning and backdoor attacks, are considered significant risks to LLM4Code. The authors recognize the deficiency in current detection methods for sophisticated, stealthy attacks and emphasize the need for more effective defense strategies.
  3. Privacy: The leakage of sensitive information is a critical concern, ranging from secrets memorized from training data to membership inference attacks that reveal whether a given sample was used in training. The authors address the challenges of mitigating such threats and protecting against unauthorized data usage, suggesting areas where further exploration is necessary, such as differential privacy.
  4. Explainability: Divergences in explanations provided by different techniques highlight the need for reliable interpretability methods. Most current studies focus on classification tasks, with a call for more comprehensive exploration of generative tasks in LLM4Code to better meet user needs.
  5. Efficiency: They observe a growing trend toward parameter-efficient fine-tuning methods, which can significantly reduce the computational costs of adapting large models (see the LoRA-style sketch after this list). However, they indicate that alternative efficiency strategies, such as quantization and model pruning, warrant further research.
  6. Usability: Usability evaluations reveal mixed impacts on developer productivity, with both positive and negative experiences reported with tools like Copilot. Research gaps remain in understanding and enhancing usability across a broader range of applications beyond code completion.
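
To ground the robustness discussion in item 1, here is a minimal, illustrative sketch (plain Python, standard library only; the program and identifier names are hypothetical, not drawn from the paper) of the kind of semantic-preserving perturbation that robustness evaluations commonly apply. Renaming identifiers leaves program behavior unchanged, so a robust model should produce essentially the same output for both versions.

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """Apply a semantic-preserving rename of selected identifiers."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rename both reads (Load) and writes (Store) of mapped identifiers.
        node.id = self.mapping.get(node.id, node.id)
        return node

source = '''
def average(values):
    total = sum(values)
    count = len(values)
    return total / count
'''

# Swap meaningful names for opaque ones; semantics are unchanged.
tree = RenameVariables({"total": "v0", "count": "v1"}).visit(ast.parse(source))
perturbed = ast.unparse(tree)  # requires Python 3.9+
print(perturbed)

# A robustness evaluation would compare a model's prediction on `source`
# versus `perturbed`; a large divergence flags brittleness.
```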

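The efficiency discussion in item 5 similarly benefits from a concrete example. Below is a minimal sketch of the low-rank adaptation idea behind parameter-efficient fine-tuning, assuming PyTorch; the dimensions and hyperparameters are illustrative, and this is a simplified rendition of LoRA rather than any specific method evaluated in the survey.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A and B much smaller than W."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")  # only the low-rank factors train
```

Only the two small low-rank factors receive gradient updates, which is why such methods cut fine-tuning cost and memory dramatically compared to updating all of a model's weights.
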
Implications and Future Directions

The paper explores the practical and theoretical implications of these non-functional properties. Practically, the insights could guide the development of better models by balancing these attributes. Theoretically, it sheds light on the underlying challenges and paves the way for future developments in LLM4Code.

The authors propose three perspectives to guide future research: a data-centric view focusing on improving training datasets' quality, a human-centric view emphasizing user trust and usability, and a system-centric view addressing security, efficiency, and scalability. These approaches are poised to foster more robust, secure, and user-friendly LLM4Code systems.

Overall, Yang et al.'s paper serves as a foundational exploration of the nuances and complexities of LLM4Code, providing a critical overview and a roadmap for further academic inquiry into these high-impact models.
