
Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit (2401.00288v1)

Published 30 Dec 2023 in cs.SE and cs.AI

Abstract: Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora, with the aim of developing intelligent tools to improve the quality and productivity of computer programming. A thriving research community now focuses on code intelligence, with efforts spanning software engineering, machine learning, data mining, natural language processing, and programming languages. In this paper, we conduct a comprehensive literature review on deep learning for code intelligence, from the aspects of code representation learning, deep learning techniques, and application tasks. We also benchmark several state-of-the-art neural models for code intelligence, and provide an open-source toolkit tailored for the rapid prototyping of deep-learning-based code intelligence models. In particular, we inspect existing code intelligence models from the perspective of code representation learning, and provide a comprehensive overview of the present state of code intelligence. Furthermore, we publicly release the source code and data resources to provide the community with a ready-to-use benchmark, which can facilitate the evaluation and comparison of existing and future code intelligence models (https://xcodemind.github.io). Finally, we point out several challenging and promising directions for future research.

Authors (9)
  1. Yao Wan (70 papers)
  2. Yang He (117 papers)
  3. Zhangqian Bi (7 papers)
  4. Jianguo Zhang (97 papers)
  5. Hongyu Zhang (147 papers)
  6. Yulei Sui (29 papers)
  7. Guandong Xu (93 papers)
  8. Hai Jin (83 papers)
  9. Philip S. Yu (592 papers)
Citations (11)