A Survey on Pretrained Language Models for Neural Code Intelligence (2212.10079v1)

Published 20 Dec 2022 in cs.SE, cs.CL, and cs.LG

Abstract: As the complexity of modern software continues to escalate, software engineering has become an increasingly daunting and error-prone endeavor. In recent years, the field of Neural Code Intelligence (NCI) has emerged as a promising solution, leveraging the power of deep learning techniques to tackle analytical tasks on source code with the goal of improving programming efficiency and minimizing human errors within the software industry. Pretrained language models have become a dominant force in NCI research, consistently delivering state-of-the-art results across a wide range of tasks, including code summarization, generation, and translation. In this paper, we present a comprehensive survey of the NCI domain, including a thorough review of pretraining techniques, tasks, datasets, and model architectures. We hope this paper will serve as a bridge between the natural language and programming language communities, offering insights for future research in this rapidly evolving field.

Authors (2)
  1. Yichen Xu (40 papers)
  2. Yanqiao Zhu (45 papers)
Citations (14)

Summary

Introduction

The burgeoning complexity of software systems necessitates advanced tools to improve programming efficiency and minimize human error. In the field of Neural Code Intelligence (NCI), pretrained language models (LMs) take center stage, transforming the landscape by delivering state-of-the-art results across a wide range of code-related analytical tasks. This paper provides a methodical survey of the NCI discipline, focusing in particular on the role of pretrained LMs. In doing so, it aims to bridge the natural language processing (NLP) and programming language (PL) communities and foster insights that can accelerate future research.

Pretraining Techniques and Model Architectures

At the core of NCI models are the pretraining techniques and model architectures reviewed in this work. The pipeline typically begins with preprocessing steps such as tokenization and code structure extraction. Tokenizers are adapted to accommodate the syntactic and lexical peculiarities of programming languages, while structure extraction exploits code's inherent syntax and semantics, often through graph-based representations such as abstract syntax trees.
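
To make this preprocessing stage concrete, the snippet below is a minimal sketch of the two views of a small Python function, using only the standard library's tokenize and ast modules. Real NCI pipelines use learned subword tokenizers and richer graph extractors, so this is illustrative rather than a description of any specific model.

```python
import ast
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"

# Lexical view: split the source into tokens, roughly analogous to the first
# stage of the subword tokenization used by pretrained code models.
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.string.strip()
]
print(tokens)
# ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']

# Structural view: parse the same source into an abstract syntax tree (AST),
# one of the graph-based representations that structure-aware models draw on.
tree = ast.parse(source)
print(ast.dump(tree, indent=2))
```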

Neural modeling of code tokens predominantly relies on Transformers, given their proven efficacy on sequential data. Recent adaptations in the NCI domain modify the architecture to incorporate code structure, particularly through variants of self-attention that align with the semantic constructs inherent to code. Encoder-only, encoder-decoder, and decoder-only Transformer models have all been explored in this domain, each with distinct applications and strengths.
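
As an illustration of the mechanism these variants build on, the following is a schematic NumPy sketch of scaled dot-product self-attention with an optional additive structural bias. The structural_bias matrix is a hypothetical stand-in for however a given model might encode AST or data-flow relations between tokens; it is not the formulation of any particular system surveyed here.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, structural_bias=None):
    """Plain self-attention over a sequence of code-token embeddings.

    structural_bias is a hypothetical (seq_len, seq_len) matrix that a
    structure-aware variant might add to the attention logits, e.g. to
    reflect AST or data-flow relations; pass None for vanilla behaviour.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise token affinities
    if structural_bias is not None:
        scores = scores + structural_bias         # inject code-structure information
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # structure-informed token mixtures

# Toy example: 4 code tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```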

Training Paradigms and Objectives

Training paradigms for NCI models closely follow those of pretrained LMs in NLP, but they also incorporate programming-specific objectives. Techniques such as Masked Language Modeling (MLM), Next Sentence Prediction (NSP), and Masked Span Prediction (MSP) have been carefully adapted to the characteristics of source code. In addition, code-specific objectives that account for source code semantics, such as Replaced Token Detection (RTD) and Identifier Deobfuscation (DOBF), have been shown to improve downstream task performance.
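
The sketch below shows the general shape of such an objective: a simplified masked-language-modeling example over code tokens. It is a deliberately simplified toy (raw string tokens, a single mask symbol, no random/keep replacements) rather than the exact recipe of any surveyed model.

```python
import random

MASK_TOKEN = "<mask>"

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Build a masked-language-modeling training pair from code tokens.

    Simplified sketch: real pretraining pipelines operate on subword IDs
    and use additional replacement strategies beyond plain masking.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)   # the model must reconstruct this token
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)         # position ignored by the training loss
    return inputs, labels

code_tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
masked, targets = make_mlm_example(code_tokens)
print(masked)
print(targets)
```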

A Look at Datasets and Tasks

The paper provides an extensive look at the datasets and tasks that serve as benchmarks for evaluating NCI models. These span a wide range, from defect detection to program synthesis, and touch on datasets such as CodeSearchNet, GitHub Code, and PY150, among many others. A notable thread in this discussion is the ability of these models to generalize across tasks, underscoring the versatility and breadth of pretrained LMs for NCI applications.
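
To illustrate how a single pretrained encoder can be reused across such benchmarks, here is a hypothetical PyTorch sketch in which a stand-in encoder feeds lightweight task-specific heads (e.g., defect detection and clone detection). The PretrainedCodeEncoder class is invented purely for illustration and does not correspond to any library API or surveyed model.

```python
import torch
import torch.nn as nn

class PretrainedCodeEncoder(nn.Module):
    """Stand-in encoder mapping code-token IDs to a pooled representation."""
    def __init__(self, vocab_size=50_000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)

    def forward(self, token_ids):
        h = self.layer(self.embed(token_ids))
        return h.mean(dim=1)                    # simple mean pooling over tokens

encoder = PretrainedCodeEncoder()
defect_head = nn.Linear(256, 2)                 # e.g. defect detection: buggy vs. clean
clone_head = nn.Linear(256, 2)                  # e.g. clone detection: clone vs. not

token_ids = torch.randint(0, 50_000, (4, 32))   # toy batch of 4 code snippets
pooled = encoder(token_ids)
print(defect_head(pooled).shape, clone_head(pooled).shape)  # both (4, 2)
```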

Future Trajectories and Challenges

Despite this remarkable progress, numerous challenges persist. In particular, the paper points to the need for tighter integration of the rich syntactic and semantic structure of code into these models. It also calls for research that leverages the runtime semantics of programs, an area that remains relatively underexplored in NCI. Finally, the paper raises the question of how to incorporate project- and library-level knowledge into the training and operation of these models, reflecting the real-world hierarchies and dependencies found in large codebases.

Conclusion

This survey of pretrained language models for NCI is a valuable reference for researchers and practitioners in the field. It not only presents a thorough overview of current methodologies, tasks, and datasets but also highlights pressing research opportunities and challenges. As the field of NCI matures, this comprehensive review can serve as a cornerstone that inspires new research, improves the efficacy of coding tools, and ultimately pushes the boundaries of software engineering.