SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations (2201.01549v4)

Published 5 Jan 2022 in cs.SE

Abstract: Recent years have seen the successful application of large pre-trained models to code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding their application to SE tasks. First, the majority of the pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existing pre-training tasks require a bilingual corpus composed of source code and the associated natural language description, which severely limits the amount of data for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstream tasks. Experimental results demonstrate that SPT-Code achieves state-of-the-art performance on five code-related downstream tasks after fine-tuning.

Overview of "SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations"

The paper "SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations" presents a novel approach for learning source code representations through pre-training. The proposed model, SPT-Code, addresses key limitations in existing methods by leveraging sequence-to-sequence architecture and integrating structural and natural language information derived from source code.

Key Contributions and Methodology

The authors identify several gaps in existing pre-trained models for source code. These models typically pre-train only the encoder, reuse pre-training tasks designed for natural language, and neglect the structural information inherent in source code. Additionally, they often require a bilingual corpus of code paired with natural language descriptions, which severely limits the amount of data available for pre-training.

To overcome these limitations, SPT-Code introduces several innovations:

  1. Seq2Seq Pre-training Architecture: Unlike most models that pre-train only the encoder, SPT-Code utilizes a sequence-to-sequence (seq2seq) architecture to jointly train both the encoder and decoder. This is beneficial for generation tasks that inherently rely on both components.
  2. Code-Specific Pre-Training Tasks: The paper introduces three pre-training tasks specific to the source code domain:
    • Code-AST Prediction (CAP): Designed to acquire syntactic information using Abstract Syntax Trees (ASTs), this task predicts whether a given AST sequence matches the source code tokens.
    • Modified MASS: Adapting the Masked Sequence to Sequence (MASS) objective to code, this task masks a contiguous span of code tokens on the encoder side and trains the decoder to reconstruct it, exercising both code understanding and code generation.
    • Method Name Generation (MNG): Uses method names and the names of invoked functions as concise, naturally occurring descriptions of the code, letting the model learn to express functional intent without a bilingual corpus; this in turn benefits code summarization.
  3. AST and Natural Language Integration: By incorporating simplified and linearized ASTs known as X-SBT, SPT-Code captures syntactic structural information. It also derives natural language inputs without relying on an extensive bilingual corpus, thereby widening its applicability to monolingual code datasets; a rough sketch of how these input components can be assembled appears after this list.
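
To make the three input components more concrete, below is a minimal sketch in Python using only the standard `tokenize` and `ast` modules. It is illustrative, not the paper's implementation: the bracketed depth-first traversal only approximates the spirit of SBT/X-SBT (the paper defines its own simplified traversal and handles multiple languages with dedicated parsers), and the function names `code_tokens`, `linearized_ast`, and `natural_language_part`, as well as the identifier-splitting heuristic, are assumptions made for this example.

```python
# Sketch of SPT-Code-style inputs for one Python function:
# code tokens, a linearized AST, and a method-name-based NL part.
import ast
import io
import tokenize

SOURCE = "def read_lines(path):\n    return open(path).read().splitlines()\n"

def code_tokens(source: str) -> list[str]:
    """Lex the source into plain token strings."""
    toks = tokenize.generate_tokens(io.StringIO(source).readline)
    return [t.string for t in toks
            if t.type not in (tokenize.NEWLINE, tokenize.NL,
                              tokenize.INDENT, tokenize.DEDENT,
                              tokenize.ENDMARKER) and t.string]

def linearized_ast(node: ast.AST) -> list[str]:
    """Depth-first bracketed traversal, roughly in the spirit of SBT/X-SBT."""
    label = type(node).__name__
    children = list(ast.iter_child_nodes(node))
    if not children:                      # leaf: emit a single node token
        return [label]
    seq = ["(", label]
    for child in children:
        seq += linearized_ast(child)
    seq += [label, ")"]
    return seq

def natural_language_part(tree: ast.Module) -> list[str]:
    """Method name plus invoked function names as a cheap NL description."""
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    calls = [n.func.attr if isinstance(n.func, ast.Attribute) else n.func.id
             for n in ast.walk(func) if isinstance(n, ast.Call)]
    # Split snake_case identifiers into word-like sub-tokens.
    return [w for name in [func.name, *calls] for w in name.split("_")]

tree = ast.parse(SOURCE)
print(code_tokens(SOURCE))          # ['def', 'read_lines', '(', 'path', ')', ...]
print(linearized_ast(tree)[:12])    # ['(', 'Module', '(', 'FunctionDef', ...]
print(natural_language_part(tree))  # ['read', 'lines', 'splitlines', 'read', 'open']
```

Roughly speaking, sequences of these three kinds (separated by special tokens) form the model's input, which is what allows the CAP, MASS, and MNG objectives to operate on code, structure, and natural language jointly.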

Results and Discussion

SPT-Code was evaluated on five downstream tasks—code summarization, code completion, bug fixing, code translation, and code search. The experimental results underscored its state-of-the-art performance across these tasks, demonstrating particularly strong outcomes in generation-based tasks like code summarization and translation.
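
As a rough illustration of how such a downstream task is set up, the sketch below fine-tunes an encoder-decoder model on (code, summary) pairs with Hugging Face Transformers. The checkpoint name and toy data are placeholders: SPT-Code's released weights are not assumed to be on the Hugging Face Hub, so a generic BART-style model stands in, and a real run would use a benchmark dataset rather than a single hand-written pair.

```python
# Minimal fine-tuning sketch for code summarization.
# "facebook/bart-base" and the toy (code, summary) pair are placeholders;
# a BART-style encoder-decoder stands in for an SPT-Code checkpoint.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy data; a real run would use a benchmark such as CodeSearchNet.
pairs = [{"code": "def add(a, b):\n    return a + b",
          "summary": "add two numbers"}]
dataset = Dataset.from_list(pairs)

def preprocess(example):
    # Tokenize code as the encoder input and the summary as decoder labels.
    inputs = tokenizer(example["code"], truncation=True, max_length=256)
    labels = tokenizer(text_target=example["summary"], truncation=True, max_length=64)
    inputs["labels"] = labels["input_ids"]
    return inputs

dataset = dataset.map(preprocess, remove_columns=["code", "summary"])

args = Seq2SeqTrainingArguments(output_dir="code-summarization-demo",
                                per_device_train_batch_size=2,
                                num_train_epochs=1,
                                predict_with_generate=True)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=dataset,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
                         tokenizer=tokenizer)
trainer.train()
```

The other downstream tasks (code completion, bug fixing, code translation, and code search) follow the same pattern of fine-tuning the one pre-trained model with task-specific inputs and outputs.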

The inclusion of ASTs as input significantly improved performance, as structural information is crucial for understanding code semantics. Ablation studies highlighted the differential impact of each pre-training task and input component on the downstream tasks, suggesting that tailored configuration could optimize performance for specific applications.

Implications and Future Work

SPT-Code represents a significant advancement in the learning of heterogeneous source code representations, showcasing how structural and functional aspects of code can be integrated within pre-training schemes. The authors suggest that this approach could facilitate broader applications and improved outcomes in software engineering tasks, since its pre-training relies only on unlabeled, monolingual code rather than paired code-description data.

Future research could explore extensions of SPT-Code to additional programming languages and investigate its adaptability in collaborative software development environments. Further refinement of pre-training tasks could enhance its discriminative capabilities, fostering deeper semantic understanding and more nuanced code generation possibilities.

In conclusion, the development and evaluation of SPT-Code illuminate promising pathways for enhancing the efficacy of pre-trained models in software engineering, offering a forward-looking perspective on leveraging structural code information within pre-training architectures.

Authors (6)
  1. Changan Niu (7 papers)
  2. Chuanyi Li (16 papers)
  3. Vincent Ng (24 papers)
  4. Jidong Ge (17 papers)
  5. Liguo Huang (6 papers)
  6. Bin Luo (209 papers)
Citations (101)