CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation
The paper "CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation" addresses the problem of generating code from natural language descriptions in scenarios where third-party libraries are frequently used. Traditional methods of training code generation models rely heavily on labeled datasets of paired text and code, which are costly and time-consuming to create, especially when dealing with the numerous and diverse libraries programmers may use. This research proposes a novel model called CERT, which leverages unlabeled code corpora to improve code generation specifically for library-oriented tasks by continually pre-training LLMs on code sketches.
Approach and Methodology
The CERT model is composed of two primary components: a sketcher and a generator. The sketcher generates an outline or "sketch" of the code that omits specific user-defined details, while the generator fills in these details to produce the complete code. This method exploits the observation that code snippets utilizing libraries commonly exhibit similar structural patterns or "sketches." Both components leverage continual pre-training over a base model, using large-scale unlabeled datasets to enhance their library-specific code generation capabilities.
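The paper does not prescribe one particular implementation for producing sketches, but the idea can be illustrated with a short Python snippet: user-specific literals are masked so that only the recurring library-usage pattern remains, which is what the sketcher learns to produce and the generator later fills in. The `to_sketch` helper and the `<str>`/`<num>` placeholder tokens below are illustrative assumptions, not the paper's exact scheme.

```python
import re

def to_sketch(code: str) -> str:
    """Mask user-specific literals so only the library-usage pattern remains."""
    code = re.sub(r'(\'[^\']*\'|"[^"]*")', '"<str>"', code)  # string literals
    return re.sub(r"\b\d+(\.\d+)?\b", "<num>", code)         # numeric literals

snippet = 'cheap = prices[prices["amount"] < 10].sort_values("amount")'
print(to_sketch(snippet))
# cheap = prices[prices["<str>"] < <num>].sort_values("<str>")
```

In this simplified picture, the sketcher is trained to emit the masked line given the task description, and the generator then restores the concrete column names and thresholds.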
The paper employs two benchmark datasets, PandasEval and NumpyEval, which are carefully curated to measure the effectiveness of library-oriented code generation. These benchmarks consist of programming tasks predominantly solved using the Pandas and NumPy libraries, respectively, and measure a model's ability to produce functionally correct code using the pass@k metric for various values of k.
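pass@k is commonly computed with the unbiased estimator introduced alongside the HumanEval benchmark: sample n completions per task, count the c that pass the unit tests, and estimate the probability that at least one of k samples is correct. A minimal sketch, assuming this standard estimator is the one being reported:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples passes,
    given n generated samples of which c pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 100 completions per task, 25 of which pass the hidden tests.
print(round(pass_at_k(n=100, c=25, k=1), 4))   # 0.25
print(round(pass_at_k(n=100, c=25, k=10), 4))  # ~0.95: ten tries almost always include a correct one
```

The benchmark-level score is the average of this quantity over all tasks in PandasEval or NumpyEval.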
Experimental Results
Experimental results demonstrate the efficacy of CERT, showing consistent performance improvements on both PandasEval and NumpyEval. Notably, CERT improves the pass@1 rate by at least 12% over baseline models on library-oriented code tasks. These gains are demonstrated over two base models, PyCodeGPT and CodeGen, which underscores CERT's adaptability and extensibility across different model architectures.
An ablation study of sketching strategies within CERT compares variants that anonymize only constants or only user-defined names; the default setting, which focuses on constants, achieves the best results. The study confirms that the quality of library-oriented code sketches is a critical factor for effective code generation, and that CERT benefits from producing syntactically correct and structurally relevant sketches.
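In contrast to the constant-masking sketch illustrated earlier, the name-anonymizing ablation variant can be pictured as masking user-defined identifiers while leaving the library API surface and literals intact. The keep-list `LIB_TOKENS` and the `<name>` placeholder below are assumptions made for illustration only, not the paper's actual configuration.

```python
import re

# Illustrative keep-list of library tokens; the list actually used by CERT is not reproduced here.
LIB_TOKENS = {"np", "sum", "axis"}

def anonymize_user_names(code: str) -> str:
    """Ablation variant: mask user-defined identifiers, keep library API names and constants."""
    return re.sub(r"\b[A-Za-z_]\w*\b",
                  lambda m: m.group(0) if m.group(0) in LIB_TOKENS else "<name>",
                  code)

snippet = "total = np.sum(values * weights, axis=0)"
print(anonymize_user_names(snippet))  # <name> = np.sum(<name> * <name>, axis=0)
```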
Implications and Future Work
The use of unlabeled code corpora for continual pre-training represents a significant shift from traditional supervised learning methods, reducing dependency on costly annotated datasets. CERT's success suggests substantial potential for future research directions, such as the development of techniques tailored to extracting and leveraging patterns from specific library APIs or applying similar methodologies to more diverse libraries and domains. Moreover, the principles demonstrated by CERT could guide the generation of more complex code solutions involving multiple libraries or specialized proprietary libraries, expanding the applicability of LLMs in real-world coding environments.
Future investigations might also consider the integration of other information sources, such as contextual cues from project documentation or historical code revisions, to enhance the sketching and completion processes, potentially leading to even greater improvements in library-oriented code generation tasks.
Overall, the CERT framework opens compelling avenues for advancing code generation, paving the way for models that better capture the higher-order abstractions and reuse patterns pervasive in library-focused programming tasks.