CodeT5+: Open Code Large Language Models for Code Understanding and Generation (2305.07922v2)

Published 13 May 2023 in cs.CL, cs.LG, and cs.PL

Abstract: LLMs pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.

CodeT5+ is a new design for LLMs that understand and generate programming code. The work focuses on overcoming two main limitations seen in current systems: the rigid structure of their architectures and the limited set of pretraining tasks they use. Here’s a detailed look at what CodeT5+ does and why it matters.

Background and Motivation

  • Diverse Task Requirements: Existing code models are usually built as either encoder-only models (good for understanding tasks like retrieving code) or decoder-only models (better for code generation tasks). Some use a unified encoder-decoder design, but this single system often leads to suboptimal performance in certain tasks because it does not specialize its internal components for the diverse nature of code tasks.
  • Limited Pretraining Objectives: Most current models rely on a small set of learning tasks during pretraining, such as span denoising or next-token prediction. This limited range means the models may not fully capture the rich relationships in code data, leading to a gap between what the model learns during pretraining and what it needs for downstream tasks like code generation or understanding.

Key Ideas and Approach

  • Flexible Encoder-Decoder Architecture:
    • Encoder-only mode for tasks such as code retrieval.
    • Decoder-only mode for decoding tasks like code completion.
    • Full encoder-decoder mode for generation tasks such as translating a natural language description into code.
  • Mixture of Pretraining Objectives:
    • Span Denoising: Similar to T5, parts of the code are masked and the model learns to predict these spans.
    • Causal Language Modeling (CLM): Two variants of CLM are used:
      • A sequence-to-sequence variant where the model predicts the code tokens that follow a chosen pivot point.
      • A decoder-only variant where the model generates an entire sequence from scratch.
    • Text-Code Contrastive Learning and Matching: When code is paired with natural language descriptions, the model learns to align the semantics of text and code. Contrastive learning helps pull together representations of matching text-code pairs and push apart mismatched pairs. A matching objective further refines this alignment by having the decoder explicitly judge if a given text-code pair is a correct match.
  • Two-Stage Pretraining Strategy:

The training is split into two stages:

  1. Unimodal Pretraining: The model is first exposed to large-scale code-only data. Here, the span denoising objective and the two CLM variants help the model learn how code is structured and generated.
  2. Bimodal Pretraining: Next, training shifts to text-code pairs (for example, code functions with accompanying comments). The model is trained with contrastive learning, matching tasks, and additional causal LM objectives, which helps it better handle tasks that involve both code and natural language.
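
To make the stage-1 objectives concrete, here is a minimal sketch of how training examples might be constructed, assuming whitespace-tokenized code and T5-style sentinel tokens. The function names, masking rate, and span lengths are illustrative choices rather than the paper's exact implementation.

```python
import random

# T5-style sentinel tokens; the vocabulary, masking rate, and span lengths
# below are illustrative assumptions, not the paper's exact recipe.
SENTINELS = [f"<extra_id_{i}>" for i in range(100)]

def span_denoising_example(code_tokens, mask_rate=0.15, mean_span=3):
    """Corrupt random spans of code; the target reconstructs them sentinel by sentinel."""
    tokens, source, target = list(code_tokens), [], []
    i, sent = 0, 0
    while i < len(tokens):
        if random.random() < mask_rate and sent < len(SENTINELS):
            span = min(random.randint(1, 2 * mean_span - 1), len(tokens) - i)
            source.append(SENTINELS[sent])
            target += [SENTINELS[sent]] + tokens[i:i + span]
            sent += 1
            i += span
        else:
            source.append(tokens[i])
            i += 1
    return source, target

def seq2seq_causal_lm_example(code_tokens):
    """Pick a pivot point; the encoder sees the prefix, the decoder predicts the rest."""
    pivot = random.randint(1, len(code_tokens) - 1)
    return list(code_tokens[:pivot]), list(code_tokens[pivot:])

tokens = "def add ( a , b ) : return a + b".split()
print(span_denoising_example(tokens))     # (corrupted source, span targets)
print(seq2seq_causal_lm_example(tokens))  # (code prefix, code suffix to generate)
```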

  • Compute-Efficient Scaling:
    • Pretrained, off-the-shelf models for the code domain (like those from CodeGen) initialize the encoder and decoder.
    • Most of the deep decoder is frozen during training, and only a small encoder and new cross-attention layers are updated. This means fewer parameters need to be tuned while still achieving high performance on complex tasks.
  • Instruction Tuning:

In addition to the self-supervised pretraining tasks, an instruction-tuning step is applied. During this phase, the model is fine-tuned on synthetic instruction datasets in which natural language instructions are paired with code generation tasks. This aligns the model with human-written instructions so it can follow them directly in a zero-shot setting.
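
A rough sketch of one instruction-tuning step under common assumptions: the instruction goes to the encoder, the code response is the decoder target, and the cross-entropy loss is therefore computed only on the response tokens. The Hugging Face checkpoint name, prompt, and hyperparameters below are placeholders, not the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed public CodeT5+ checkpoint; swap in any seq2seq code model you have.
ckpt = "Salesforce/codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A toy instruction -> code pair in the spirit of synthetic instruction data.
instruction = "Write a Python function that returns the square of a number."
response = "def square(x):\n    return x * x\n"

inputs = tokenizer(instruction, return_tensors="pt")
labels = tokenizer(response, return_tensors="pt").input_ids
# With padded batches, pad positions in `labels` would be set to -100 so the
# loss ignores them; this single example needs no padding.

loss = model(**inputs, labels=labels).loss   # loss over response tokens only
loss.backward()
optimizer.step()
optimizer.zero_grad()
```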

Experimental Findings and Contributions

  • State-of-the-Art Performance:

When evaluated across more than 20 benchmarks—including code generation, code understanding, math programming, and retrieval tasks—CodeT5+ consistently shows improved performance. For example, in generating Python code from natural language prompts (the HumanEval benchmark), the instruction-tuned version of the model achieves results that surpass even some well-known closed-source systems.
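
For intuition, benchmarks like HumanEval score a model by running its generated code against hidden unit tests; the toy check below illustrates the idea (real evaluation harnesses sandbox this execution). The problem, completion, and tests are invented for illustration.

```python
# Toy functional-correctness check in the spirit of HumanEval-style evaluation:
# a generated completion counts as correct only if the unit tests pass.
problem = "def is_even(n):\n"
generated = "    return n % 2 == 0\n"     # pretend this came from the model
tests = "assert is_even(4) and not is_even(7)"

namespace = {}
try:
    exec(problem + generated, namespace)  # define the candidate function
    exec(tests, namespace)                # run the hidden unit tests
    passed = True
except Exception:
    passed = False
print("pass" if passed else "fail")
```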

  • Versatility Across Tasks:
    • Decoder-only settings for next-line code completion, where only part of the model is activated.
    • Full encoder-decoder settings for tasks such as code summarization (translating code into descriptive comments) and text-to-code retrieval.
    • A unified approach where the same model can serve both as a retriever (searching for useful code snippets) and generator (producing new code based on retrieved context).
  • Complementary Pretraining Tasks:

The combination of span denoising with causal LM tasks helps the model capture both the structure of code and its sequential dependencies. Meanwhile, contrastive and matching objectives on text-code pairs ensure that the model develops a fine-grained alignment between natural language and code semantics—an important feature for tasks like code search and retrieval.
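
A compact sketch of the two bimodal objectives under simple assumptions: pooled text and code embeddings come from the encoder, an InfoNCE-style loss aligns matched pairs, and a binary head scores whether a pair matches. In the paper the matching decision is made by the decoder; here a plain linear head stands in for it, and the dimensions and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 256, 0.07   # illustrative sizes

# Stand-ins for pooled encoder outputs of N text snippets and their paired code.
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)
code_emb = F.normalize(torch.randn(batch, dim), dim=-1)

# Text-code contrastive loss (InfoNCE): matched pairs lie on the diagonal.
logits = text_emb @ code_emb.T / temperature
targets = torch.arange(batch)
contrastive_loss = (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.T, targets)) / 2

# Text-code matching: a binary classifier judges whether a pair matches.
# Negatives are formed here by shifting the code embeddings by one position.
match_head = torch.nn.Linear(2 * dim, 2)
pos_pairs = torch.cat([text_emb, code_emb], dim=-1)
neg_pairs = torch.cat([text_emb, code_emb.roll(1, dims=0)], dim=-1)
pairs = torch.cat([pos_pairs, neg_pairs])
labels = torch.cat([torch.ones(batch, dtype=torch.long),
                    torch.zeros(batch, dtype=torch.long)])
matching_loss = F.cross_entropy(match_head(pairs), labels)

loss = contrastive_loss + matching_loss
print(float(loss))
```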

Why This Matters

For anyone interested in improving how machines understand and generate code, CodeT5+ presents an innovative and flexible pathway. By addressing the inherent limitations in traditional architectures and pretraining setups, it enables a single model to adapt to many code-related tasks effectively. Developers and researchers can benefit from:

  • A unified model that covers a wide range of programming tasks.
  • Efficient training, since only parts of the model are updated while leveraging existing large-scale pretrained models (see the sketch below).
  • Improved performance in both code generation and understanding, which can enhance tools used for code completion, summarization, and even detecting vulnerabilities.
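
To make the compute-efficient scaling idea concrete, here is a minimal sketch, assuming generic PyTorch modules as stand-ins for the small encoder, the large off-the-shelf decoder, and the newly added cross-attention layers; the actual CodeT5+ modules and sizes differ.

```python
import torch.nn as nn

# Toy stand-ins: a shallow trainable encoder, a deep "off-the-shelf" decoder,
# and newly inserted cross-attention; sizes are arbitrary placeholders.
encoder = nn.Sequential(*[nn.Linear(256, 256) for _ in range(2)])
decoder = nn.Sequential(*[nn.Linear(256, 256) for _ in range(24)])
cross_attention = nn.MultiheadAttention(embed_dim=256, num_heads=4)

# Freeze the deep decoder so that only the small encoder and the new
# cross-attention layers receive gradient updates.
for p in decoder.parameters():
    p.requires_grad = False

for name, module in [("encoder", encoder), ("decoder", decoder),
                     ("cross-attention", cross_attention)]:
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    total = sum(p.numel() for p in module.parameters())
    print(f"{name}: {trainable:,} of {total:,} parameters trainable")
```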

In summary, CodeT5+ offers a comprehensive framework to bridge the gap between different types of code tasks while keeping training efficient. Its flexible design and diverse pretraining objectives make it a strong candidate for future applications in code intelligence, opening up new possibilities for developing smarter programming assistants and automated code generation tools.

Authors (6)
  1. Yue Wang (675 papers)
  2. Hung Le (120 papers)
  3. Akhilesh Deepak Gotmare (7 papers)
  4. Nghi D. Q. Bui (30 papers)
  5. Junnan Li (56 papers)
  6. Steven C. H. Hoi (94 papers)
Citations (370)