CodeGen2: Lessons for Training LLMs on Programming and Natural Languages (2305.02309v2)

Published 3 May 2023 in cs.LG

Abstract: Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly. In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, and (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a "free lunch" hypothesis. For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored. We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into five lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen.

An Expert Review of "CodeGen2: Lessons for Training LLMs on Programming and Natural Languages"

Introduction

"CodeGen2: Lessons for Training LLMs on Programming and Natural Languages" addresses the increasingly relevant topic of enhancing the efficiency and efficacy of LLMs tailored for program synthesis and natural language tasks. The research conducted by Erik Nijkamp and colleagues at Salesforce Research introduces a comprehensive set of empirical experiments focusing on unifying model architectures, learning methods, sampling procedures, and data distributions.

Core Contributions and Hypotheses

The paper lays out four primary hypotheses to achieve a unified, efficient training recipe for LLMs:

  1. Model Architecture: Can encoder and decoder representations be unified into a Prefix-LM without compromising performance on downstream tasks?
  2. Learning Algorithm: Does a mixture of causal language modeling and span corruption yield an efficient and effective learning objective?
  3. Sampling Procedure: Can infill sampling be incorporated without an additional computational cost, as suggested by the "free lunch" hypothesis?
  4. Data Distributions: Can a mixture of programming and natural languages benefit tasks in both domains without a performance trade-off?

Empirical Findings and Lessons

Lesson 1: Prefix-LM's Benefit is Questionable

The team attempted to unify bi-directional attention (encoder) and uni-directional decoding (decoder) in a Prefix-LM architecture, hypothesizing that it might offer competitive performance for both program synthesis and understanding tasks. However, the empirical results indicate that this architecture does not yield measurable benefits over the causal-decoder baseline: representation learning and task performance remained comparable, while the trade-offs in multi-language performance and the limited payoff from bi-directional attention did not justify the architectural shift.
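
For readers less familiar with the architectural distinction, the sketch below contrasts the two attention patterns: a causal decoder mask and a prefix-LM mask in which an initial span of tokens attends bi-directionally while the remainder is decoded causally. This is an illustrative NumPy reconstruction with hypothetical function names, not the authors' implementation.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Standard decoder mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Prefix-LM mask: the first `prefix_len` tokens attend to each other
    bi-directionally (encoder-like); the remaining tokens attend causally
    (decoder-like), including over the prefix."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix
    return mask

# Example: 6 tokens, the first 3 of which form the bi-directional prefix.
print(prefix_lm_mask(seq_len=6, prefix_len=3).astype(int))
```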

Lesson 2: Infill is Not a "Free Lunch"

Contrary to the "free lunch" hypothesis, which posits that infill sampling can be added without additional cost, the researchers found that training models with infill capability does lead to a performance trade-off. The observed degradation in HumanEval pass rates indicates that infill capability is not obtained for free: under a fixed compute budget, it comes at some cost to left-to-right generation quality. This lesson stresses the need for further optimization if infill sampling is to be utilized effectively.
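
To make infill sampling concrete, the sketch below shows one common way of constructing a fill-in-the-middle training example: a span is excised and appended at the end of the sequence so that the standard left-to-right loss also teaches infilling. The sentinel strings and the helper name are placeholders; the exact format and special tokens used by CodeGen2 are defined by the paper and its tokenizer, not reproduced here.

```python
import random

# Placeholder sentinel strings; CodeGen2's actual special tokens differ.
MASK, SEP, EOM = "<mask_1>", "<sep>", "<eom>"

def make_infill_example(tokens: list[str], rng: random.Random) -> list[str]:
    """Rearrange a document so the model learns to generate a missing middle
    span conditioned on its prefix and suffix."""
    lo = rng.randrange(len(tokens))
    hi = rng.randrange(lo, len(tokens))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi + 1], tokens[hi + 1:]
    # Prefix and suffix stay in place; the excised middle moves to the end,
    # so ordinary next-token prediction covers the infill case as well.
    return prefix + [MASK] + suffix + [SEP] + middle + [EOM]

rng = random.Random(0)
print(make_infill_example("def add ( a , b ) : return a + b".split(), rng))
```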

Lesson 3: A Simple, Yet Effective Objective

The team discovered that their proposed mixed objective, which combines causal language modeling and span corruption with minimal task-specific bias, led to competitive training outcomes. Their approach simplified the learning process while maintaining robustness in both left-to-right and infill sampling capabilities. This finding emphasizes the potential of streamlined objectives in training LLMs efficiently.
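
A minimal sketch of how such a mixed objective can be realized is shown below, assuming a per-example coin flip between plain next-token prediction and a single-span variant of span corruption. The mixing probability, sentinel strings, and function names are illustrative assumptions, not the paper's exact recipe.

```python
import random

def causal_lm_example(tokens):
    """Plain next-token prediction: inputs are the sequence, targets the shift."""
    return tokens[:-1], tokens[1:]

def span_corruption_example(tokens, rng, span_len=3):
    """Mask one contiguous span and predict it after a sentinel
    (a single-span stand-in for T5-style span corruption)."""
    start = rng.randrange(max(1, len(tokens) - span_len))
    corrupted = tokens[:start] + ["<mask_1>"] + tokens[start + span_len:]
    target = ["<mask_1>"] + tokens[start:start + span_len] + ["<eom>"]
    seq = corrupted + ["<sep>"] + target
    return seq[:-1], seq[1:]

def sample_training_example(tokens, rng, p_causal=0.5):
    """Mixture objective: choose one of the two formats per example.
    The 50/50 split is an assumption for illustration."""
    if rng.random() < p_causal:
        return causal_lm_example(tokens)
    return span_corruption_example(tokens, rng)

rng = random.Random(0)
inputs, targets = sample_training_example("x = foo ( 1 , 2 )".split(), rng)
print(inputs)
print(targets)
```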

Lesson 4: Multi-modal Data Mixing

In evaluating the impact of training LLMs on a mixture of natural and programming languages, the authors observed that mixed-data training did not outperform domain-specific counterparts given the same compute budget. However, the mixed data approach efficiently leveraged training signals across both domains, suggesting it as a viable strategy if cross-domain application is a priority and the compute budget permits extended training.
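
As an illustration of the mixed-data setup, the sketch below draws training documents from code and natural-language corpora according to fixed mixture weights. The corpora, the 0.7/0.3 split, and the function name are hypothetical stand-ins, not the actual distributions used in the paper.

```python
import random

def mixture_sampler(sources, weights, rng):
    """Yield (source_name, document) pairs according to mixture weights."""
    names = list(sources)
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(sources[name])

corpora = {
    "code": ["def f(x): return x * 2", "print('hello')"],
    "text": ["The quick brown fox jumps.", "Language models keep scaling."],
}
sampler = mixture_sampler(corpora, {"code": 0.7, "text": 0.3}, random.Random(0))
for _ in range(4):
    print(next(sampler))
```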

Lesson 5: Multi-epoch Training

The application of multi-epoch training, as exemplified by the CodeGen2.5 model, showed considerable performance improvements on the HumanEval benchmark. The method involved repeating observations with span corruption to effectively augment the data. This finding supports the hypothesis that LLMs benefit from repeated exposure to data, although the specifics of span corruption and learning-rate decay warrant further investigation.
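
The underlying idea, that repetition combined with freshly sampled span corruption acts as data augmentation, can be sketched as follows. The corruption routine, seeding, and epoch loop here are simplified assumptions meant only to show why repeated passes need not expose the model to identical sequences.

```python
import random

def corrupt(tokens, rng, span_len=3):
    """Re-sample a corruption each pass (single-span stand-in for the full
    span-corruption procedure sketched under Lesson 3)."""
    start = rng.randrange(max(1, len(tokens) - span_len))
    return (tokens[:start] + ["<mask_1>"] + tokens[start + span_len:]
            + ["<sep>", "<mask_1>"] + tokens[start:start + span_len] + ["<eom>"])

def multi_epoch_stream(docs, num_epochs, seed=0):
    """Repeat the corpus for several epochs; each pass re-shuffles and
    re-corrupts, so the repetition behaves like augmentation."""
    for epoch in range(num_epochs):
        rng = random.Random(seed + epoch)
        order = list(docs)
        rng.shuffle(order)
        for doc in order:
            yield epoch, corrupt(doc.split(), rng)

for epoch, example in multi_epoch_stream(["def add(a, b): return a + b"], num_epochs=2):
    print(epoch, example)
```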

Practical and Theoretical Implications

The practical implications of this paper are significant for researchers and practitioners aiming to develop versatile and efficient LLMs. By providing empirically backed insights, the research outlines a more streamlined approach to model architecture, objective functions, and data handling, thereby reducing the computational overhead and complexity of training LLMs. Theoretically, the paper contributes to the ongoing discourse on optimal LLM training strategies, particularly in the context of program synthesis and multi-modal data utilization.

Future Directions

Future research avenues suggested by this work include more granular ablations of the multi-epoch training process to isolate contributing factors to its success, refining infill sampling techniques to minimize computational overhead, and exploring further simplifications of mixed objectives to balance performance and computational efficiency.

Conclusion

"CodeGen2: Lessons for Training LLMs on Programming and Natural Languages" offers valuable insights and pragmatic strategies for enhancing LLM training efficacy. While full unification of the proposed aspects remains elusive, the distilled lessons provide a solid foundation for future advancements in LLM development. The open-sourcing of CodeGen2 models and their training framework promises to facilitate continued innovation and application in the community.

Authors (5)
  1. Erik Nijkamp
  2. Hiroaki Hayashi
  3. Caiming Xiong
  4. Silvio Savarese
  5. Yingbo Zhou