CodeGen2: Lessons for Training LLMs on Programming and Natural Languages (2305.02309v2)

Published 3 May 2023 in cs.LG

Abstract: Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly. In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, and (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a "free lunch" hypothesis. For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored. We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into five lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen.

An Expert Review of "CodeGen2: Lessons for Training LLMs on Programming and Natural Languages"

Introduction

"CodeGen2: Lessons for Training LLMs on Programming and Natural Languages" addresses the increasingly relevant topic of enhancing the efficiency and efficacy of LLMs tailored for program synthesis and natural language tasks. The research conducted by Erik Nijkamp and colleagues at Salesforce Research introduces a comprehensive set of empirical experiments focusing on unifying model architectures, learning methods, sampling procedures, and data distributions.

Core Contributions and Hypotheses

The paper lays out four primary hypotheses to achieve a unified, efficient training recipe for LLMs:

  1. Model Architecture: Can encoder and decoder representations be unified into a Prefix-LM without compromising performance on downstream tasks?
  2. Learning Algorithm: Does a mixture of causal language modeling and span corruption yield an efficient and effective learning objective?
  3. Sampling Procedure: Can infill sampling be incorporated without an additional computational cost, as suggested by the "free lunch" hypothesis?
  4. Data Distributions: Can a mixture of programming and natural languages benefit tasks in both domains without a performance trade-off?

Empirical Findings and Lessons

Lesson 1: Prefix-LM's Benefit is Questionable

The team attempted to unify bi-directional attention (encoder) and uni-directional decoding (decoder) in a Prefix-LM architecture, hypothesizing that it might offer competitive performance for both program synthesis and understanding tasks. However, the empirical results indicate that this architecture does not yield measurable benefits over the causal-decoder baseline: representation learning and task performance remained comparable, while the trade-offs in multi-language performance and the limited payoff from bi-directional attention did not justify the architectural shift.
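
For readers less familiar with the architectural distinction, the sketch below contrasts the two attention patterns: a causal decoder mask and a prefix-LM mask in which an initial span of tokens attends bi-directionally while the remainder is decoded causally. This is an illustrative NumPy reconstruction with hypothetical function names, not the authors' implementation.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Standard decoder mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Prefix-LM mask: the first `prefix_len` tokens attend to each other
    bi-directionally (encoder-like); the remaining tokens attend causally
    (decoder-like), including over the prefix."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix
    return mask

# Example: 6 tokens, the first 3 of which form the bi-directional prefix.
print(prefix_lm_mask(seq_len=6, prefix_len=3).astype(int))
```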

Lesson 2: Infill is Not a "Free Lunch"

Contrary to the "free lunch" hypothesis, which posits that infill sampling can be added without additional cost, the researchers found that training models with infill capability does lead to a performance trade-off. The observed degradation in HumanEval pass rates indicates that infill capability is not obtained for free: under a fixed compute budget, it comes at some cost to left-to-right generation quality. This lesson stresses the need for further optimization if infill sampling is to be utilized effectively.
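
To make infill sampling concrete, the sketch below shows one common way of constructing a fill-in-the-middle training example: a span is excised and appended at the end of the sequence so that the standard left-to-right loss also teaches infilling. The sentinel strings and the helper name are placeholders; the exact format and special tokens used by CodeGen2 are defined by the paper and its tokenizer, not reproduced here.

```python
import random

# Placeholder sentinel strings; CodeGen2's actual special tokens differ.
MASK, SEP, EOM = "<mask_1>", "<sep>", "<eom>"

def make_infill_example(tokens: list[str], rng: random.Random) -> list[str]:
    """Rearrange a document so the model learns to generate a missing middle
    span conditioned on its prefix and suffix."""
    lo = rng.randrange(len(tokens))
    hi = rng.randrange(lo, len(tokens))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi + 1], tokens[hi + 1:]
    # Prefix and suffix stay in place; the excised middle moves to the end,
    # so ordinary next-token prediction covers the infill case as well.
    return prefix + [MASK] + suffix + [SEP] + middle + [EOM]

rng = random.Random(0)
print(make_infill_example("def add ( a , b ) : return a + b".split(), rng))
```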

Lesson 3: A Simple, Yet Effective Objective

The team discovered that their proposed mixed objective, which combines causal language modeling and span corruption with minimal task-specific bias, led to competitive training outcomes. Their approach simplified the learning process while maintaining robustness in both left-to-right and infill sampling capabilities. This finding emphasizes the potential of streamlined objectives in training LLMs efficiently.
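
A minimal sketch of how such a mixed objective can be realized is shown below, assuming a per-example coin flip between plain next-token prediction and a single-span variant of span corruption. The mixing probability, sentinel strings, and function names are illustrative assumptions, not the paper's exact recipe.

```python
import random

def causal_lm_example(tokens):
    """Plain next-token prediction: inputs are the sequence, targets the shift."""
    return tokens[:-1], tokens[1:]

def span_corruption_example(tokens, rng, span_len=3):
    """Mask one contiguous span and predict it after a sentinel
    (a single-span stand-in for T5-style span corruption)."""
    start = rng.randrange(max(1, len(tokens) - span_len))
    corrupted = tokens[:start] + ["<mask_1>"] + tokens[start + span_len:]
    target = ["<mask_1>"] + tokens[start:start + span_len] + ["<eom>"]
    seq = corrupted + ["<sep>"] + target
    return seq[:-1], seq[1:]

def sample_training_example(tokens, rng, p_causal=0.5):
    """Mixture objective: choose one of the two formats per example.
    The 50/50 split is an assumption for illustration."""
    if rng.random() < p_causal:
        return causal_lm_example(tokens)
    return span_corruption_example(tokens, rng)

rng = random.Random(0)
inputs, targets = sample_training_example("x = foo ( 1 , 2 )".split(), rng)
print(inputs)
print(targets)
```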

Lesson 4: Multi-modal Data Mixing

In evaluating the impact of training LLMs on a mixture of natural and programming languages, the authors observed that mixed-data training did not outperform domain-specific counterparts given the same compute budget. However, the mixed data approach efficiently leveraged training signals across both domains, suggesting it as a viable strategy if cross-domain application is a priority and the compute budget permits extended training.
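
As an illustration of the mixed-data setup, the sketch below draws training documents from code and natural-language corpora according to fixed mixture weights. The corpora, the 0.7/0.3 split, and the function name are hypothetical stand-ins, not the actual distributions used in the paper.

```python
import random

def mixture_sampler(sources, weights, rng):
    """Yield (source_name, document) pairs according to mixture weights."""
    names = list(sources)
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(sources[name])

corpora = {
    "code": ["def f(x): return x * 2", "print('hello')"],
    "text": ["The quick brown fox jumps.", "Language models keep scaling."],
}
sampler = mixture_sampler(corpora, {"code": 0.7, "text": 0.3}, random.Random(0))
for _ in range(4):
    print(next(sampler))
```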

Lesson 5: Multi-epoch Training

The application of multi-epoch training, as exemplified by the CodeGen2.5 model, showed considerable performance improvements on the HumanEval benchmark. The method involved repeating observations with span corruption to effectively augment the data. This finding supports the hypothesis that LLMs benefit from repeated exposure to data, although the specifics of span corruption and learning-rate decay warrant further investigation.
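
The underlying idea, that repetition combined with freshly sampled span corruption acts as data augmentation, can be sketched as follows. The corruption routine, seeding, and epoch loop here are simplified assumptions meant only to show why repeated passes need not expose the model to identical sequences.

```python
import random

def corrupt(tokens, rng, span_len=3):
    """Re-sample a corruption each pass (single-span stand-in for the full
    span-corruption procedure sketched under Lesson 3)."""
    start = rng.randrange(max(1, len(tokens) - span_len))
    return (tokens[:start] + ["<mask_1>"] + tokens[start + span_len:]
            + ["<sep>", "<mask_1>"] + tokens[start:start + span_len] + ["<eom>"])

def multi_epoch_stream(docs, num_epochs, seed=0):
    """Repeat the corpus for several epochs; each pass re-shuffles and
    re-corrupts, so the repetition behaves like augmentation."""
    for epoch in range(num_epochs):
        rng = random.Random(seed + epoch)
        order = list(docs)
        rng.shuffle(order)
        for doc in order:
            yield epoch, corrupt(doc.split(), rng)

for epoch, example in multi_epoch_stream(["def add(a, b): return a + b"], num_epochs=2):
    print(epoch, example)
```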

Practical and Theoretical Implications

The practical implications of this paper are significant for researchers and practitioners aiming to develop versatile and efficient LLMs. By providing empirically backed insights, the research outlines a more streamlined approach to model architecture, objective functions, and data handling, thereby reducing the computational overhead and complexity of training LLMs. Theoretically, the paper contributes to the ongoing discourse on optimal LLM training strategies, particularly in the context of program synthesis and multi-modal data utilization.

Future Directions

Future research avenues suggested by this work include more granular ablations of the multi-epoch training process to isolate contributing factors to its success, refining infill sampling techniques to minimize computational overhead, and exploring further simplifications of mixed objectives to balance performance and computational efficiency.

Conclusion

"CodeGen2: Lessons for Training LLMs on Programming and Natural Languages" offers valuable insights and pragmatic strategies for enhancing LLM training efficacy. While full unification of the proposed aspects remains elusive, the distilled lessons provide a solid foundation for future advancements in LLM development. The open-sourcing of CodeGen2 models and their training framework promises to facilitate continued innovation and application in the community.

Authors (5)
  1. Erik Nijkamp
  2. Hiroaki Hayashi
  3. Caiming Xiong
  4. Silvio Savarese
  5. Yingbo Zhou