
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis (2203.13474v5)

Published 25 Mar 2022 in cs.LG, cs.CL, and cs.PL

Abstract: Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of LLMs advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of LLMs up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in multi-turn fashion significantly improves program synthesis over that provided as a single turn. We make the training library JAXFORMER and model checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.

Multi-Turn Program Synthesis with CodeGen: A Comprehensive Analysis

The paper "CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis" investigates program synthesis with LLMs. The authors present a family of models named CodeGen, notable for being trained on both natural language and programming language data at scales up to 16.1 billion parameters. This research provides insights into the evolving capabilities of LLMs in program synthesis, especially in a novel multi-turn paradigm.

Model Training and Dataset

The CodeGen models are trained sequentially on three datasets: ThePile, BigQuery, and BigPython. Each dataset presents unique characteristics, from general natural language data in ThePile to multi-lingual and mono-lingual programming code in BigQuery and BigPython, respectively. The comprehensive pre-processing—including filtering, deduplication, tokenization, shuffling, and concatenation—ensures a robust training dataset, allowing the models to generalize across different contexts effectively.
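The pre-processing steps listed above can be sketched as a small pipeline. This is a simplified stand-in, not the paper's actual implementation: the hash-based exact deduplication, the whitespace tokenizer used in the example, and the function names are all illustrative assumptions.

```python
import hashlib
import random


def deduplicate(docs):
    """Drop exact-duplicate documents by content hash (a simplified
    stand-in for the paper's deduplication step)."""
    seen, unique = set(), []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def build_training_stream(docs, tokenize, seed=0):
    """Filter empty documents, deduplicate, shuffle, then tokenize and
    concatenate everything into one flat token stream."""
    docs = [d for d in docs if d.strip()]      # filtering
    docs = deduplicate(docs)                   # deduplication
    random.Random(seed).shuffle(docs)          # shuffling
    stream = []
    for doc in docs:
        stream.extend(tokenize(doc))           # tokenization + concatenation
    return stream
```

In practice each stage (e.g. near-duplicate detection, BPE tokenization) is far more involved, but the ordering of the stages matches the description above.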

The models are standard transformer-based autoregressive LLMs, trained in several configurations (350M, 2.7B, 6.1B, and 16.1B parameters) to evaluate scaling effects. A custom library, JAXformer, optimized for TPU-v4 hardware, enables efficient large-scale training. The models are trained sequentially across the datasets, with each stage initialized from the weights of its predecessor, ensuring progressive learning.
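As a rough illustration of what "autoregressive" means for a causal LM of this kind, here is a minimal greedy decoding loop. The callable `next_token_logprobs` is a hypothetical stand-in for the model's forward pass (it is not a JAXformer API), mapping the tokens so far to a distribution over the next token.

```python
def autoregressive_generate(next_token_logprobs, prompt_tokens,
                            max_new=16, eos=None):
    """Minimal greedy autoregressive decoding: repeatedly pick the most
    likely next token and append it, stopping at `eos` or `max_new`."""
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        dist = next_token_logprobs(tokens)        # tokens -> {token: logprob}
        nxt = max(dist, key=dist.get)             # greedy choice
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens
```

Sampling-based decoding (used for pass@k evaluation) replaces the greedy `max` with a draw from the softmax distribution at some temperature.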

Single-Turn Evaluation

The evaluation starts with a single-turn synthesis task on the HumanEval benchmark. This benchmark involves 164 Python programming problems, each requiring functional code generation from a given prompt. The evaluation employs the pass@k metric (with k ∈ {1, 10, 100}), which measures the functional correctness of generated programs. The results indicate that the CodeGen models, particularly the largest configuration, approach the performance of OpenAI's Codex models.
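The pass@k metric is typically computed with the unbiased estimator introduced alongside HumanEval: for each problem, generate n samples, count the c that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples is correct. A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes the tests.  Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 2 generations of which c = 1 is correct, pass@1 is 0.5. Benchmark scores average this quantity over all problems.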

Multi-Turn Program Synthesis

A significant contribution of the paper is the introduction of a multi-turn paradigm for program synthesis. The authors argue that decomposing user intent into multiple, manageable prompts enhances the LLM's understanding and, consequently, the quality of the synthesized programs. To facilitate this, they introduce the Multi-Turn Programming Benchmark (MTPB), comprising 115 diverse problems that require multi-step communication between the user and the model.
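The multi-turn interaction can be pictured as a loop that interleaves user sub-prompts with model completions in one growing program context. This is a sketch under stated assumptions: `model` is a hypothetical prompt-to-completion callable, and each sub-prompt is injected as a code comment, loosely mirroring the turn format described for MTPB.

```python
def multi_turn_synthesis(model, turns):
    """Feed each natural-language sub-prompt (as a comment) into the
    growing program context, let the model extend the program, and carry
    the completion forward into the next turn."""
    context = ""
    for turn in turns:
        context += f"# {turn}\n"         # sub-problem stated as a comment
        completion = model(context)      # model continues the program
        context += completion + "\n"
    return context
```

The key property is that every turn conditions on all previous prompts and all previously generated subprograms, so later steps can reuse variables and functions introduced earlier.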

Evaluation on the MTPB reveals that the multi-turn approach significantly outperforms the single-turn paradigm. The multi-turn method's pass rates improve as the model and data size increase, suggesting a strong correlation between model scale and synthesis quality. The use of multi-turn prompts reduces the perplexity of user specifications, improving intent understanding and leading to better program synthesis outcomes.
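Perplexity here is the exponentiated average negative log-likelihood of the specification's tokens under the model; lower values mean the model finds the specification easier to predict. A minimal helper, assuming per-token natural-log probabilities are available from the model:

```python
import math


def perplexity(log_probs):
    """Perplexity of a token sequence given its per-token natural-log
    probabilities under the model: exp(-mean log p)."""
    return math.exp(-sum(log_probs) / len(log_probs))
```

A sequence the model predicts perfectly (all log-probabilities 0) has perplexity 1; less predictable specifications score higher, which is the sense in which multi-turn decomposition "reduces the perplexity" of user intent.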

Implications and Future Directions

The research presented holds several implications for both practical applications and theoretical advancements in AI. Practically, making the training library (JAXformer) and the model checkpoints openly available democratizes access to large-scale program synthesis models, fostering further research and innovation. Theoretically, the success of multi-turn synthesis opens new avenues for exploring more interactive and iterative forms of human-AI collaboration.

Future research could explore several interesting directions, such as:

  1. Refinement of User Intent Understanding: Investigating more sophisticated methods for breaking down user intents and contextually adapting model responses over extended interaction sequences.
  2. Robustness and Safety: Developing mechanisms to ensure the robustness of generated code, including handling edge cases and ensuring the security and reliability of AI-generated programs.
  3. Cross-Domain Applications: Expanding the application of multi-turn program synthesis beyond traditional coding tasks to other domains like data science, algorithm design, and even automated research assistants.

Conclusion

CodeGen and its multi-turn program synthesis approach mark a significant step forward in leveraging LLMs for complex coding tasks. Through meticulous training, an innovative evaluation benchmark, and a focus on open access, the paper demonstrates the evolving capabilities of AI in understanding and generating code. This work lays a solid foundation for future advancements in program synthesis, promoting a deeper integration of natural language understanding and code generation.

By demonstrating the efficacy of multi-turn synthesis and providing valuable tools for the research community, the paper contributes significantly to the ongoing development of intelligent coding assistants and interactive AI systems.

Authors (8)
  1. Erik Nijkamp (22 papers)
  2. Bo Pang (77 papers)
  3. Hiroaki Hayashi (17 papers)
  4. Lifu Tu (19 papers)
  5. Huan Wang (211 papers)
  6. Yingbo Zhou (81 papers)
  7. Silvio Savarese (200 papers)
  8. Caiming Xiong (337 papers)
Citations (814)