A Survey on Large Language Models for Code Generation (2406.00515v2)

Published 1 Jun 2024 in cs.CL, cs.AI, and cs.SE

Abstract: LLMs have achieved remarkable advances across diverse code-related tasks, giving rise to so-called Code LLMs, particularly in code generation, which produces source code from natural language descriptions. This burgeoning field has captured significant interest from both academic researchers and industry professionals due to its practical significance in software development, e.g., GitHub Copilot. Despite the active exploration of LLMs for a variety of code tasks, from the perspectives of NLP, software engineering (SE), or both, there is a noticeable absence of a comprehensive and up-to-date literature review dedicated to LLMs for code generation. In this survey, we aim to bridge this gap by providing a systematic literature review that serves as a valuable reference for researchers investigating the cutting-edge progress in LLMs for code generation. We introduce a taxonomy to categorize and discuss recent developments in LLMs for code generation, covering aspects such as data curation, latest advances, performance evaluation, ethical implications, environmental impact, and real-world applications. In addition, we present a historical overview of the evolution of LLMs for code generation and offer an empirical comparison using the HumanEval, MBPP, and BigCodeBench benchmarks across various levels of difficulty and types of programming tasks to highlight the progressive enhancements in LLM capabilities for code generation. We identify critical challenges and promising opportunities regarding the gap between academia and practical development. Furthermore, we have established a dedicated resource GitHub page (https://github.com/juyongjiang/CodeLLMsurvey) to continuously document and disseminate the most recent advances in the field.

A Survey on LLMs for Code Generation

The paper "A Survey on LLMs for Code Generation" undertakes a systematic review of the advancements and practices in leveraging LLMs for code generation tasks. This document stands as an essential reference for researchers exploring both the progression and current state of LLMs tailored for code generation.

At the outset, the paper distinguishes itself by highlighting the gap in the literature surrounding code LLMs. Despite the rapid exploration of LLMs for code-related tasks, a comprehensive and up-to-date review specific to code generation has been absent. The authors address this void by presenting a taxonomy that organizes recent advancements into pre-training datasets, instruction tuning datasets, performance evaluation, and real-world applications.

Key Points from the Paper

  1. Pre-training and Model Architectures: The paper details the use of large-scale unlabeled datasets in pre-training LLMs, capturing essential coding semantics, structures, and dependencies. It introduces models categorized as either encoder-decoder or decoder-only, and it explains the two dominant pre-training objectives, causal language modeling and denoising autoencoding (the causal objective is written out after this list).
  2. Instruction Tuning: Instruction tuning is emphasized as critical for making models follow task-specific prompts effectively. It leverages both proprietary and public datasets that are reformatted into instruction-style examples, aligning LLMs more closely with code generation tasks (a sample record appears after this list).
  3. Advances in Data Synthesis: Because high-quality labeled data is scarce, synthetic data has emerged as a pivotal resource for expanding the instruction-following capabilities of LLMs. The paper outlines pipelines such as Self-Instruct and Evol-Instruct that generate and iteratively refine synthetic datasets, meaningfully enhancing model performance (a minimal Self-Instruct-style loop is sketched below).
  4. Evaluation Benchmarks: Benchmarks such as HumanEval and MBPP, which are crucial in assessing LLM capabilities, are examined in detail; both are typically scored with execution-based metrics such as pass@k (an estimator is given after this list). The authors stress the importance of developing comprehensive benchmarks that reflect the complexities of real-world software development tasks.
  5. Prompting, Repository-Level, and Retrieval-Augmented Generation: Customized prompting methods and retrieval-augmented generation strategies enhance LLM output by providing structured, context-rich inputs. Techniques like self-debugging, reflective learning, and retrieval augmentation are highlighted as essential for optimizing LLMs in code contexts (a retrieval-augmented prompt sketch closes the examples below).
  6. Applications and Real-world Impact: The practical applications of code LLMs are exemplified by tools such as GitHub Copilot and Amazon CodeWhisperer. These tools illustrate how LLMs improve coding efficiency and carry advances across both academic inquiry and industry application.
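
For concreteness, the causal language modeling objective named in point 1 is the standard next-token log-likelihood; the notation below is the textbook form, not taken from the survey itself:

$$
\mathcal{L}_{\mathrm{CLM}}(\theta) = -\sum_{t=1}^{T} \log P_{\theta}\left(x_t \mid x_{1}, \ldots, x_{t-1}\right)
$$

Denoising autoencoding instead corrupts or masks spans of the input sequence and trains the model to reconstruct the original tokens.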
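
The instruction-style examples mentioned in point 2 commonly follow an Alpaca-style schema; the field names and content below are illustrative, not drawn from the survey:

```python
# A typical instruction-tuning record (illustrative Alpaca-style schema).
record = {
    "instruction": "Write a Python function that reverses a string.",
    "input": "",  # optional extra context for the task
    "output": "def reverse_string(s: str) -> str:\n    return s[::-1]",
}
```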
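
Below is a minimal sketch of a Self-Instruct-style synthesis round from point 3, assuming a hypothetical `generate` completion function; the ROUGE-L novelty filter and its 0.7 threshold follow the original Self-Instruct recipe rather than details quoted from the survey:

```python
import random

def rouge_l_similarity(a: str, b: str) -> float:
    """ROUGE-L F-measure over whitespace tokens (LCS-based)."""
    x, y = a.split(), b.split()
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xt in enumerate(x):
        for j, yt in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xt == yt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(x)][len(y)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(x), lcs / len(y)
    return 2 * precision * recall / (precision + recall)

def self_instruct_round(pool: list[str], generate, n_demos: int = 8) -> list[str]:
    """Grow the instruction pool by one synthesis round."""
    demos = random.sample(pool, min(n_demos, len(pool)))
    prompt = "Write one new programming task in the style of:\n" + "\n".join(f"- {d}" for d in demos)
    candidate = generate(prompt)  # hypothetical LLM completion call
    # Keep the candidate only if it is sufficiently novel relative to the pool.
    if all(rouge_l_similarity(candidate, d) < 0.7 for d in pool):
        pool.append(candidate)
    return pool
```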
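
The benchmarks in point 4 are conventionally scored with pass@k. The unbiased estimator below is the one introduced alongside HumanEval (Chen et al., 2021), reproduced here for reference:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with n = 200 generations of which c = 40 pass, pass_at_k(200, 40, 1) returns 0.2, while pass_at_k(200, 40, 10) is roughly 0.9.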
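
Finally, the retrieval-augmented setup in point 5 reduces to retrieving related snippets and prepending them to the prompt. The sketch below uses a toy bag-of-words cosine similarity as the retriever and a hypothetical `generate` call; production systems would substitute learned embeddings or BM25:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k corpus snippets most similar to the query."""
    q = Counter(query.split())
    return sorted(corpus, key=lambda s: cosine(q, Counter(s.split())), reverse=True)[:top_k]

def rag_generate(task: str, corpus: list[str], generate) -> str:
    """Prepend retrieved context to the task description before generation."""
    context = "\n\n".join(retrieve(task, corpus))
    prompt = f"# Relevant code:\n{context}\n\n# Task:\n{task}\n# Solution:\n"
    return generate(prompt)  # hypothetical LLM completion call
```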

Challenges and Future Directions

The paper doesn't shy away from addressing the ongoing challenges faced by the field. Key issues include bridging the gap between academia and practical software development, refining model architectures to better capture code's structural properties, and improving benchmarks to more comprehensively evaluate model performance. The paper also identifies opportunities for LLMs to extend their applicability to low-resource and domain-specific programming languages, suggesting a path forward for more specialized, adaptable models. Further, continuous learning strategies that keep LLMs updated and aligned with ever-evolving programming practices are posited as a crucial area for future research.

In conclusion, this survey not only delineates the considerable strides made in code generation through LLMs but also posits a multitude of directions for future enhancement. By comprehensively documenting the landscape, this paper serves as a foundational text for researchers and practitioners aiming to navigate and innovate within the rapidly evolving intersection of machine learning and software development.

Authors (5)
  1. Juyong Jiang (14 papers)
  2. Fan Wang (312 papers)
  3. Jiasi Shen (6 papers)
  4. Sungju Kim (3 papers)
  5. Sunghun Kim (44 papers)
Citations (73)