SelfCodeAlign: Self-Alignment for Code Generation (2410.24198v2)

Published 31 Oct 2024 in cs.CL, cs.LG, and cs.SE

Abstract: Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of LLMs to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.

SelfCodeAlign: Self-Alignment for Code Generation

The paper, "SelfCodeAlign: Self-Alignment for Code Generation," presents an innovative pipeline for enhancing the capabilities of code generation models through a process termed self-alignment. The authors introduce a fully transparent approach that circumvents the dependencies on human annotations and distillation from larger proprietary LLMs, which traditionally acts as a bottleneck in the scaling and legal usability of the models.

Key Contributions

  1. SelfCodeAlign Methodology: The paper introduces SelfCodeAlign, a pipeline that lets a code generation model align itself using self-generated instruction data. The pipeline proceeds in stages: extracting coding concepts from seed snippets, generating new instructions, generating responses paired with test cases, and self-validating the responses by executing them in a sandbox environment (a hedged sketch of this loop is given after this list).
  2. Evaluation Results: In the primary experiments, the authors run SelfCodeAlign with CodeQwen1.5-7B to produce a dataset of 74,000 instruction-response pairs. Fine-tuning on this dataset yields a model that reaches 67.1 pass@1 on HumanEval+ (the pass@1 metric is recalled in a note after this list), surpassing CodeLlama-70B-Instruct despite being roughly ten times smaller.
  3. Benchmark Performance: The SelfCodeAlign-trained models perform competitively or better across a range of benchmarks, including HumanEval+, MBPP, LiveCodeBench, EvoEval, and EvalPerf, and consistently outperform the same base model instruction-tuned with OctoPack, the previous state of the art for instruction tuning without human annotations or distillation.
  4. Instruction Generation without Distillation: The pipeline reaches state-of-the-art coding performance without distilling from proprietary systems such as GPT-3.5 or GPT-4o; in the reported comparisons it outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods such as OSS-Instruct and Evol-Instruct. Avoiding proprietary teachers also allows the resulting data and models to be distributed under permissive terms.
  5. Broad Applicability Across Model Sizes: The pipeline's effectiveness is demonstrated across model sizes from 3 billion to 33 billion parameters, and the experiments indicate that base models benefit more from alignment with data drawn from their own distribution.
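
To make the data-generation loop concrete, below is a minimal, hypothetical sketch of the stages described in item 1. It is not the authors' implementation: the model methods (extract_concepts, write_task, answer_with_tests), the sample count, and the subprocess-based sandbox are illustrative assumptions, and the paper's actual prompting and sandboxing details differ.

```python
import subprocess
import sys
import tempfile

def run_in_sandbox(solution: str, tests: str, timeout: float = 10.0) -> bool:
    """Run a candidate solution together with its generated tests in a
    subprocess and report whether everything passed. A production sandbox
    would add real isolation (containers, resource limits); this is a sketch."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def self_align(base_model, seed_snippets, samples_per_task=10):
    """Hypothetical outline of the self-alignment loop: the same base model
    extracts concepts, writes a task, and answers it with test cases; only
    execution-validated pairs are kept for instruction tuning."""
    dataset = []
    for snippet in seed_snippets:
        concepts = base_model.extract_concepts(snippet)        # stage 1: concept extraction
        instruction = base_model.write_task(concepts)          # stage 2: instruction generation
        for _ in range(samples_per_task):                      # stage 3: sample multiple responses
            response, tests = base_model.answer_with_tests(instruction)
            if run_in_sandbox(response, tests):                # stage 4: sandbox validation
                dataset.append({"instruction": instruction,
                                "response": response})         # stage 5: keep passing pairs
                break
    return dataset
```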
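
For reference, the pass@1 scores quoted here use the standard pass@k metric from the Codex evaluation work: generate n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k drawn samples is correct. A small helper for the unbiased estimator, included only as a reminder of how the metric is computed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), i.e. the chance
    that at least one of k samples drawn from n generations (c of them
    correct) passes all tests."""
    if n - c < k:  # fewer incorrect samples than k, so some draw must be correct
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 4 correct -> pass@1 = 0.4
print(pass_at_k(10, 4, 1))
```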

Theoretical and Practical Implications

Theoretically, this work shows that a model's own capabilities can be harnessed to generate its alignment data, challenging the prevailing assumption that a stronger teacher model is required. Practically, SelfCodeAlign is a significant step toward democratizing advanced code generation tools: it reduces reliance on costly human annotation and avoids the licensing restrictions attached to proprietary models, opening the path to more widespread and ethically sound deployment of such models.

Future Directions

The research invites several potential future directions. Extending the method to handle long-context instruction-response pairs could be significant, broadening the applicability of the approach. Furthermore, incorporating reinforcement learning from self-generated negatives, simplifying the process of generating reliable test cases, and expanding the evaluation to more complex programming tasks are promising areas for further exploration.

Conclusion

Overall, this paper makes a compelling case for the benefits of self-alignment in training code generation models, with a methodology that is both innovative and practical. By demonstrating strong performance without the human annotation or distillation that traditional approaches require, it sets a new standard for the development of open code generation technologies.

Authors (10)
  1. Yuxiang Wei
  2. Federico Cassano
  3. Jiawei Liu
  4. Yifeng Ding
  5. Naman Jain
  6. Zachary Mueller
  7. Harm de Vries
  8. Leandro von Werra
  9. Arjun Guha
  10. Lingming Zhang