SelfCodeAlign: Self-Alignment for Code Generation
Abstract: Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of LLMs to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.
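The validate-then-select loop described above can be illustrated with a short sketch. The Python below is a minimal, hypothetical rendering of the execution-based filtering step, not the paper's actual implementation: `base_model.generate`, `base_model.generate_tests`, and the subprocess-based `run_in_sandbox` are assumed interfaces, and a real pipeline would run candidates in a hardened sandbox rather than a bare subprocess.

```python
# Minimal sketch of SelfCodeAlign-style execution-based filtering.
# `base_model.generate` / `base_model.generate_tests` are assumed APIs
# standing in for whatever sampling interface the real pipeline uses.
import subprocess
import sys
import tempfile


def run_in_sandbox(program: str, timeout_s: float = 10.0) -> bool:
    """Run a candidate solution concatenated with its tests.

    A zero exit code means every assertion passed. A real deployment
    would isolate execution (containers, resource limits) instead of
    relying on a plain subprocess with a timeout.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def self_align(base_model, tasks, n_samples: int = 10):
    """For each self-generated task, sample responses from the same base
    model, pair each with model-written tests, and keep only pairs whose
    tests pass. Passing pairs form the instruction-tuning dataset."""
    dataset = []
    for task in tasks:
        for _ in range(n_samples):
            response = base_model.generate(task)               # assumed API
            tests = base_model.generate_tests(task, response)  # assumed API
            if run_in_sandbox(response + "\n\n" + tests):
                dataset.append({"instruction": task, "response": response})
                break  # keep the first validated pair for this task
    return dataset
```

The key design point this sketch mirrors is that the same base model produces the tasks, the responses, and the tests, so the only external supervision signal is concrete execution feedback from the sandbox.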
References
- AVATAR: A parallel corpus for Java-Python program translation. arXiv preprint arXiv:2108.11590, 2021.
- Anthropic. Terms of service, July 2023. Accessed: August 17, 2023.
- Program synthesis with large language models. CoRR, abs/2108.07732, 2021.
- LongAlign: A recipe for long context alignment of large language models, 2024.
- Knowledge transfer from high-resource to low-resource programming languages for code LLMs, 2024.
- Can it edit? Evaluating the ability of large language models to follow code editing instructions. In The First International Workshop on Large Language Model for Code, 2024.
- S. Chaudhary. Code Alpaca: An instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca, 2023.
- Evaluating large language models trained on code, 2021.
- Large language models for compiler optimization. arXiv preprint arXiv:2309.07062, 2023.
- Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models, 2023.
- Horizon-length prediction: Advancing fill-in-the-middle capabilities for code generation with lookahead planning. arXiv preprint arXiv:2410.03103, 2024.
- XFT: Unlocking the power of code instruction tuning by simply merging upcycled mixture-of-experts. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12941–12955, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics.
- ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation. arXiv preprint arXiv:2308.01861, 2023.
- The Llama 3 herd of models, 2024.
- RLEF: Grounding code LLMs in execution feedback with reinforcement learning, 2024.
- A. Gomez. Introducing Command R+: A scalable LLM built for business, April 4, 2024. Accessed: 2024-05-22.
- Google. Generative AI terms of service, August 2023. Accessed: August 17, 2023.
- Textbooks are all you need, 2023.
- DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence, 2024.
- Language models can teach themselves to program better. In The Eleventh International Conference on Learning Representations, 2023.
- LiveCodeBench: Holistic and contamination-free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- Mistral 7b, 2023.
- Mixtral of experts, 2024.
- Impact of code language models on automated program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1430–1442. IEEE, 2023.
- SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- InferFix: End-to-end program repair with LLMs. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1646–1656, 2023.
- The Stack: 3 TB of permissively licensed source code, 2022.
- OpenAssistant Conversations: Democratizing large language model alignment. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- DS-1000: A natural and reliable benchmark for data science code generation, 2022.
- CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022.
- CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 919–931. IEEE, 2023.
- StarCoder: May the source be with you!, 2023.
- Self-alignment with instruction backtranslation. In The Twelfth International Conference on Learning Representations, 2024.
- Learning code preference via synthetic evolution. arXiv preprint arXiv:2410.03837, 2024.
- Large language model-based agents for software engineering: A survey. arXiv preprint arXiv:2409.02977, 2024.
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Evaluating language models for efficient code generation. In First Conference on Language Modeling, 2024.
- StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
- WizardCoder: Empowering code large language models with Evol-Instruct, 2023.
- Large language model guided protocol fuzzing. In Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS), 2024.
- OctoPack: Instruction tuning code large language models, 2023.
- nickrosh. Open Source Implementation of Evol-Instruct-Code. https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1, 2023.
- CodeGen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023.
- OpenAI. ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/, 2022.
- OpenAI. GPT-4 technical report, 2023.
- OpenAI. Terms of service, March 2023. Accessed: August 17, 2023.
- OpenAI. GPT-4o system card, 2024.
- Training language models to follow instructions with human feedback, 2022.
- Understanding the effectiveness of large language models in code translation. arXiv preprint arXiv:2308.03109, 2023.
- Improving language understanding by generative pre-training, 2018.
- Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’20. IEEE Press, 2020.
- Snowflake AI Research. Snowflake Arctic: The best LLM for enterprise AI - efficiently intelligent, truly open, April 24, 2024. Accessed: 2024-05-22.
- Unsupervised translation of programming languages. Advances in neural information processing systems, 33:20601–20611, 2020.
- Code Llama: Open foundation models for code, 2023.
- N. Shazeer and M. Stern. Adafactor: Adaptive learning rates with sublinear memory cost, 2018.
- Learning performance-improving code edits. arXiv preprint arXiv:2302.07867, 2023.
- SALMON: Self-alignment with instructable reward models. In The Twelfth International Conference on Learning Representations, 2024.
- Principle-driven self-alignment of language models from scratch with minimal human supervision. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- CodeGemma: Open code models based on Gemma, 2024.
- Gemini Team. Gemini: A family of highly capable multimodal models, 2024.
- Qwen Team. Code with CodeQwen1.5, April 16, 2024. Accessed: 2024-05-20.
- theblackcat102. The evolved code alpaca dataset. https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1, 2023.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- OpenHands: An open platform for AI software developers as generalist agents, 2024.
- Self-instruct: Aligning language models with self-generated instructions. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada, July 2023. Association for Computational Linguistics.
- CodeT5+: Open code large language models for code understanding and generation, 2023.
- CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8696–8708, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics.
- Arctic-SnowCoder: Demystifying high-quality data in code pretraining. arXiv preprint arXiv:2409.02326, 2024.
- Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120, 2023.
- Copiloting the copilots: Fusing large language models with completion engines for automated program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, page 172–184, New York, NY, USA, 2023. Association for Computing Machinery.
- CodeUltraFeedback: An LLM-as-a-judge dataset for aligning large language models to coding preferences. arXiv preprint arXiv:2403.09032, 2024.
- Agentless: Demystifying LLM-based software engineering agents. arXiv preprint, 2024.
- Top leaderboard ranking = top coding proficiency, always? EvoEval: Evolving coding benchmarks via LLM. arXiv preprint arXiv:2403.19114, 2024.
- Universal fuzzing via large language models, 2023.
- Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494, 2023.
- C. S. Xia and L. Zhang. Less training, more repairing please: Revisiting automated program repair via zero-shot learning, 2022.
- SWE-agent: Agent-computer interfaces enable automated software engineering, 2024.
- WaveCoder: Widespread and versatile enhanced instruction tuning with refined data generation, 2024.
- Self-rewarding language models, 2024.
- OpenCodeInterpreter: Integrating code generation with execution and refinement. arXiv preprint arXiv:2402.14658, 2024.
- BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions, 2024.