SelfCodeAlign: Self-Alignment for Code Generation (2410.24198v2)

Published 31 Oct 2024 in cs.CL, cs.LG, and cs.SE

Abstract: Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of LLMs to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.

SelfCodeAlign: Self-Alignment for Code Generation

The paper, "SelfCodeAlign: Self-Alignment for Code Generation," presents an innovative pipeline for enhancing the capabilities of code generation models through a process termed self-alignment. The authors introduce a fully transparent approach that circumvents the dependencies on human annotations and distillation from larger proprietary LLMs, which traditionally acts as a bottleneck in the scaling and legal usability of the models.

Key Contributions

  1. SelfCodeAlign Methodology: The paper introduces SelfCodeAlign, a pipeline that lets a code generation model align itself using self-generated instruction data. The pipeline proceeds in stages: extracting coding concepts from seed snippets, generating new instructions, generating responses paired with test cases, and self-validating the responses by executing them in a sandbox environment (a hedged sketch of this loop is given after this list).
  2. Evaluation Results: In the primary experiments, the authors run SelfCodeAlign with CodeQwen1.5-7B to produce a dataset of 74,000 instruction-response pairs. Fine-tuning on this dataset yields a model that reaches 67.1 pass@1 on HumanEval+ (the pass@1 metric is recalled in a note after this list), surpassing CodeLlama-70B-Instruct despite being roughly ten times smaller.
  3. Benchmark Performance: The SelfCodeAlign-trained models perform competitively or better across a range of benchmarks, including HumanEval+, MBPP, LiveCodeBench, EvoEval, and EvalPerf, and consistently outperform the same base model instruction-tuned with OctoPack, the previous state of the art for instruction tuning without human annotations or distillation.
  4. Instruction Generation without Distillation: The pipeline reaches state-of-the-art coding performance without distilling from proprietary systems such as GPT-3.5 or GPT-4o; in the reported comparisons it outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods such as OSS-Instruct and Evol-Instruct. Avoiding proprietary teachers also allows the resulting data and models to be distributed under permissive terms.
  5. Broad Applicability Across Model Sizes: The pipeline's effectiveness is demonstrated across model sizes from 3 billion to 33 billion parameters, and the experiments indicate that base models benefit more from alignment with data drawn from their own distribution.
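
To make the data-generation loop concrete, below is a minimal, hypothetical sketch of the stages described in item 1. It is not the authors' implementation: the model methods (extract_concepts, write_task, answer_with_tests), the sample count, and the subprocess-based sandbox are illustrative assumptions, and the paper's actual prompting and sandboxing details differ.

```python
import subprocess
import sys
import tempfile

def run_in_sandbox(solution: str, tests: str, timeout: float = 10.0) -> bool:
    """Run a candidate solution together with its generated tests in a
    subprocess and report whether everything passed. A production sandbox
    would add real isolation (containers, resource limits); this is a sketch."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def self_align(base_model, seed_snippets, samples_per_task=10):
    """Hypothetical outline of the self-alignment loop: the same base model
    extracts concepts, writes a task, and answers it with test cases; only
    execution-validated pairs are kept for instruction tuning."""
    dataset = []
    for snippet in seed_snippets:
        concepts = base_model.extract_concepts(snippet)        # stage 1: concept extraction
        instruction = base_model.write_task(concepts)          # stage 2: instruction generation
        for _ in range(samples_per_task):                      # stage 3: sample multiple responses
            response, tests = base_model.answer_with_tests(instruction)
            if run_in_sandbox(response, tests):                # stage 4: sandbox validation
                dataset.append({"instruction": instruction,
                                "response": response})         # stage 5: keep passing pairs
                break
    return dataset
```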
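
For reference, the pass@1 scores quoted here use the standard pass@k metric from the Codex evaluation work: generate n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k drawn samples is correct. A small helper for the unbiased estimator, included only as a reminder of how the metric is computed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), i.e. the chance
    that at least one of k samples drawn from n generations (c of them
    correct) passes all tests."""
    if n - c < k:  # fewer incorrect samples than k, so some draw must be correct
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 4 correct -> pass@1 = 0.4
print(pass_at_k(10, 4, 1))
```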

Theoretical and Practical Implications

Theoretically, this work shows that a model's own capabilities can be harnessed to generate its alignment data, challenging the prevailing assumption that a stronger teacher model is required. Practically, SelfCodeAlign is a significant step toward democratizing advanced code generation tools: it reduces reliance on costly human annotation and avoids the licensing restrictions attached to proprietary models, opening the path to more widespread and ethically sound deployment of such models.

Future Directions

The research invites several potential future directions. Extending the method to handle long-context instruction-response pairs could be significant, broadening the applicability of the approach. Furthermore, incorporating reinforcement learning from self-generated negatives, simplifying the process of generating reliable test cases, and expanding the evaluation to more complex programming tasks are promising areas for further exploration.

Conclusion

Overall, this paper makes a compelling case for the benefits of self-alignment in training code generation models, with a methodology that is both innovative and practical. By demonstrating strong performance without the human annotation or distillation that traditional approaches require, it sets a new standard for the development of open code generation technologies.

Authors (10)
  1. Yuxiang Wei
  2. Federico Cassano
  3. Jiawei Liu
  4. Yifeng Ding
  5. Naman Jain
  6. Zachary Mueller
  7. Harm de Vries
  8. Leandro von Werra
  9. Arjun Guha
  10. Lingming Zhang