CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences (2403.09032v2)

Published 14 Mar 2024 in cs.SE, cs.CL, and cs.LG

Abstract: Evaluating the alignment of LLMs with user-defined coding preferences is a challenging endeavour that requires a deep assessment of LLMs' outputs. Existing methods and benchmarks rely primarily on automated metrics and static analysis tools, which often fail to capture the nuances of user instructions and LLM outputs. To address this gap, we propose using the LLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding preferences. Based on this approach, we present CodeUltraFeedback, a comprehensive dataset designed to facilitate the evaluation and improvement of LLM alignment. CodeUltraFeedback consists of 10,000 coding instructions, each annotated with four responses generated from a diverse pool of 14 LLMs. These responses are ranked based on five distinct coding preferences using GPT-3.5 as a judge, providing both numerical scores and detailed textual feedback. Our analysis of CodeUltraFeedback reveals that responses from GPT-3.5 and GPT-4 are generally preferred over those from open-weight LLMs, highlighting significant differences in alignment between closed and open-weight models. In turn, we explore the usage of CodeUltraFeedback as feedback data to fine-tune and align CodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO). The resulting aligned CodeLlama-7B-Instruct model outperforms larger LLMs in terms of alignment with coding preferences and shows improved functional correctness on the HumanEval+ benchmark compared to the original instruct model. Therefore, our contributions bridge the gap in preference tuning of LLMs for code and set the stage for further advancements in model alignment and RLAIF in automated software engineering.

Aligning LLMs to Coding Preferences: Introducing CodeUltraFeedback and CODAL-Bench

Introduction to CodeUltraFeedback and CODAL-Bench

Recent advancements have significantly extended the capabilities of LLMs in the domain of code generation, presenting new challenges and opportunities in aligning these models with specific coding preferences. A paramount issue in current research is the assessment of LLM-generated code, particularly in the context of non-functional requirements such as code readability, efficiency, and adherence to best practices. Traditional benchmarks do not adequately address these criteria, focusing instead on functional correctness or using rigid metrics that fail to capture the nuanced requirements of developers and users. In this paper, we present CodeUltraFeedback, a preference dataset of 10,000 complex instructions, and CODAL-Bench, a benchmark for evaluating LLM alignment across five coding preferences: instruction following, code explanation, code complexity and efficiency, code readability, and coding style.
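
To make the dataset's structure concrete, the following sketch models a single annotated record as described above: one instruction, one preference dimension, and four judged responses. This is a hypothetical schema for illustration; the field names are not the dataset's actual column names.

```python
from dataclasses import dataclass, field

# The five coding preferences covered by CodeUltraFeedback and CODAL-Bench.
PREFERENCES = [
    "instruction-following",
    "code-explanation",
    "code-complexity-and-efficiency",
    "code-readability",
    "coding-style",
]

@dataclass
class JudgedResponse:
    model: str      # one of the 14 LLMs in the response pool
    response: str   # the model's answer to the instruction
    rating: float   # numerical score assigned by the LLM judge
    feedback: str   # textual rationale accompanying the score

@dataclass
class CodeUltraFeedbackRecord:
    instruction: str  # one of the 10,000 coding instructions
    preference: str   # the preference dimension under evaluation
    responses: list[JudgedResponse] = field(default_factory=list)  # four entries
```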

The Significance of Coding Preferences

Coding preferences, often encompassing non-functional requirements, significantly influence the quality, maintainability, and performance of code. Yet, existing methodologies for evaluating LLMs largely overlook these aspects. This gap highlights the necessity for approaches tailored to measure and tune LLMs according to such preferences. By focusing on a diversified set of preferences, our work aims to bring LLMs closer to meeting developer expectations, thereby enhancing the utility of their generated code in practical scenarios.

Constructing CodeUltraFeedback

The creation of CodeUltraFeedback follows a multi-step process, starting with the definition of the coding preferences and their corresponding principles. The dataset pairs complex instructions with responses from 14 diverse LLMs across the defined preferences, annotated using an LLM-as-a-Judge approach with GPT-3.5 that yields both numerical ratings and textual feedback, providing a rich basis for understanding and improving LLM alignment. The construction methodology emphasizes diversity in LLM responses and a nuanced assessment of their alignment with coding preferences, setting the stage for comprehensive preference tuning.
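
As a rough illustration of the annotation step, the sketch below queries GPT-3.5 as a judge through the OpenAI Python SDK. The judge prompt, rating scale, and output parsing here are assumptions made for illustration and differ from the paper's actual prompt templates.

```python
# Minimal LLM-as-a-Judge sketch; assumes an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical judge prompt; the paper uses its own per-preference templates.
JUDGE_TEMPLATE = """You are a code-quality judge. Evaluate the response below \
against the preference "{preference}".
Instruction: {instruction}
Response: {response}
Reply with a rating from 1 to 10 on the first line, then a short justification."""

def judge(instruction: str, response: str, preference: str) -> tuple[int, str]:
    """Return a (rating, textual feedback) pair from the GPT-3.5 judge."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            preference=preference, instruction=instruction, response=response)}],
        temperature=0,
    )
    text = completion.choices[0].message.content
    first_line, _, feedback = text.partition("\n")
    return int(first_line.strip()), feedback.strip()  # naive parsing for brevity
```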

The Role of CODAL-Bench

CODAL-Bench is introduced as a means to thoroughly evaluate the alignment of LLMs with the defined coding preferences. Through a meticulous single-answer grading scheme, CODAL-Bench leverages advanced LLMs such as GPT-3.5-Turbo or GPT-4-Turbo as judges, offering a nuanced approach to benchmarking. This strategy moves beyond the limitations of automated metrics and external tools commonly used in other benchmarks, enabling a more refined evaluation of LLM-generated code from a human-centric perspective.
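
Single-answer grading can be pictured as the loop below: each model response is scored independently on every preference (no pairwise comparison) and the ratings are averaged per preference. It reuses the hypothetical `judge` helper from the previous sketch; the aggregation shown is an assumption, not necessarily the benchmark's exact protocol.

```python
from statistics import mean

def codal_bench_scores(model_outputs: dict[str, str],
                       instructions: dict[str, str],
                       preferences: list[str]) -> dict[str, float]:
    """Mean judge rating per preference for one model under evaluation.

    `judge` is the hypothetical single-answer scoring helper sketched earlier.
    """
    scores: dict[str, float] = {}
    for pref in preferences:
        ratings = [
            judge(instructions[task_id], output, pref)[0]
            for task_id, output in model_outputs.items()
        ]
        scores[pref] = mean(ratings)
    return scores
```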

Empirical Insights from Initial Experiments

Our exploration into CodeUltraFeedback's annotations underscores the robust judging capabilities of GPT-3.5-Turbo, which consistently recognized the superior quality of responses from LLMs like GPT-4-Turbo. These findings not only validate the efficacy of CodeUltraFeedback for preference tuning but also suggest an inherent lack of alignment in a majority of tested LLMs, including some of the more sophisticated models.

Advancing LLM Alignment with Coding Preferences

Further experiments demonstrate that tuning a smaller LLM, CodeLlama-7B-Instruct, using CodeUltraFeedback with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), significantly enhances alignment with coding preferences. This improvement is evident across all preferences on CODAL-Bench, outstripping larger LLMs and underscoring the potential of our approach in refining model alignment. Moreover, this alignment process also results in better functional correctness, as measured on benchmarks such as HumanEval+, showcasing the dual benefits of our tuning methodology.
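
Below is a minimal sketch of the DPO stage, assuming the TRL library and a preference dataset with prompt/chosen/rejected columns derived from the judge's ratings. The file name and hyperparameters are illustrative, and the paper's pipeline also includes an SFT stage not shown here.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Hypothetical JSON file of preference pairs built from CodeUltraFeedback,
# with "prompt", "chosen", and "rejected" fields per example.
dataset = load_dataset("json", data_files="codeultrafeedback_pairs.json")["train"]

model_name = "codellama/CodeLlama-7b-Instruct-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config = DPOConfig(
    output_dir="codellama-7b-instruct-dpo",
    beta=0.1,                        # weight of the preference margin vs. the reference model
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                 # ref_model defaults to a frozen copy of `model`
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
)
trainer.train()
```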

Conclusion and Outlook

By introducing CodeUltraFeedback and CODAL-Bench, our work takes a significant step toward addressing the challenges of aligning LLMs with coding preferences. The insights garnered from our empirical analyses affirm the utility of these resources in enhancing the capabilities of LLMs to meet developer expectations. As we look to the future, we envision expanded research into LLM tuning and evaluation methodologies, leveraging the foundational contributions of our work to foster further advancements in code intelligence.

Our materials, including models, datasets, benchmarks, and prompt templates, are openly available for researchers and practitioners interested in exploring and advancing the alignment of LLMs with coding preferences. We anticipate that the continued development and refinement of such resources will pave the way for more intuitive, efficient, and functionally robust code generation capabilities in LLMs.

Authors: Martin Weyssow, Aton Kamanda, Houari Sahraoui