LLM-Assisted Code Cleaning For Training Accurate Code Generators (2311.14904v1)

Published 25 Nov 2023 in cs.LG and cs.SE

Abstract: Natural language to code generation is an important application area of LLMs and has received wide attention from the community. The majority of relevant studies have concentrated exclusively on increasing the quantity and functional correctness of training sets while disregarding other stylistic elements of programs. More recently, data quality has garnered substantial interest, and multiple works have showcased its importance for improving performance. In this work, we investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs by (1) renaming variables, (2) modularizing and decomposing complex code into smaller helper sub-functions, and (3) inserting natural-language plans via LLM-based transformations. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on our transformed, modularized programs improves performance by up to 30% compared to fine-tuning on the original dataset. Additionally, we demonstrate improved performance from using a smaller amount of higher-quality data, finding that a model fine-tuned on the entire original dataset is outperformed by a model trained on 15% of our cleaned dataset. Even in comparison to closed-source models, our models outperform the much larger AlphaCode models.
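The three transformations described above (renaming variables, modularizing into helper sub-functions, and inserting natural-language plans) can be pictured as sequential LLM rewrites of each training program. The sketch below is illustrative only: the prompts, the `complete` helper, the `clean_program` function name, and the use of GPT-4 through the OpenAI client are assumptions made for this example, not the authors' implementation.

```python
# Illustrative sketch of a three-step LLM-based code-cleaning pipeline.
# All prompts and the model choice are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def complete(prompt: str) -> str:
    """Single LLM call; returns the model's code-only answer."""
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content


def clean_program(problem: str, program: str) -> str:
    # 1) Rename obscure variables to descriptive names.
    program = complete(
        f"Problem:\n{problem}\n\nRename the variables in this program to "
        f"descriptive names without changing its behavior. Return only code.\n\n{program}"
    )
    # 2) Decompose monolithic logic into small helper sub-functions.
    program = complete(
        "Refactor this program into small, well-named helper functions plus a "
        f"main function, preserving behavior. Return only code.\n\n{program}"
    )
    # 3) Insert a natural-language plan as comments/docstrings.
    program = complete(
        "Add a short natural-language plan as comments or docstrings describing "
        f"the steps of this solution. Return only code.\n\n{program}"
    )
    return program
```

In practice, each transformed program would be re-run against its problem's unit tests to confirm that behavior is preserved before the cleaned data is used for fine-tuning; that filtering step is omitted from the sketch.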

Authors (6)
  1. Naman Jain
  2. Tianjun Zhang
  3. Wei-Lin Chiang
  4. Joseph E. Gonzalez
  5. Koushik Sen
  6. Ion Stoica