
CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules (2310.08992v3)

Published 13 Oct 2023 in cs.AI, cs.CL, and cs.PL

Abstract: LLMs have become quite proficient at solving simpler programming tasks, such as those in the HumanEval and MBPP benchmarks. However, solving more complex and competitive programming tasks remains challenging for these models, possibly because of their tendency to generate solutions as monolithic code blocks instead of decomposing them into logical sub-tasks and sub-modules. Experienced programmers, by contrast, instinctively write modularized code with abstraction for solving complex tasks, often reusing previously developed modules. To address this gap, we propose CodeChain, a novel inference framework that elicits modularized code generation through a chain of self-revisions, each guided by representative sub-modules generated in previous iterations. Concretely, CodeChain first instructs the LLM to generate modularized code through chain-of-thought prompting. It then applies a chain of self-revisions by iterating two steps: 1) extracting and clustering the generated sub-modules and selecting the cluster representatives as the more generic and reusable implementations, and 2) augmenting the original chain-of-thought prompt with these selected module implementations and instructing the LLM to re-generate new modularized solutions. We find that by naturally encouraging the LLM to reuse previously developed and verified sub-modules, CodeChain can significantly boost both the modularity and the correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests. It is shown to be effective on both OpenAI LLMs and open-source LLMs such as WizardCoder. We also conduct comprehensive ablation studies with different prompting methods, numbers of clusters, model sizes, program qualities, etc., to provide useful insights that underpin CodeChain's success.

Introduction to CodeChain

The process of writing high-quality computer programs often involves breaking complex tasks down into smaller, more manageable components called sub-modules, essentially crafting a solution piece by piece. Human developers use this paradigm routinely, but it has proven notably challenging for LLMs. The paper introduces CodeChain, a novel framework designed to elicit a similarly modular approach to code generation from LLMs. It strategically prompts these models to decompose tasks into sub-modules and to revise and improve them iteratively until they form a comprehensive solution.

Modularity in AI-Generated Code

The framework starts by prompting an LLM to outline a problem solution in sub-modules using chain-of-thought (CoT) prompting. Because models are not innately trained to produce well-modularized structures, such prompting alone can actually reduce the correctness of the generated solutions. CodeChain therefore introduces an iterative process of self-revision: from the initial outputs, a selection of sub-modules is chosen based on their potential for reuse and generic applicability, and these sub-modules then seed a new generation round, prompting the LLM to produce improved, modularized solutions. A minimal sketch of this loop follows.
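
To make the cycle concrete, here is a minimal Python sketch of the self-revision loop described above. It is an illustration under stated assumptions, not the paper's implementation: the injected callables (`generate`, `extract`, `select`), the prompt wording, and the default `rounds` and `samples` values are hypothetical placeholders for the LLM sampler, the sub-module parser, and the clustering step.

```python
from typing import Callable

def codechain(
    problem: str,
    generate: Callable[[str, int], list[str]],  # LLM sampler: (prompt, n) -> n candidate programs
    extract: Callable[[list[str]], list[str]],  # pulls helper functions out of generated programs
    select: Callable[[list[str]], list[str]],   # clusters sub-modules, returns representatives
    rounds: int = 5,    # number of self-revision iterations (hypothetical default)
    samples: int = 20,  # candidates sampled per round (hypothetical default)
) -> list[str]:
    """Run one chain of self-revisions and return the final candidate solutions."""
    # Initial CoT-style instruction asking for a modular solution.
    prompt = (
        f"{problem}\n\n"
        "First outline the required sub-modules as function headers with "
        "docstrings, then implement each one, then compose them into a solution."
    )
    solutions = generate(prompt, samples)
    for _ in range(rounds):
        # Select generic, reusable sub-modules from the previous round...
        representatives = select(extract(solutions))
        # ...and feed them back into the prompt for the next generation round.
        revised = (
            prompt
            + "\n\nRelevant sub-modules from earlier attempts; reuse or adapt them:\n\n"
            + "\n\n".join(representatives)
        )
        solutions = generate(revised, samples)
    return solutions
```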

The Chain of Self-Revisions

A key element of CodeChain is its method of extracting sub-modules from generated code, clustering them, and carrying the most representative members of each cluster into subsequent revisions. This iterative clustering and self-refinement encourages the model to internalize and build upon the most reusable code components. The framework thus provides a form of iterative learning that mirrors how experienced developers work, refining, debugging, and reusing portions of code until a satisfactory solution is achieved.
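
The extraction-and-clustering step can be sketched as follows. The paper embeds sub-modules and clusters them to find centroid-like representatives; this sketch substitutes a simple TF-IDF vectorizer for a learned code embedding so it runs with only scikit-learn installed, and `k` is an illustrative cluster count rather than the paper's tuned value.

```python
import ast

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_functions(program: str) -> list[str]:
    """Pull every top-level function definition out of one generated program."""
    try:
        tree = ast.parse(program)
    except SyntaxError:  # discard candidates that do not even parse
        return []
    return [ast.unparse(n) for n in tree.body if isinstance(n, ast.FunctionDef)]

def representative_submodules(programs: list[str], k: int = 5) -> list[str]:
    """Cluster all extracted sub-modules; return the member nearest each centroid."""
    submodules = [fn for prog in programs for fn in extract_functions(prog)]
    if len(submodules) <= k:
        return submodules
    # TF-IDF stands in here for the pretrained code embeddings used in the paper.
    vectors = TfidfVectorizer().fit_transform(submodules)
    km = KMeans(n_clusters=k, n_init=10).fit(vectors)
    reps = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(
            vectors[members].toarray() - km.cluster_centers_[c], axis=1
        )
        reps.append(submodules[members[np.argmin(dists)]])
    return reps
```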

Results and Insights

Extensive experiments applying CodeChain to various LLMs, including OpenAI's models and the open-source WizardCoder, demonstrated significant gains in both the modularity and the correctness of the generated code. CodeChain showed marked improvements over conventional prompting baselines, particularly on challenging coding tasks. Ablation studies further underscored the roles of the clustering-based selection process and the revision chain in improving the generated code.
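
For reference, the pass@1 numbers behind the reported 35% and 76% relative gains use the standard unbiased pass@k estimator from the HumanEval evaluation methodology: sample n candidate programs per problem, count the c that pass all unit tests, and estimate the probability that at least one of k random picks is correct. A minimal version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn, c of them correct.

    Probability that at least one of k samples chosen uniformly
    without replacement passes all tests.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 20 samples on a problem, 3 of them correct -> pass@1 = 3/20
assert abs(pass_at_k(20, 3, 1) - 0.15) < 1e-12
```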

In conclusion, CodeChain opens up new possibilities for advanced, modular code generation by LLMs, reflecting a more human-like approach to problem-solving in programming. The framework's ability to guide LLMs toward increasingly modularized, correct, and sophisticated code represents a significant stride in AI-driven code generation.

Authors (6)
  1. Hung Le
  2. Hailin Chen
  3. Amrita Saha
  4. Akash Gokul
  5. Doyen Sahoo
  6. Shafiq Joty