Semi-Instruct: Bridging Natural-Instruct and Self-Instruct for Code Large Language Models (2403.00338v1)

Published 1 Mar 2024 in cs.CL

Abstract: Instruction tuning plays a pivotal role in Code Large Language Models (Code LLMs) for the task of program synthesis. Presently, two dominant paradigms for collecting tuning data are natural-instruct (human-written) and self-instruct (automatically generated). Natural-instruct includes diverse and correct codes but lacks instruction-code pairs and contains improper code formats, such as nested single-line codes. In contrast, self-instruct automatically generates properly paired data, but it suffers from low diversity due to duplicate generations and cannot ensure the correctness of its codes. To bridge the two paradigms, we propose Semi-Instruct. It first converts diverse but improper codes from natural-instruct into proper instruction-code pairs through a method similar to self-instruct. To verify the correctness of the generated codes, we design a novel way to construct test cases: generating the cases' inputs and executing the correct codes from natural-instruct to obtain the expected outputs. Finally, diverse and correct instruction-code pairs are retained for instruction tuning. Experiments show that Semi-Instruct is significantly better than natural-instruct and self-instruct, and performance steadily improves as the data scale increases.
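The verification step the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `build_test_cases` plays the role of deriving expected outputs by executing the known-correct natural-instruct code on generated inputs, and `passes_all` filters a model-generated candidate against those cases. All function names and the toy input generator are assumptions for illustration.

```python
import random

def build_test_cases(reference_fn, input_generator, n_cases=5):
    """Construct test cases by running the known-correct code on
    generated inputs to obtain the expected outputs."""
    cases = []
    for _ in range(n_cases):
        x = input_generator()
        cases.append((x, reference_fn(x)))
    return cases

def passes_all(candidate_fn, cases):
    """Retain a generated instruction-code pair only if the candidate
    reproduces every expected output of the reference code."""
    return all(candidate_fn(x) == expected for x, expected in cases)

# Toy example: the "correct" natural-instruct code sorts a list;
# a model-generated candidate is checked against the derived cases.
reference = sorted
candidate = lambda xs: sorted(xs)   # stands in for model-generated code
make_input = lambda: [random.randint(0, 9) for _ in range(4)]

cases = build_test_cases(reference, make_input)
print(passes_all(candidate, cases))  # True: candidate matches reference
```

In the actual pipeline, only candidates that pass all constructed cases would be kept as correct instruction-code pairs for tuning.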

Authors (7)
  1. Xianzhen Luo (11 papers)
  2. Qingfu Zhu (39 papers)
  3. Zhiming Zhang (17 papers)
  4. Xu Wang (319 papers)
  5. Qing Yang (138 papers)
  6. Dongliang Xu (19 papers)
  7. Wanxiang Che (152 papers)
Citations (1)