
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct (2407.05700v2)

Published 8 Jul 2024 in cs.CL, cs.AI, and cs.SE

Abstract: Recent advancements in open-source code LLMs have been driven by fine-tuning on data generated by powerful closed-source LLMs, which is expensive to obtain. This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its instruction-tuning dataset. We make two observations: (1) a code snippet can serve as the response to different instructions; (2) instruction-tuned code LLMs perform better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions for the code responses in its own training dataset. The additional instruction-response pairs are added to the original dataset, and a stronger code LLM can be obtained by fine-tuning on the augmented dataset. We empirically validate Inverse-Instruct on a range of open-source code models (e.g., CodeLlama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000 and MultiPL-E), showing it consistently improves the base models.
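
The augmentation loop described in the abstract is simple enough to sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' released implementation: the `generate` callable, the prompt wording, and the yes/no self-filtering rule are all placeholders for whatever sampling and filtering the paper actually uses.

```python
# Minimal sketch of an Inverse-Instruct-style augmentation loop.
# All prompts and the `generate` callable are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pair:
    instruction: str
    response: str  # a code snippet


def inverse_instruct(
    generate: Callable[[str], str],  # samples text from the fine-tuned code LLM
    seed_data: list[Pair],
) -> list[Pair]:
    augmented: list[Pair] = []
    for pair in seed_data:
        # Code -> instruction: the "inverse" direction the paper observes
        # instruction-tuned code LLMs handle better than instruction -> code.
        new_instruction = generate(
            "Summarize the following code as a concise programming instruction:\n"
            + pair.response
        )
        # Self-filtering: keep only pairs the model itself judges consistent
        # (the exact filtering criterion is an assumption here).
        verdict = generate(
            "Does this code correctly solve the instruction? Answer yes or no.\n"
            f"Instruction: {new_instruction}\nCode:\n{pair.response}"
        )
        if verdict.strip().lower().startswith("yes"):
            augmented.append(Pair(new_instruction, pair.response))
    # The union of original and synthesized pairs is what the base model is
    # fine-tuned on next (fine-tuning itself is out of scope for this sketch).
    return seed_data + augmented
```

Fine-tuning the base model on the returned original-plus-augmented pairs is what, per the abstract, yields the stronger code LLM evaluated on HumanEval(+), MBPP(+), DS-1000 and MultiPL-E.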

References (63)
  1. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  2. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  3. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, 2022.
  4. Improving image generation with better captions. https://api.semanticscholar.org/CorpusID:264403242.
  5. MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 49:3675–3691, 2023.
  6. Sahil Chaudhary. Code Alpaca: An instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca, 2023.
  7. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  8. Self-play fine-tuning converts weak language models to strong language models, 2024.
  9. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24:240:1–240:113, 2022.
  10. PanGu-Coder: Program synthesis with function-level language modeling. arXiv preprint arXiv:2207.11280, 2022.
  11. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
  12. StepCoder: Improve code generation with reinforcement learning from compiler feedback, 2024.
  13. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999, 2022.
  14. Learning instructions with unlabeled data for zero-shot cross-task generalization. In Conference on Empirical Methods in Natural Language Processing, 2022.
  15. DeepSeek-Coder: When the large language model meets programming - the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
  16. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.
  17. Language models can teach themselves to program better. arXiv preprint arXiv:2207.14502, 2022.
  18. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023.
  19. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023.
  20. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In Advances in Neural Information Processing Systems, 2022.
  21. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
  22. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259, 2023.
  23. Competition-level code generation with AlphaCode. Science, 378:1092–1097, 2022.
  24. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023.
  25. RLTF: Reinforcement learning from unit test feedback. arXiv preprint arXiv:2307.04349, 2023.
  26. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023.
  27. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
  28. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv preprint arXiv:2306.08568, 2023.
  29. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.
  30. Microsoft. GitHub Copilot – your AI pair programmer. https://github.com/features/copilot, 2023.
  31. OctoPack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023.
  32. LLMs for science: Usage for code generation and data analysis. arXiv preprint arXiv:2311.16733, 2023.
  33. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
  34. OpenAI. ChatGPT: Optimizing language models for dialogue, 2022.
  35. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  36. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  37. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  38. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235, 2018.
  39. PanGu-Coder2: Boosting large language models for code with ranking feedback, 2023.
  40. Execution-based code generation using deep reinforcement learning. arXiv preprint arXiv:2301.13816, 2023.
  41. Learning performance-improving code edits. In The Twelfth International Conference on Learning Representations, 2024. https://openreview.net/forum?id=ix7rLVHXyY.
  42. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741, 2022.
  43. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047, 2023.
  44. ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.
  45. WorldCoder, a model-based LLM agent: Building world models by writing code and interacting with the environment. arXiv preprint arXiv:2402.12275, 2024.
  46. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  47. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  48. theblackcat102. The evolved Code Alpaca dataset. https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1, 2023.
  49. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
  50. A survey on data selection for LLM instruction tuning. arXiv preprint arXiv:2402.05123, 2024.
  51. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  52. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Conference on Empirical Methods in Natural Language Processing, 2021.
  53. CodeT5+: Open code large language models for code understanding and generation. In Conference on Empirical Methods in Natural Language Processing, 2023.
  54. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  55. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120, 2023.
  56. StarCoder2-Instruct: Fully transparent and permissive self-alignment for code generation. https://github.com/bigcode-project/starcoder2-self-align, 2024.
  57. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  58. Dynosaur: A dynamic growth paradigm for instruction-tuning data curation. In Conference on Empirical Methods in Natural Language Processing, 2023.
  59. WaveCoder: Widespread and versatile enhanced instruction tuning with refined data generation. arXiv preprint arXiv:2312.14187, 2023.
  60. Self-rewarding language models, 2024.
  61. AutoMathText: Autonomous data selection with language models for mathematical texts. arXiv preprint arXiv:2402.07625, 2024.
  62. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. arXiv preprint arXiv:2303.17568, 2023.
  63. OpenCodeInterpreter: Integrating code generation with execution and refinement. arXiv preprint arXiv:2402.14658, 2024.
Authors (16)
  1. Yutong Wu (25 papers)
  2. Di Huang (203 papers)
  3. Wenxuan Shi (7 papers)
  4. Wei Wang (1793 papers)
  5. Lingzhe Gao (1 paper)
  6. Shihao Liu (10 papers)
  7. Ziyuan Nan (5 papers)
  8. Kaizhao Yuan (3 papers)
  9. Rui Zhang (1138 papers)
  10. Xishan Zhang (22 papers)
  11. Zidong Du (41 papers)
  12. Qi Guo (237 papers)
  13. Yewen Pu (27 papers)
  14. Dawei Yin (165 papers)
  15. Xing Hu (122 papers)
  16. Yunji Chen (51 papers)