InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct (2407.05700v2)
Abstract: Recent advancements in open-source code LLMs have been driven by fine-tuning on data generated by powerful closed-source LLMs, which is expensive to obtain. This paper explores whether a fine-tuned open-source model can generate additional data to augment its own instruction-tuning dataset. We make two observations: (1) a code snippet can serve as the response to different instructions, and (2) instruction-tuned code LLMs are better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique in which a fine-tuned LLM generates additional instructions for the code responses in its own training dataset. The new instruction-response pairs are added to the original dataset, and fine-tuning on the augmented dataset yields a stronger code LLM. We empirically validate Inverse-Instruct on a range of open-source code models (e.g., CodeLlama-Python and DeepSeek-Coder) and benchmarks (HumanEval(+), MBPP(+), DS-1000, and MultiPL-E), showing that it consistently improves the base models.
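At its core, Inverse-Instruct is a single code-to-instruction pass over the existing training data, followed by another round of fine-tuning on the enlarged dataset. The sketch below illustrates that loop in Python as a minimal, hedged example: the `generate` helper, the prompt wording, and the data format are illustrative assumptions, not the authors' implementation, which also filters low-quality generations.

```python
# Minimal sketch of the Inverse-Instruct augmentation loop described in the
# abstract. `generate` is a hypothetical helper that queries the fine-tuned
# code LLM; prompt text and data format are illustrative assumptions.

from typing import Callable, Dict, List

Pair = Dict[str, str]  # {"instruction": ..., "response": ...}


def inverse_instruct(
    seed_pairs: List[Pair],
    generate: Callable[[str], str],
) -> List[Pair]:
    """Augment an instruction-tuning dataset by translating code responses
    back into new instructions with the fine-tuned model itself."""
    augmented: List[Pair] = list(seed_pairs)
    for pair in seed_pairs:
        code = pair["response"]
        # Ask the model for a natural-language instruction that the code answers
        # (code-to-instruction is the direction the model handles well).
        prompt = (
            "Write a concise programming instruction that the following "
            f"code correctly answers:\n\n{code}\n\nInstruction:"
        )
        new_instruction = generate(prompt).strip()
        # Pair the generated instruction with the original code response.
        # The paper additionally filters low-quality generations at this step.
        if new_instruction:
            augmented.append({"instruction": new_instruction, "response": code})
    return augmented


# The augmented dataset is then used to fine-tune the base model again,
# producing the stronger InverseCoder model.
```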
- Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, 2022.
- Improving image generation with better captions, 2023. URL https://api.semanticscholar.org/CorpusID:264403242.
- Multipl-e: A scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 49:3675–3691, 2023. URL https://api.semanticscholar.org/CorpusID:258205341.
- Sahil Chaudhary. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Self-play fine-tuning converts weak language models to strong language models, 2024.
- Palm: Scaling language modeling with pathways. J. Mach. Learn. Res., 24:240:1–240:113, 2022. URL https://api.semanticscholar.org/CorpusID:247951931.
- Pangu-coder: Program synthesis with function-level language modeling. ArXiv, abs/2207.11280, 2022. URL https://api.semanticscholar.org/CorpusID:251040785.
- Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
- Stepcoder: Improve code generation with reinforcement learning from compiler feedback, 2024.
- Incoder: A generative model for code infilling and synthesis. ArXiv, abs/2204.05999, 2022. URL https://api.semanticscholar.org/CorpusID:248157108.
- Learning instructions with unlabeled data for zero-shot cross-task generalization. In Conference on Empirical Methods in Natural Language Processing, 2022. URL https://api.semanticscholar.org/CorpusID:252918165.
- Deepseek-coder: When the large language model meets programming - the rise of code intelligence. ArXiv, abs/2401.14196, 2024. URL https://api.semanticscholar.org/CorpusID:267211867.
- Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.
- Language models can teach themselves to program better. ArXiv, abs/2207.14502, 2022. URL https://api.semanticscholar.org/CorpusID:251197051.
- Efficient memory management for large language model serving with pagedattention. Proceedings of the 29th Symposium on Operating Systems Principles, 2023. URL https://api.semanticscholar.org/CorpusID:261697361.
- Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023.
- Coderl: Mastering code generation through pretrained models and deep reinforcement learning. In NeurIPS, 2022.
- Starcoder: may the source be with you! ArXiv, abs/2305.06161, 2023a. URL https://api.semanticscholar.org/CorpusID:258588247.
- Self-alignment with instruction backtranslation. ArXiv, abs/2308.06259, 2023b. URL https://api.semanticscholar.org/CorpusID:260866107.
- Competition-level code generation with alphacode. Science, 378:1092–1097, 2022. URL https://api.semanticscholar.org/CorpusID:246527904.
- Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023.
- Rltf: Reinforcement learning from unit test feedback. ArXiv, abs/2307.04349, 2023a. URL https://api.semanticscholar.org/CorpusID:259501019.
- Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. ArXiv, abs/2305.01210, 2023b. URL https://api.semanticscholar.org/CorpusID:258437095.
- Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
- Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.
- Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.
- Microsoft. Github copilot – your ai pair programmer. https://github.com/features/copilot, 2023.
- Octopack: Instruction tuning code large language models. ArXiv, abs/2308.07124, 2023. URL https://api.semanticscholar.org/CorpusID:260886874.
- Llms for science: Usage for code generation and data analysis. arXiv preprint arXiv:2311.16733, 2023.
- Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
- OpenAI. Chatgpt: Optimizing language models for dialogue, 2022.
- OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
- Code llama: Open foundation models for code. ArXiv, abs/2308.12950, 2023. URL https://api.semanticscholar.org/CorpusID:261100919.
- Adafactor: Adaptive learning rates with sublinear memory cost. ArXiv, abs/1804.04235, 2018. URL https://api.semanticscholar.org/CorpusID:4786918.
- Pangu-coder2: Boosting large language models for code with ranking feedback, 2023.
- Execution-based code generation using deep reinforcement learning. ArXiv, abs/2301.13816, 2023. URL https://api.semanticscholar.org/CorpusID:256416258.
- Learning performance-improving code edits. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ix7rLVHXyY.
- One embedder, any task: Instruction-finetuned text embeddings. ArXiv, abs/2212.09741, 2022. URL https://api.semanticscholar.org/CorpusID:254853816.
- Principle-driven self-alignment of language models from scratch with minimal human supervision. ArXiv, abs/2305.03047, 2023. URL https://api.semanticscholar.org/CorpusID:258479665.
- Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.
- Worldcoder, a model-based llm agent: Building world models by writing code and interacting with the environment. arXiv preprint arXiv:2402.12275, 2024.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- theblackcat102. The evolved code alpaca dataset. https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1, 2023.
- Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.
- A survey on data selection for llm instruction tuning. ArXiv, abs/2402.05123, 2024. URL https://api.semanticscholar.org/CorpusID:267547917.
- Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
- Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In EMNLP, 2021.
- Codet5+: Open code large language models for code understanding and generation. In Conference on Empirical Methods in Natural Language Processing, 2023b. URL https://api.semanticscholar.org/CorpusID:258685677.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120, 2023.
- Starcoder2-instruct: Fully transparent and permissive self-alignment for code generation, 2024. URL https://github.com/bigcode-project/starcoder2-self-align.
- Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- Dynosaur: A dynamic growth paradigm for instruction-tuning data curation. In Conference on Empirical Methods in Natural Language Processing, 2023. URL https://api.semanticscholar.org/CorpusID:258841263.
- Wavecoder: Widespread and versatile enhanced instruction tuning with refined data generation. arXiv preprint arXiv:2312.14187, 2023.
- Self-rewarding language models, 2024.
- Automathtext: Autonomous data selection with language models for mathematical texts. arXiv preprint arXiv:2402.07625, 2024.
- Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. ArXiv, abs/2303.17568, 2023. URL https://api.semanticscholar.org/CorpusID:257834177.
- Opencodeinterpreter: Integrating code generation with execution and refinement. arXiv preprint arXiv:2402.14658, 2024.
Authors:
- Yutong Wu
- Di Huang
- Wenxuan Shi
- Wei Wang
- Lingzhe Gao
- Shihao Liu
- Ziyuan Nan
- Kaizhao Yuan
- Rui Zhang
- Xishan Zhang
- Zidong Du
- Qi Guo
- Yewen Pu
- Dawei Yin
- Xing Hu
- Yunji Chen