Mastering the Craft of Data Synthesis for CodeLLMs
Published 16 Oct 2024 in cs.SE and cs.AI (arXiv:2411.00005v3)
Abstract: LLMs have shown impressive performance in *code* understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.
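To make the abstract's mention of "data synthesis and filtering" concrete, here is a minimal, hypothetical sketch (not taken from the paper) of a pattern common in this literature: a generator model proposes instruction/solution/test triples from seed tasks, and an execution-based filter keeps only samples whose solution passes its own tests. The data schema and the `generate_candidates` stub are assumptions; any model or API could be plugged in.

```python
# Minimal sketch (assumption, not the paper's method): synthesize candidate
# coding tasks with a generator model, then keep only the samples whose
# solution passes its own generated unit tests (an execution-based filter).

import subprocess
import sys
import tempfile
import textwrap


def generate_candidates(seed_task: str, n: int) -> list[dict]:
    """Placeholder for an LLM call returning dicts with 'instruction',
    'solution', and 'test' fields (hypothetical schema)."""
    raise NotImplementedError("plug in a generator model or API here")


def passes_tests(solution: str, test: str, timeout: int = 10) -> bool:
    """Run solution + tests in a subprocess; keep the sample only if it exits 0."""
    program = textwrap.dedent(solution) + "\n\n" + textwrap.dedent(test)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def synthesize(seed_tasks: list[str], per_seed: int = 4) -> list[dict]:
    """Generate candidates per seed task and retain only execution-verified ones."""
    kept = []
    for seed in seed_tasks:
        for cand in generate_candidates(seed, per_seed):
            if passes_tests(cand["solution"], cand["test"]):
                kept.append(cand)  # survives the execution filter
    return kept
```

The execution filter is one of several filtering strategies surveyed (others include deduplication, quality scoring, and LLM-as-a-judge); it is shown here only because it is simple to express in code.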