Mastering the Craft of Data Synthesis for CodeLLMs

Published 16 Oct 2024 in cs.SE and cs.AI (arXiv:2411.00005v3)

Abstract: LLMs have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.
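
To make the survey's two pillars concrete, below is a minimal illustrative sketch (ours, not the paper's) of a common synthesize-then-filter loop: an LLM turns seed code into an instruction-solution pair, and an execution-based filter keeps only samples whose solutions pass tests. The `call_llm` stub is a hypothetical placeholder for any model API; its canned response lets the sketch run offline.

```python
"""Minimal sketch of a synthesize-then-filter data pipeline for code LLMs.

Assumptions (not from the paper): `call_llm` is a hypothetical stand-in
for a real LLM API client; the canned response below makes the script
runnable end to end without network access.
"""


def call_llm(prompt: str) -> str:
    # Hypothetical LLM call; replace with a real API client.
    return (
        "def add(a, b):\n"
        "    return a + b\n"
    )


def synthesize_sample(seed_snippet: str) -> dict:
    """Turn a seed code snippet into an instruction-solution pair."""
    prompt = f"Write a self-contained function inspired by:\n{seed_snippet}"
    return {"instruction": prompt, "solution": call_llm(prompt)}


def passes_execution_filter(sample: dict, tests: str) -> bool:
    """Execution-based filtering: keep a sample only if its solution passes tests."""
    namespace: dict = {}
    try:
        exec(sample["solution"], namespace)  # compile and run the candidate
        exec(tests, namespace)               # run the checks against it
        return True
    except Exception:
        return False


if __name__ == "__main__":
    seed = "x = 1 + 2"
    sample = synthesize_sample(seed)
    tests = "assert add(2, 3) == 5"
    kept = [sample] if passes_execution_filter(sample, tests) else []
    print(f"kept {len(kept)} of 1 synthesized samples")
```

In practice the filter stage would run synthesized code in a sandboxed subprocess with a timeout rather than a bare `exec`, since model-generated code is untrusted.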

