Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining (2409.02326v1)

Published 3 Sep 2024 in cs.CL and cs.AI

Abstract: Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of LLMs. However, the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.

An Insightful Overview of "Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining"

Introduction

"Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining" introduces Arctic-SnowCoder-1.3B, a sophisticated small code model that underscores the pivotal role of data quality in pretraining efforts. This paper provides a comprehensive analysis of how progressively refined data can enhance the performance of code models, presenting Arctic-SnowCoder as a landmark in the domain of data-efficient pretraining.

Methodology

The paper proposes a three-phase training methodology:

  1. General Pretraining: The first phase trains Arctic-SnowCoder on 500B tokens of standard-quality code, sourced primarily from cleaned versions of The Stack v1 and GitHub crawls and preprocessed with basic filtering, deduplication, and decontamination. Files are grouped into repository-level documents partitioned by programming language, which empirically outperforms grouping by repository name alone.
  2. Continued Pretraining with High-Quality Data: The second phase continues training on 50B high-quality tokens selected from the phase-one corpus. A BERT-based quality annotator, trained to distinguish good code from random pretraining files using high-quality open-source code plus instruction data from Magicoder and StarCoder2-Instruct as positive examples, scores the corpus so that only the top-ranked files are kept. Aligning this selected data with downstream task distributions is crucial to the gains that follow (a minimal sketch of such a scoring step appears after this list).
  3. Enhanced Pretraining with Synthetic Data: In the final phase, 5B tokens of synthetic data are generated by Llama-3.1-70B, using the phase-two high-quality data as seeds. This adapts the Magicoder OSS-Instruct methodology to produce problem-solving-oriented code documents, further improving model performance (a sketch of this seeded generation step also follows the list).
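
To make the phase-2 selection concrete, the snippet below is a minimal, hypothetical sketch of how a BERT-style quality annotator could be used to rank and filter code files. The model name, keep fraction, and helper functions are illustrative assumptions, not the paper's released artifacts; in particular, the paper trains its own annotator rather than using an off-the-shelf checkpoint.

```python
# Hypothetical sketch of the phase-2 selection step: score code files with a
# BERT-style quality classifier and keep only the top-scoring slice.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; stands in for a classifier fine-tuned to separate
# high-quality code (plus instruction data) from random pretraining files.
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def quality_score(code: str) -> float:
    """Return the predicted probability that a code file is 'high quality'."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def select_top_files(files: list[str], keep_fraction: float = 0.1) -> list[str]:
    """Rank files by predicted quality and keep a small top fraction
    (the fraction here is a placeholder, not the paper's exact ratio)."""
    ranked = sorted(files, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]
```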

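The phase-3 generation step can be pictured as the seeded rewriting routine below. This is a sketch under the assumption of a generic text-generation callable; the prompt wording and the `generate` hook are illustrative adaptations of the Magicoder OSS-Instruct idea, and the exact prompts and serving setup used with Llama-3.1-70B are not reproduced here.

```python
# Hypothetical sketch of the phase-3 step: use a high-quality phase-2 file as a
# seed and ask a large model (the paper uses Llama-3.1-70B) to write a fresh,
# self-contained, problem-solving-oriented code document.
from typing import Callable

PROMPT_TEMPLATE = """You are given a snippet of real-world code as inspiration:

{seed}

Write a new, self-contained, high-quality document that poses a related
programming problem and solves it with clean, well-commented code."""

def synthesize_document(seed_snippet: str, generate: Callable[[str], str]) -> str:
    """Turn one seed snippet into one synthetic pretraining document.

    `generate` is any text-generation function (for example, a wrapper around
    a hosted Llama-3.1-70B endpoint); it is deliberately left abstract here.
    """
    prompt = PROMPT_TEMPLATE.format(seed=seed_snippet)
    return generate(prompt)
```
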
Results

Arctic-SnowCoder-1.3B demonstrates impressive performance across several benchmarks:

  • BigCodeBench: The model achieves state-of-the-art results among similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%.
  • HumanEval+ and MBPP+: The model matches or surpasses models trained on significantly more extensive datasets, such as StarCoder2-3B and Qwen1.5-1.8B.
  • EvoEval: Arctic-SnowCoder remains competitive, showcasing its robustness in diverse practical and challenging programming tasks.

The paper's comprehensive analysis underscores the advantages of three-phase pretraining, revealing consistent improvements across all training stages.

Ablation Studies

The paper includes various ablation studies that validate the design choices:

  • Repo-Level Data Grouping: Grouping file-level data into repository-level documents, partitioned by programming language, significantly outperforms grouping by repository name alone.
  • Quality Annotator: The best-performing model-based quality annotator combines high-quality code files with instruction data, highlighting the importance of data alignment with downstream applications.
  • Learning Rate Schedule: A re-warmup schedule that ramps the learning rate back up to the maximum used in general pretraining and then decays it linearly proves most effective (see the sketch after this list).
  • Repetitions of High-Quality Data: Repeating high-quality tokens four times during continued pretraining yields the best overall performance, emphasizing the necessity of optimizing repetitions.
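
As referenced in the learning-rate ablation above, the re-warmup schedule can be sketched as a simple piecewise-linear function: ramp back to the pretraining maximum, then decay linearly. The step counts and peak value below are placeholders, not the paper's hyperparameters.

```python
# Minimal sketch of a re-warmup learning-rate schedule for continued pretraining.
def rewarmup_lr(step: int, warmup_steps: int, total_steps: int, max_lr: float) -> float:
    if step < warmup_steps:
        # Ramp linearly from ~0 back up to the maximum pretraining LR.
        return max_lr * step / max(1, warmup_steps)
    # Then decay linearly from max_lr down to 0 over the remaining steps.
    remaining = total_steps - warmup_steps
    progress = (step - warmup_steps) / max(1, remaining)
    return max_lr * max(0.0, 1.0 - progress)

# Example: shape of a short continued-pretraining run with re-warmup
# (placeholder values chosen only for illustration).
schedule = [rewarmup_lr(s, warmup_steps=100, total_steps=1000, max_lr=3e-4)
            for s in range(1000)]
```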

Contributions and Implications

The contributions of this research are manifold:

  • Introduction of Arctic-SnowCoder-1.3B, a high-performing small code model trained on a fraction of the tokens used by other state-of-the-art models.
  • Demonstration that high-quality and synthetic data significantly enhance model performance, even when sourced from the same raw corpus.
  • Detailed analysis and practical insights into optimal design choices for repo-level data grouping, learning rate schedules, and repetitions of high-quality data.

Future Developments

The findings elucidate the significance of data quality in pretraining LLMs, particularly in specialized domains like code. Future research could focus on further refining the methodologies for identifying and generating high-quality synthetic data. Additionally, exploring the integration of Arctic-SnowCoder's training methodologies with larger and more diverse datasets may yield even greater performance improvements.

Conclusion

"Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining" offers a robust framework for understanding and leveraging data quality in pretraining code models. The phased approach to data refinement not only enhances performance but also provides a blueprint for future advancements in AI-driven code generation. The paper stands as a testament to the importance of aligning pretraining data distributions with downstream task requirements, paving the way for more efficient and effective model development in the field of AI.

References (46)
  1. Phi-3 technical report: A highly capable language model locally on your phone, 2024.
  2. Smollm - blazingly fast and remarkably powerful. https://huggingface.co/blog/smollm, 2024.
  3. To code, or not to code? exploring impact of code in pre-training, 2024.
  4. Program synthesis with large language models, 2021.
  5. Enriching word vectors with subword information, 2017.
  6. Andrew P. Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
  7. Evaluating large language models trained on code, 2021.
  8. Software heritage: Why and how to preserve software source code. In iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 2017. https://hal.archives-ouvertes.fr/hal-01590958.
  9. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024.
  10. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence, 2024.
  11. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
  12. The llama 3 herd of models, 2024.
  13. Textbooks are all you need, 2023.
  14. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.
  15. Mixtral of experts, 2024.
  16. Adam: A method for stochastic optimization, 2017.
  17. The stack: 3 tb of permissively licensed source code, 2022.
  18. Datacomp-lm: In search of the next generation of training sets for language models, 2024.
  19. Starcoder: may the source be with you!, 2023.
  20. Textbooks are all you need ii: phi-1.5 technical report, 2023.
  21. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 21558–21572. Curran Associates, Inc., 2023.
  22. Roberta: A robustly optimized bert pretraining approach, 2019.
  23. Starcoder 2 and the stack v2: The next generation, 2024.
  24. At which training stage does code data help LLMs reasoning? In The Twelfth International Conference on Learning Representations, 2024.
  25. Language models of code are few-shot commonsense learners. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1384–1403, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
  26. Arctic-embed: Scalable, efficient, and accurate text embedding models, 2024.
  27. Granite code models: A family of open foundation models for code intelligence, 2024.
  28. Codegen: An open large language model for code with multi-turn program synthesis. In International Conference on Learning Representations, 2023.
  29. OpenAI. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/, 2022.
  30. The fineweb datasets: Decanting the web for the finest text data at scale, 2024.
  31. Stable code technical report, 2024.
  32. Snowflake AI Research. Snowflake arctic: The best llm for enterprise ai — efficiently intelligent, truly open, 2024.
  33. Code llama: Open foundation models for code, 2024.
  34. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.
  35. Roformer: Enhanced transformer with rotary position embedding, 2023.
  36. Codegemma: Open code models based on gemma, 2024.
  37. Teknium. Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants. https://huggingface.co/datasets/teknium/OpenHermes2.5, 2023.
  38. Llama 2: Open foundation and fine-tuned chat models, 2023.
  39. Yuxiang Wei. hqcode. https://huggingface.co/datasets/yuxiang630/hqcode, 2024.
  40. Starcoder2-instruct: Fully transparent and permissive self-alignment for code generation. https://huggingface.co/blog/sc2-instruct, 2024.
  41. Magicoder: Empowering code generation with OSS-instruct. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 52632–52657. PMLR, 21–27 Jul 2024.
  42. Wikipedia contributors. Plagiarism — Wikipedia, the free encyclopedia, 2004. [Online; accessed 22-July-2004].
  43. Top leaderboard ranking = top coding proficiency, always? evoeval: Evolving coding benchmarks via llm, 2024.
  44. Qwen2 technical report, 2024.
  45. If LLM is the wizard, then code is the wand: A survey on how code empowers large language models to serve as intelligent agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
  46. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions, 2024.
Authors (3)
  1. Yuxiang Wei (40 papers)
  2. Hojae Han (5 papers)
  3. Rajhans Samdani (4 papers)