Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
Abstract: Recent studies increasingly demonstrate that high-quality data is crucial for effective pretraining of LLMs, yet the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination; (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files along with instruction data from Magicoder and StarCoder2-Instruct; and (3) enhanced pretraining with 5B synthetic tokens created by Llama-3.1-70B using phase-two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B, which was pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.
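The phase-two selection step described above (score each code file with a quality annotator, then keep only the top-scoring fraction) can be sketched as follows. This is a minimal illustrative sketch, not the paper's pipeline: the real annotator is a trained BERT-style classifier, whereas `quality_score` here is a stand-in heuristic, and the names, weights, and `keep_fraction` value are all assumptions for demonstration.

```python
def quality_score(code: str) -> float:
    """Toy stand-in for a BERT-style annotator's P(high-quality | code).

    The paper's annotator is a trained classifier; this heuristic merely
    rewards documentation, structure, and non-trivial length.
    """
    score = 0.0
    if '"""' in code or "#" in code:        # has comments or docstrings
        score += 0.5
    if "def " in code or "class " in code:  # has structured definitions
        score += 0.3
    if len(code.splitlines()) > 3:          # non-trivial length
        score += 0.2
    return min(score, 1.0)


def select_high_quality(files: list[str], keep_fraction: float = 0.1) -> list[str]:
    """Rank files by annotator score and keep the top fraction,
    mirroring the 500B -> 50B token reduction described in the abstract."""
    ranked = sorted(files, key=quality_score, reverse=True)
    k = max(1, int(len(files) * keep_fraction))
    return ranked[:k]


# Tiny toy corpus standing in for the phase-one pretraining data.
corpus = [
    'def add(a, b):\n    """Return a + b."""\n    return a + b\n',
    "x=1",
    "# utility\nclass Stack:\n    def push(self, v):\n        pass\n",
]
selected = select_high_quality(corpus, keep_fraction=0.5)
print(len(selected))  # keeps the single best-scoring file of the three
```

In the actual system, scoring 500B tokens of code implies running the annotator at corpus scale; ranking-then-truncating as above is one simple way to realize a fixed token budget for the continued-pretraining phase.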
- Phi-3 technical report: A highly capable language model locally on your phone, 2024.
- Smollm - blazingly fast and remarkably powerful. https://huggingface.co/blog/smollm, 2024.
- To code, or not to code? exploring impact of code in pre-training, 2024.
- Program synthesis with large language models, 2021.
- Enriching word vectors with subword information, 2017.
- Andrew P. Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
- Evaluating large language models trained on code, 2021.
- Software heritage: Why and how to preserve software source code. In iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 2017. https://hal.archives-ouvertes.fr/hal-01590958.
- Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024.
- Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence, 2024.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- The llama 3 herd of models, 2024.
- Textbooks are all you need, 2023.
- Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.
- Mixtral of experts, 2024.
- Adam: A method for stochastic optimization, 2017.
- The stack: 3 tb of permissively licensed source code, 2022.
- Datacomp-lm: In search of the next generation of training sets for language models, 2024.
- Starcoder: may the source be with you!, 2023.
- Textbooks are all you need ii: phi-1.5 technical report, 2023.
- Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 21558–21572. Curran Associates, Inc., 2023.
- Roberta: A robustly optimized bert pretraining approach, 2019.
- Starcoder 2 and the stack v2: The next generation, 2024.
- At which training stage does code data help LLMs reasoning? In The Twelfth International Conference on Learning Representations, 2024.
- Language models of code are few-shot commonsense learners. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1384–1403, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
- Arctic-embed: Scalable, efficient, and accurate text embedding models, 2024.
- Granite code models: A family of open foundation models for code intelligence, 2024.
- Codegen: An open large language model for code with multi-turn program synthesis. In International Conference on Learning Representations, 2023.
- OpenAI. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/, 2022.
- The fineweb datasets: Decanting the web for the finest text data at scale, 2024.
- Stable code technical report, 2024.
- Snowflake AI Research. Snowflake arctic: The best llm for enterprise ai — efficiently intelligent, truly open, 2024.
- Code llama: Open foundation models for code, 2024.
- Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.
- Roformer: Enhanced transformer with rotary position embedding, 2023.
- Codegemma: Open code models based on gemma, 2024.
- Teknium. Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants. https://huggingface.co/datasets/teknium/OpenHermes2.5, 2023.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- Yuxiang Wei. hqcode. https://huggingface.co/datasets/yuxiang630/hqcode, 2024.
- Starcoder2-instruct: Fully transparent and permissive self-alignment for code generation. https://huggingface.co/blog/sc2-instruct, 2024.
- Magicoder: Empowering code generation with OSS-instruct. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 52632–52657. PMLR, 21–27 Jul 2024.
- Wikipedia contributors. Plagiarism — Wikipedia, the free encyclopedia, 2004. [Online; accessed 22-July-2004].
- Top leaderboard ranking = top coding proficiency, always? evoeval: Evolving coding benchmarks via llm, 2024.
- Qwen2 technical report, 2024.
- If LLM is the wizard, then code is the wand: A survey on how code empowers large language models to serve as intelligent agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
- Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions, 2024.