Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining (2409.02326v1)

Published 3 Sep 2024 in cs.CL and cs.AI

Abstract: Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of LLMs. However, the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.

An Insightful Overview of "Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining"

Introduction

"Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining" introduces Arctic-SnowCoder-1.3B, a sophisticated small code model that underscores the pivotal role of data quality in pretraining efforts. This paper provides a comprehensive analysis of how progressively refined data can enhance the performance of code models, presenting Arctic-SnowCoder as a landmark in the domain of data-efficient pretraining.

Methodology

The paper proposes a three-phase training methodology:

  1. General Pretraining: The first phase trains Arctic-SnowCoder on 500B tokens of standard-quality code, sourced primarily from cleaned versions of The Stack v1 and GitHub crawls and preprocessed with basic filtering, deduplication, and decontamination. Files are grouped into repository-level documents partitioned by programming language, which empirically outperforms grouping by repository name alone.
  2. Continued Pretraining with High-Quality Data: The second phase continues training on 50B high-quality tokens selected from the phase-one corpus. A BERT-based quality annotator, trained to distinguish good code from random pretraining files using high-quality open-source code plus instruction data from Magicoder and StarCoder2-Instruct as positive examples, scores the corpus so that only the top-ranked files are kept. Aligning this selected data with downstream task distributions is crucial to the gains that follow (a minimal sketch of such a scoring step appears after this list).
  3. Enhanced Pretraining with Synthetic Data: In the final phase, 5B tokens of synthetic data are generated by Llama-3.1-70B, using the phase-two high-quality data as seeds. This adapts the Magicoder OSS-Instruct methodology to produce problem-solving-oriented code documents, further improving model performance (a sketch of this seeded generation step also follows the list).
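
To make the phase-2 selection concrete, the snippet below is a minimal, hypothetical sketch of how a BERT-style quality annotator could be used to rank and filter code files. The model name, keep fraction, and helper functions are illustrative assumptions, not the paper's released artifacts; in particular, the paper trains its own annotator rather than using an off-the-shelf checkpoint.

```python
# Hypothetical sketch of the phase-2 selection step: score code files with a
# BERT-style quality classifier and keep only the top-scoring slice.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; stands in for a classifier fine-tuned to separate
# high-quality code (plus instruction data) from random pretraining files.
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def quality_score(code: str) -> float:
    """Return the predicted probability that a code file is 'high quality'."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def select_top_files(files: list[str], keep_fraction: float = 0.1) -> list[str]:
    """Rank files by predicted quality and keep a small top fraction
    (the fraction here is a placeholder, not the paper's exact ratio)."""
    ranked = sorted(files, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]
```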

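The phase-3 generation step can be pictured as the seeded rewriting routine below. This is a sketch under the assumption of a generic text-generation callable; the prompt wording and the `generate` hook are illustrative adaptations of the Magicoder OSS-Instruct idea, and the exact prompts and serving setup used with Llama-3.1-70B are not reproduced here.

```python
# Hypothetical sketch of the phase-3 step: use a high-quality phase-2 file as a
# seed and ask a large model (the paper uses Llama-3.1-70B) to write a fresh,
# self-contained, problem-solving-oriented code document.
from typing import Callable

PROMPT_TEMPLATE = """You are given a snippet of real-world code as inspiration:

{seed}

Write a new, self-contained, high-quality document that poses a related
programming problem and solves it with clean, well-commented code."""

def synthesize_document(seed_snippet: str, generate: Callable[[str], str]) -> str:
    """Turn one seed snippet into one synthetic pretraining document.

    `generate` is any text-generation function (for example, a wrapper around
    a hosted Llama-3.1-70B endpoint); it is deliberately left abstract here.
    """
    prompt = PROMPT_TEMPLATE.format(seed=seed_snippet)
    return generate(prompt)
```
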
Results

Arctic-SnowCoder-1.3B demonstrates impressive performance across several benchmarks:

  • BigCodeBench: The model achieves state-of-the-art results among similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%.
  • HumanEval+ and MBPP+: The model matches or surpasses models trained on significantly more extensive datasets, such as StarCoder2-3B and Qwen1.5-1.8B.
  • EvoEval: Arctic-SnowCoder remains competitive, showcasing its robustness in diverse practical and challenging programming tasks.

The paper's comprehensive analysis underscores the advantages of three-phase pretraining, revealing consistent improvements across all training stages.

Ablation Studies

The paper includes various ablation studies that validate the design choices:

  • Repo-Level Data Grouping: Grouping file-level data into repository-level documents, partitioned by programming language, significantly outperforms grouping by repository name alone.
  • Quality Annotator: The best-performing model-based quality annotator combines high-quality code files with instruction data, highlighting the importance of data alignment with downstream applications.
  • Learning Rate Schedule: A re-warmup schedule that ramps the learning rate back up to the maximum used in general pretraining and then decays it linearly proves most effective (see the sketch after this list).
  • Repetitions of High-Quality Data: Repeating high-quality tokens four times during continued pretraining yields the best overall performance, emphasizing the necessity of optimizing repetitions.
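
As referenced in the learning-rate ablation above, the re-warmup schedule can be sketched as a simple piecewise-linear function: ramp back to the pretraining maximum, then decay linearly. The step counts and peak value below are placeholders, not the paper's hyperparameters.

```python
# Minimal sketch of a re-warmup learning-rate schedule for continued pretraining.
def rewarmup_lr(step: int, warmup_steps: int, total_steps: int, max_lr: float) -> float:
    if step < warmup_steps:
        # Ramp linearly from ~0 back up to the maximum pretraining LR.
        return max_lr * step / max(1, warmup_steps)
    # Then decay linearly from max_lr down to 0 over the remaining steps.
    remaining = total_steps - warmup_steps
    progress = (step - warmup_steps) / max(1, remaining)
    return max_lr * max(0.0, 1.0 - progress)

# Example: shape of a short continued-pretraining run with re-warmup
# (placeholder values chosen only for illustration).
schedule = [rewarmup_lr(s, warmup_steps=100, total_steps=1000, max_lr=3e-4)
            for s in range(1000)]
```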

Contributions and Implications

The contributions of this research are manifold:

  • Introduction of Arctic-SnowCoder-1.3B, a high-performing small code model trained on a fraction of the tokens used by other state-of-the-art models.
  • Demonstration that high-quality and synthetic data significantly enhance model performance, even when sourced from the same raw corpus.
  • Detailed analysis and practical insights into optimal design choices for repo-level data grouping, learning rate schedules, and repetitions of high-quality data.

Future Developments

The findings elucidate the significance of data quality in pretraining LLMs, particularly in specialized domains like code. Future research could focus on further refining the methodologies for identifying and generating high-quality synthetic data. Additionally, exploring the integration of Arctic-SnowCoder's training methodologies with larger and more diverse datasets may yield even greater performance improvements.

Conclusion

"Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining" offers a robust framework for understanding and leveraging data quality in pretraining code models. The phased approach to data refinement not only enhances performance but also provides a blueprint for future advancements in AI-driven code generation. The paper stands as a testament to the importance of aligning pretraining data distributions with downstream task requirements, paving the way for more efficient and effective model development in the field of AI.

References (46)
  1. Phi-3 technical report: A highly capable language model locally on your phone, 2024.
  2. Smollm - blazingly fast and remarkably powerful. https://huggingface.co/blog/smollm, 2024.
  3. To code, or not to code? exploring impact of code in pre-training, 2024.
  4. Program synthesis with large language models, 2021.
  5. Enriching word vectors with subword information, 2017.
  6. Andrew P. Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
  7. Evaluating large language models trained on code, 2021.
  8. Software heritage: Why and how to preserve software source code. In iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 2017. https://hal.archives-ouvertes.fr/hal-01590958.
  9. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024.
  10. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence, 2024.
  11. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
  12. The llama 3 herd of models, 2024.
  13. Textbooks are all you need, 2023.
  14. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.
  15. Mixtral of experts, 2024.
  16. Adam: A method for stochastic optimization, 2017.
  17. The stack: 3 tb of permissively licensed source code, 2022.
  18. Datacomp-lm: In search of the next generation of training sets for language models, 2024.
  19. Starcoder: may the source be with you!, 2023.
  20. Textbooks are all you need ii: phi-1.5 technical report, 2023.
  21. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 21558–21572. Curran Associates, Inc., 2023.
  22. Roberta: A robustly optimized bert pretraining approach, 2019.
  23. Starcoder 2 and the stack v2: The next generation, 2024.
  24. At which training stage does code data help LLMs reasoning? In The Twelfth International Conference on Learning Representations, 2024.
  25. Language models of code are few-shot commonsense learners. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1384–1403, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
  26. Arctic-embed: Scalable, efficient, and accurate text embedding models, 2024.
  27. Granite code models: A family of open foundation models for code intelligence, 2024.
  28. Codegen: An open large language model for code with multi-turn program synthesis. In International Conference on Learning Representations, 2023.
  29. OpenAI. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/, 2022.
  30. The fineweb datasets: Decanting the web for the finest text data at scale, 2024.
  31. Stable code technical report, 2024.
  32. Snowflake AI Research. Snowflake arctic: The best llm for enterprise ai — efficiently intelligent, truly open, 2024.
  33. Code llama: Open foundation models for code, 2024.
  34. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.
  35. Roformer: Enhanced transformer with rotary position embedding, 2023.
  36. Codegemma: Open code models based on gemma, 2024.
  37. Teknium. Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants. https://huggingface.co/datasets/teknium/OpenHermes2.5, 2023.
  38. Llama 2: Open foundation and fine-tuned chat models, 2023.
  39. Yuxiang Wei. hqcode. https://huggingface.co/datasets/yuxiang630/hqcode, 2024.
  40. Starcoder2-instruct: Fully transparent and permissive self-alignment for code generation. https://huggingface.co/blog/sc2-instruct, 2024.
  41. Magicoder: Empowering code generation with OSS-instruct. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 52632–52657. PMLR, 21–27 Jul 2024.
  42. Wikipedia contributors. Plagiarism — Wikipedia, the free encyclopedia, 2004. [Online; accessed 22-July-2004].
  43. Top leaderboard ranking = top coding proficiency, always? evoeval: Evolving coding benchmarks via llm, 2024.
  44. Qwen2 technical report, 2024.
  45. If LLM is the wizard, then code is the wand: A survey on how code empowers large language models to serve as intelligent agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
  46. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions, 2024.
Authors (3)
  1. Yuxiang Wei (40 papers)
  2. Hojae Han (5 papers)
  3. Rajhans Samdani (4 papers)