Enhancing LLM Accuracy with Two-Phase Pretraining
The paper titled "Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining" advances the understanding and methodology of pretraining LLMs by introducing a structured two-phase pretraining strategy. The contribution is timely: current literature and industry practice rarely specify optimal data mixtures or how to order data during pretraining, both of which are crucial for model accuracy, especially for LLMs trained over large token horizons.
Methodology
The authors propose a systematic exploration of a two-phase training approach: an initial phase built on a diverse mixture of data sources, predominantly high-quality web crawl data, followed by a second phase whose blend emphasizes high-quality datasets such as math, code, and curated web resources. The strategy is validated through extensive empirical testing at token horizons from 1 trillion to 15 trillion tokens and at model sizes up to 25 billion parameters.
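To make the phase blends concrete, here is a minimal sketch of how per-phase sampling weights might be expressed and used to draw training data. The source names and proportions are illustrative assumptions, not the paper's exact blend.

```python
import random

# Hypothetical per-phase sampling weights (fraction of training tokens per source).
# These names and numbers are assumptions for illustration, not the paper's blend.
PHASE_1_BLEND = {
    "web_crawl_high_quality": 0.55,
    "web_crawl_medium_quality": 0.25,
    "books": 0.10,
    "code": 0.05,
    "math": 0.05,
}
PHASE_2_BLEND = {
    "web_crawl_high_quality": 0.40,
    "math": 0.25,
    "code": 0.25,
    "academic_text": 0.10,
}

def sample_source(blend: dict[str, float], rng: random.Random) -> str:
    """Pick a data source with probability proportional to its blend weight."""
    sources, weights = zip(*blend.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(PHASE_1_BLEND, rng) for _ in range(5)])
```

In a real pipeline each source would map to tokenized shards on disk and the sampler would emit documents rather than source names, but the weighting logic is the same.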
Key Findings
- Two-Phase Pretraining Superiority: The two-phase method surpassed conventional data-ordering baselines on benchmarks, improving average accuracy by 3.4% over random data ordering and by 17% over the natural distribution of tokens.
- Strategic Data Blending: Concentrating high-quality data in the second pretraining phase sharpened model performance, notably on reasoning and coding benchmarks such as GSM8K and code-generation tasks.
- Scaling Insights: Determining optimal blend ratios on downsampled data at a 1 trillion token scale transferred effectively to a 15 trillion token horizon, demonstrating the robustness of the two-phase approach across model and data scales (a minimal sketch of this downsampling logic follows this list).
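To illustrate the downsampling idea referenced above, the sketch below fits a target blend into a reduced token budget while capping how many times scarce sources (e.g., math) may be repeated. The token counts, epoch caps, and target shares are hypothetical assumptions, not the paper's values.

```python
# Fit a target blend into a downsampled token budget, assuming hypothetical
# per-source token counts and repeat (epoch) caps. Sources that would exceed
# their cap are clipped, and the freed budget is redistributed proportionally.

AVAILABLE_TOKENS = {"web_crawl": 800e9, "math": 30e9, "code": 60e9}  # assumed
MAX_EPOCHS = {"web_crawl": 1.0, "math": 4.0, "code": 4.0}            # assumed

def fit_blend(target: dict[str, float], budget: float) -> dict[str, float]:
    """Return tokens to draw per source, honoring epoch caps on scarce data."""
    alloc = {s: share * budget for s, share in target.items()}
    for _ in range(len(target)):
        caps = {s: AVAILABLE_TOKENS[s] * MAX_EPOCHS[s] for s in alloc}
        over = [s for s in alloc if alloc[s] > caps[s]]
        if not over:
            break
        spare = sum(alloc[s] - caps[s] for s in over)
        for s in over:
            alloc[s] = caps[s]
        free = [s for s in alloc if s not in over]
        if not free:
            break
        total_free = sum(alloc[s] for s in free)
        for s in free:  # redistribute clipped tokens proportionally
            alloc[s] += spare * alloc[s] / total_free
    return alloc

# E.g., a 1T-token budget: math and code hit their caps, web crawl absorbs the rest.
print(fit_blend({"web_crawl": 0.5, "math": 0.2, "code": 0.3}, budget=1e12))
```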
Practical Implications
The insights from this paper give practitioners a framework for building scalable, data-efficient pretraining pipelines. By optimizing data quality and sequencing, the method not only improves performance across general and specialized tasks but also informs decisions around data curation and preprocessing, a critical step in LLM development.
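As a sketch of what this sequencing might look like in a training loop, the skeleton below switches the sampling blend once a fixed fraction of the token budget has been consumed. The 60/40 phase split, the toy budget, and the blend dictionaries are all assumptions for illustration.

```python
# Skeleton of a phase-switched pretraining schedule: the only change at the
# phase boundary is which blend drives data sampling. All numbers are assumed.

PHASE_1_BLEND = {"web_crawl": 0.8, "code": 0.1, "math": 0.1}  # assumed weights
PHASE_2_BLEND = {"web_crawl": 0.4, "code": 0.3, "math": 0.3}  # assumed weights

TOTAL_TOKENS = 1_000_000   # toy budget for illustration
PHASE_1_FRACTION = 0.6     # assumed fraction of training spent in phase 1
TOKENS_PER_BATCH = 1_000

def training_schedule():
    """Yield the active blend for each batch until the token budget is spent."""
    tokens_seen = 0
    while tokens_seen < TOTAL_TOKENS:
        in_phase_1 = tokens_seen < PHASE_1_FRACTION * TOTAL_TOKENS
        yield PHASE_1_BLEND if in_phase_1 else PHASE_2_BLEND
        tokens_seen += TOKENS_PER_BATCH  # a real loop would also take an optimizer step

switches = sum(1 for blend in training_schedule() if blend is PHASE_2_BLEND)
print(f"{switches} of {TOTAL_TOKENS // TOKENS_PER_BATCH} batches use the phase-2 blend")
```

One natural design choice here is to change only the data distribution at the boundary, leaving the optimizer and learning-rate schedule untouched.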
Theoretical Implications and Future Directions
Theoretically, the work sharpens our understanding of how data quality and training order affect LLM internal representations and generalization. While the paper focuses on text data, the same principles could extend to multimodal models, and to examining whether multilingual, domain-specific, or task-oriented corpora benefit from similar pretraining structure.
Further refining multi-phase approaches, or introducing task-specific phases, could yield models better equipped for niche applications such as domain-specific systems in healthcare or law. The paper paves the way for future work on adaptive pretraining strategies, inviting researchers to investigate the granular effects of data characteristics on model behavior and on ethical outcomes in AI deployment.