Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset with Nemotron-CC
The paper "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset" ventures into the development of a more efficient pretraining dataset for LLMs by refining the Common Crawl dataset. The authors propose an innovative methodology that leverages classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters to transform Common Crawl data into a high-quality, large-scale pretraining dataset for long-horizon training.
Key Contributions
- Efficient Data Transformation: The authors improve the quality of Common Crawl data without excessively trimming its volume by combining multiple quality classifiers to label documents, generating synthetic data that rephrases existing content, and minimizing the use of heuristic filters. The result is a 6.3 trillion (T) token dataset, consisting of 4.4T globally deduplicated original tokens and 1.9T synthetically generated tokens, a notable advance over prior datasets such as DCLM and FineWeb-Edu.
- Performance Evaluation: Quality is evaluated quantitatively by training 8 billion (B) parameter transformer models. Under a 1T-token training budget, the proposed high-quality subset yields a 5.6-point improvement on MMLU over DCLM, a substantial gain. The full 6.3T token dataset matches DCLM in accuracy while containing four times more unique real tokens.
- Long-Horizon Training: The paper further demonstrates the dataset's value for long-horizon pretraining: a model trained on it for 15T tokens scores higher than Llama 3.1 8B, which was trained on a comparable number of tokens, with +5 points on MMLU, +3.1 on ARC-Challenge, and improved average performance across ten diverse tasks. This indicates the dataset holds up well under prolonged training (see the illustrative calculation after this list).
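To make the quantity side of the tradeoff concrete, the short Python sketch below estimates how many passes over a dataset a long training run implies. The 15T-token horizon and the 6.3T-token dataset size come from the results above; the smaller 1.5T-token comparison dataset is a purely hypothetical figure used only to illustrate the effect of size.

```python
# Back-of-the-envelope illustration: how many passes (epochs) over a dataset
# a fixed training token budget implies. The 6.3e12 figure is the dataset
# size quoted above; the 1.5e12 "smaller dataset" is a hypothetical
# comparison point, not a number from the paper.

def epochs(train_tokens: float, dataset_tokens: float) -> float:
    """Average number of passes over the dataset for a given token budget."""
    return train_tokens / dataset_tokens


HORIZON = 15e12  # long-horizon training budget of 15T tokens

print(f"6.3T-token dataset: ~{epochs(HORIZON, 6.3e12):.1f} epochs")
print(f"1.5T-token dataset (hypothetical): ~{epochs(HORIZON, 1.5e12):.1f} epochs")
```

Fewer repeats of the same unique tokens is precisely what the four-fold increase in unique real tokens buys during such long runs.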
Methodological Insights
The efficacy of "Nemotron-CC" is attributed to three main methodological innovations:
- Classifier Ensembling: Ensembling several quality classifiers, including the FineWeb-Edu and DCLM classifiers alongside custom classifiers trained on model-based labels, significantly improves recall of high-quality tokens (a minimal sketch of one possible ensembling rule follows this list).
- Synthetic Data Generation: Rephrasing both low- and high-quality documents with an LLM increases dataset diversity and quality, reducing training perplexity and improving benchmark performance (see the rephrasing sketch after this list).
- Filter Optimization: Reducing reliance on non-learned heuristic filters retains high-quality data that would otherwise be discarded, improving the balance between data quality and yield.
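One plausible way to realize classifier ensembling is to map each classifier's raw score into a small number of quantile buckets and then take the maximum bucket across classifiers, which favors recall: a document rated highly by any single classifier lands in a high bucket. The sketch below is only an illustration under that assumption; the bucket count, the max-combination rule, and the example scores are invented for clarity and are not the paper's exact recipe.

```python
# Hedged sketch of quality-classifier ensembling via quantile bucketing.
# The 5-bucket scheme and the max-combination rule are illustrative
# assumptions, not the paper's exact method.

import numpy as np


def to_buckets(scores: np.ndarray, n_buckets: int = 5) -> np.ndarray:
    """Map raw classifier scores to integer quality buckets via quantiles."""
    edges = np.quantile(scores, np.linspace(0, 1, n_buckets + 1)[1:-1])
    return np.digitize(scores, edges)  # 0 = lowest bucket, n_buckets - 1 = highest


def ensemble_quality(score_matrix: np.ndarray) -> np.ndarray:
    """Combine per-classifier buckets with a max rule, which keeps any
    document that at least one classifier rates highly."""
    buckets = np.stack(
        [to_buckets(score_matrix[:, j]) for j in range(score_matrix.shape[1])],
        axis=1,
    )
    return buckets.max(axis=1)


# Example: rows are documents, columns are classifiers (e.g., FineWeb-Edu,
# DCLM, and a custom classifier trained on model-based labels).
scores = np.array([
    [0.10, 0.80, 0.30],
    [0.95, 0.20, 0.60],
    [0.05, 0.10, 0.15],
])
print(ensemble_quality(scores))  # one ensemble quality bucket per document
```

A max rule is only one choice; averaging or voting would trade some recall for precision.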
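The synthetic-data step can likewise be pictured as prompting an instruction-tuned LLM with different templates depending on a document's quality bucket: low-quality text is rewritten into cleaner prose, while high-quality text is turned into complementary variants such as question-answer pairs. The sketch below is hypothetical: the prompt wording, the `generate` helper, and the `high_threshold` parameter are assumptions for illustration, not the paper's actual templates or models.

```python
# Hedged sketch of synthetic data generation by LLM rephrasing.
# Prompt templates and the generate() stand-in are illustrative assumptions.

LOW_QUALITY_PROMPT = (
    "Rewrite the following web text as clear, well-structured prose. "
    "Preserve all factual content and do not add new information.\n\n"
    "Text:\n{document}\n\nRewritten text:"
)

HIGH_QUALITY_PROMPT = (
    "Based on the following passage, write diverse question-answer pairs "
    "covering its key facts.\n\nPassage:\n{document}\n\nQ&A pairs:"
)


def generate(prompt: str) -> str:
    """Stand-in for a call to an instruction-tuned LLM; plug in a real
    inference API here."""
    return f"<model output for a prompt of {len(prompt)} characters>"


def rephrase(document: str, quality_bucket: int, high_threshold: int = 3) -> str:
    """Rewrite low-quality text into cleaner prose; turn high-quality text
    into complementary synthetic variants (here: Q&A pairs)."""
    template = (
        HIGH_QUALITY_PROMPT if quality_bucket >= high_threshold else LOW_QUALITY_PROMPT
    )
    return generate(template.format(document=document))


# Example usage with a low-quality document (bucket 1 on a 0-4 scale):
print(rephrase("thx 4 visiting our SITE!!! best deals hurry", quality_bucket=1))
```

The idea, as described above, is that rephrasing improves the usefulness of low-quality text and adds diversity on top of high-quality text rather than replacing it.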
Implications and Future Considerations
The presented methods exemplify a careful approach to data collection and preparation for LLM pretraining, addressing the common dilemma of data quality versus quantity. Critically, the paper suggests that combining synthetic data, more sophisticated filtering, and classifier ensembling points to a promising direction for building datasets that support extensive token-horizon training.
Future research could refine the approach by further diversifying the classifier ensemble and by extending these methods to non-English data, given the significant role of multilingual model training. Verifying the factual integrity of rephrased data also remains an open challenge and warrants further investigation to mitigate the risks introduced by synthetic data alterations.
In conclusion, "Nemotron-CC" provides a valuable framework for transforming Common Crawl data into a refined, expansive pretraining dataset suitable for long-horizon training, thereby advancing the capabilities of LLMs in addressing a broader range of tasks with improved accuracy and efficiency.