DataComp-LM: Towards Next-Generation Training Sets for LLMs
The paper "DataComp-LM: In search of the next generation of training sets for LLMs" introduces DataComp for LLMs (DCLM), a comprehensive benchmark designed to evaluate and refine LLM datasets. The paper posits that while the scaling of models contributes significantly to enhanced performance, the quality and composition of the training data are equally pivotal. DCLM encompasses a structured corpus derived from Common Crawl, effective pretraining practices via the OpenLM framework, and an array of 53 downstream tasks to facilitate extensive evaluation.
Research Contributions
Novel Testbed and Dataset Creation
The primary contribution of this work lies in the creation and release of DCLM, which includes:
- DCLM-Pool: A 240 trillion token corpus from Common Crawl, representing the largest publicly available dataset for LM training.
- Training Recipes: Leveraging OpenLM, the authors provide standardized pretraining configurations for models ranging from 412 million to 7 billion parameters (see the token-budget sketch after this list).
- Evaluation Suite: The authors include an extensive suite of 53 downstream evaluation tasks, enabling a robust assessment of models.
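To ground this scale ladder, here is a minimal sketch of the token-budget arithmetic behind such recipes, assuming the roughly Chinchilla-style ratio of about 20 training tokens per parameter that the "1x" scales approximate; the helper below is illustrative, not part of OpenLM:

```python
# Illustrative token-budget arithmetic for "1x" (roughly Chinchilla-style)
# pretraining scales; ~20 tokens per parameter is an approximation,
# not an exact DCLM hyperparameter.
def one_x_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Training-token budget for a Chinchilla-style '1x' run."""
    return params * tokens_per_param

for n_params in (412e6, 1.4e9, 7e9):
    print(f"{n_params / 1e9:.2f}B params -> ~{one_x_tokens(n_params) / 1e9:.0f}B tokens")
```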
Baselines and Findings
The authors performed rigorous baseline experiments, establishing that model-based filtering is key to distilling high-quality datasets from vast web corpora. Their resulting dataset, DCLM-Baseline, enables a 7B-parameter model to reach 64% 5-shot accuracy on MMLU with only 2.6 trillion training tokens, a 6.6 percentage point improvement over MAP-Neo, the previous open-data state of the art, while using 40% less compute. The baseline model is also competitive with models like Mistral-7B-v0.3 and Llama 3 8B despite requiring significantly less training compute.
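As a rough sanity check on the compute claim, the standard C = 6ND approximation can be applied; MAP-Neo's roughly 4.5 trillion training tokens come from its own report, and the figures below are back-of-the-envelope estimates, not the paper's accounting:

```python
# Back-of-the-envelope training-compute comparison using the common
# C = 6 * N * D approximation (N = parameters, D = training tokens).
N = 7e9                  # ~7B parameters for both models

dclm_tokens = 2.6e12     # DCLM-Baseline 7B training tokens (from the paper)
mapneo_tokens = 4.5e12   # MAP-Neo 7B training tokens (from its report)

dclm_flops = 6 * N * dclm_tokens
mapneo_flops = 6 * N * mapneo_tokens

savings = 1 - dclm_flops / mapneo_flops
print(f"DCLM-Baseline: {dclm_flops:.2e} FLOPs")
print(f"MAP-Neo:       {mapneo_flops:.2e} FLOPs")
print(f"Compute savings: {savings:.0%}")  # ~42%, consistent with the ~40% claim
```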
Detailed Analysis and Methodologies
Filtering and Deduplication Strategies
In constructing DCLM-Baseline, the authors compared several data curation strategies:
- Text Extraction: Comparing trafilatura, resiliparse, and Common Crawl's pre-extracted WET files showed that both HTML extractors clearly outperform WET text downstream; resiliparse offers quality comparable to trafilatura at substantially lower processing cost (an extraction sketch follows this list).
- Deduplication: Both MinHash-based and Bloom filter-based deduplication techniques were evaluated; the Bloom filter approach proved more scalable for datasets beyond 10 TB (a toy version follows this list).
- Model-based Filtering: fastText classifiers trained to separate instruction-formatted positives (OH-2.5 and ELI5 data) from random web pages were particularly effective, yielding significant gains over simpler quality-filtering heuristics (a classifier sketch follows this list).
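For the extraction step, a minimal sketch using Resiliparse (installable via `pip install resiliparse`); `extract_plain_text` and its `main_content` option are part of the Resiliparse API, but the HTML snippet and option choices here are illustrative rather than the paper's exact configuration:

```python
# Minimal main-content extraction with Resiliparse.
from resiliparse.extract.html2text import extract_plain_text

html = """
<html><body>
  <nav>Home | About | Contact</nav>
  <main><h1>A sample page</h1><p>Body text the pipeline should keep.</p></main>
  <footer>Copyright 2024</footer>
</body></html>
"""

# main_content=True asks the extractor to drop boilerplate such as
# navigation bars and footers, keeping only the article body.
text = extract_plain_text(html, main_content=True)
print(text)
```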
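For the deduplication step, the Bloom filter idea in miniature: hash each document (or shingle) into a fixed-size bit array and drop items whose bits are all already set. This is a toy sketch, not the paper's production pipeline, which sizes the filter from the expected item count and a target false-positive rate:

```python
# Toy Bloom-filter deduplication sketch; real systems choose num_bits and
# num_hashes from the expected number of items and an acceptable
# false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k hash positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_if_new(self, item: str) -> bool:
        """Set the item's bits; return True if it was (probably) unseen."""
        new = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                new = True
                self.bits[byte] |= 1 << bit
        return new

docs = ["the quick brown fox", "lorem ipsum", "the quick brown fox"]
bf = BloomFilter()
print([d for d in docs if bf.add_if_new(d)])  # exact repeat is dropped
```

Note the asymmetry: a Bloom filter never misses an exact repeat of something it has stored, but hash collisions can occasionally flag a genuinely new document as seen, a small over-deletion risk that is usually an acceptable trade at web scale.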
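For the filtering step, a sketch of fastText-based quality classification (via `pip install fasttext`). The training lines and corpus below are tiny stand-ins, and the paper reports keeping roughly the top 10% of documents by classifier score:

```python
# Sketch of fastText-based quality filtering.
# Training format: one "__label__pos <text>" line per instruction-style
# positive (e.g. OH-2.5, ELI5 answers) and "__label__neg <text>" per
# random web page.
import fasttext

with open("quality_train.txt", "w") as f:
    f.write("__label__pos Photosynthesis converts sunlight into chemical energy by ...\n")
    f.write("__label__pos Here is a step-by-step explanation of gradient descent ...\n")
    f.write("__label__neg click here buy now cheap deals limited offer\n")
    f.write("__label__neg WIN BIG casino bonus register today\n")

model = fasttext.train_supervised(input="quality_train.txt")

def positive_score(doc: str) -> float:
    """Classifier probability that a document is high quality."""
    labels, probs = model.predict(doc.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__pos", 0.0)

corpus = [
    "A clear explanation of how transformers compute attention ...",
    "cheap deals click now!!!",
    "Why is the sky blue? Shorter wavelengths scatter more strongly ...",
]
# Keep the top ~10% of documents by score (top 1 of this toy corpus).
kept = sorted(corpus, key=positive_score, reverse=True)[: max(1, len(corpus) // 10)]
print(kept)
```

Because fastText classifiers are linear models over n-gram embeddings, scoring is cheap enough to run over web-scale corpora, one reason they suit this kind of filtering.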
Evaluation Metrics
The evaluation suite encompasses 53 tasks spanning various domains like commonsense reasoning, factual recall, and problem-solving, with a specific focus on:
- MMLU Accuracy: Highlighted because its widespread adoption in recent model releases makes it a natural common ground for comparison.
- Core and Extended Metrics: The Core metric averages performance over a subset of 22 fundamental, low-variance tasks, while the Extended metric averages over all 53 tasks for a broader view (see the aggregation sketch below).
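Both aggregates rest on "centered" accuracy, which rescales raw accuracy so that random guessing maps to 0 and a perfect score to 1, making tasks with different numbers of answer choices comparable. A minimal sketch of the aggregation, with task baselines and scores made up for illustration:

```python
# Centered accuracy rescales raw accuracy so that random guessing maps
# to 0.0 and a perfect score to 1.0, making tasks with different numbers
# of answer choices comparable. Values below are illustrative only.
def centered_accuracy(acc: float, chance: float) -> float:
    return (acc - chance) / (1.0 - chance)

# task name -> (raw accuracy, random-guess baseline)
results = {
    "hellaswag": (0.70, 0.25),  # 4-way multiple choice
    "boolq": (0.75, 0.50),      # yes/no
    "arc_easy": (0.65, 0.25),
}

aggregate = sum(centered_accuracy(a, c) for a, c in results.values()) / len(results)
print(f"Mean centered accuracy: {aggregate:.3f}")
```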
Implications and Future Directions
The substantial gains observed through systematic data curation suggest that future efforts in LLM development should prioritize dataset quality and composition. The findings encourage further exploration of data-centric methodologies, potentially integrating novel sources and advanced filtering approaches.
The authors also call for extending DCLM to cover other crucial aspects such as fairness, multilinguality, and safety. Moreover, instruction tuning and domain-specific fine-tuning, especially in areas like code and math, have been proposed as immediate next steps to further enhance model capabilities.
Conclusion
"DataComp-LM: In search of the next generation of training sets for LLMs" provides a pivotal step towards understanding the impact of training data on LLMs. The presented DCLM benchmark, comprehensive evaluation suite, and encouraging experimental results underscore the importance of systematic data curation. Moving forward, DCLM offers a versatile testbed for researchers aimed at refining LLM datasets, promising more efficient and performant models in the evolving landscape of artificial intelligence.
Acknowledgements
This research was supported by various organizations including Open Philanthropy, Allen Institute for AI, and several academic institutions. The community involvement and open-source tool releases underscore a collaborative effort towards advancing LLM research. The benchmark and datasets are publicly available at https://datacomp.ai/dclm, providing a foundation for continued exploration and innovation.