DataComp-LM: Towards Next-Generation Training Sets for LLMs
The paper "DataComp-LM: In search of the next generation of training sets for LLMs" introduces DataComp for LLMs (DCLM), a comprehensive benchmark designed to evaluate and refine LLM datasets. The paper posits that while the scaling of models contributes significantly to enhanced performance, the quality and composition of the training data are equally pivotal. DCLM encompasses a structured corpus derived from Common Crawl, effective pretraining practices via the OpenLM framework, and an array of 53 downstream tasks to facilitate extensive evaluation.
Research Contributions
Novel Testbed and Dataset Creation
The primary contribution of this work lies in the creation and release of DCLM, which includes:
- DCLM-Pool: A 240 trillion token corpus from Common Crawl, representing the largest publicly available dataset for LM training.
- Training Recipes: Leveraging OpenLM, the authors provide standardized pretraining configurations for models ranging from 412 million to 7 billion parameters (see the token-budget sketch after this list).
- Evaluation Suite: The authors include an extensive suite of 53 downstream evaluation tasks, enabling a robust assessment of models.
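To ground this scale ladder, here is a minimal sketch of the token-budget arithmetic behind such recipes, assuming the roughly Chinchilla-style ratio of about 20 training tokens per parameter that the "1x" scales approximate; the helper below is illustrative, not part of OpenLM:

```python
# Illustrative token-budget arithmetic for "1x" (roughly Chinchilla-style)
# pretraining scales; ~20 tokens per parameter is an approximation,
# not an exact DCLM hyperparameter.
def one_x_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Training-token budget for a Chinchilla-style '1x' run."""
    return params * tokens_per_param

for n_params in (412e6, 1.4e9, 7e9):
    print(f"{n_params / 1e9:.2f}B params -> ~{one_x_tokens(n_params) / 1e9:.0f}B tokens")
```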
Baselines and Findings
The authors performed rigorous baseline experiments, establishing that model-based filtering is key to distilling high-quality datasets from vast web corpora. Their resulting dataset, DCLM-Baseline, enables a 7B-parameter model to reach 64% 5-shot accuracy on MMLU with only 2.6 trillion training tokens, a 6.6 percentage point improvement over MAP-Neo, the previous open-data state of the art, while using 40% less compute. The baseline model is also competitive with models like Mistral-7B-v0.3 and Llama 3 8B despite requiring significantly less training compute.
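As a rough sanity check on the compute claim, the standard C = 6ND approximation can be applied; MAP-Neo's roughly 4.5 trillion training tokens come from its own report, and the figures below are back-of-the-envelope estimates, not the paper's accounting:

```python
# Back-of-the-envelope training-compute comparison using the common
# C = 6 * N * D approximation (N = parameters, D = training tokens).
N = 7e9                  # ~7B parameters for both models

dclm_tokens = 2.6e12     # DCLM-Baseline 7B training tokens (from the paper)
mapneo_tokens = 4.5e12   # MAP-Neo 7B training tokens (from its report)

dclm_flops = 6 * N * dclm_tokens
mapneo_flops = 6 * N * mapneo_tokens

savings = 1 - dclm_flops / mapneo_flops
print(f"DCLM-Baseline: {dclm_flops:.2e} FLOPs")
print(f"MAP-Neo:       {mapneo_flops:.2e} FLOPs")
print(f"Compute savings: {savings:.0%}")  # ~42%, consistent with the ~40% claim
```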
Detailed Analysis and Methodologies
Filtering and Deduplication Strategies
In constructing DCLM-Baseline, the authors compared several data curation strategies:
- Text Extraction: Comparing trafilatura, resiliparse, and Common Crawl's pre-extracted WET files showed that both HTML extractors clearly outperform WET text downstream; resiliparse offers quality comparable to trafilatura at substantially lower processing cost (an extraction sketch follows this list).
- Deduplication: Both MinHash-based and Bloom filter-based deduplication techniques were evaluated; the Bloom filter approach proved more scalable for datasets beyond 10 TB (a toy version follows this list).
- Model-based Filtering: fastText classifiers trained to separate instruction-formatted positives (OH-2.5 and ELI5 data) from random web pages were particularly effective, yielding significant gains over simpler quality-filtering heuristics (a classifier sketch follows this list).
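For the extraction step, a minimal sketch using Resiliparse (installable via `pip install resiliparse`); `extract_plain_text` and its `main_content` option are part of the Resiliparse API, but the HTML snippet and option choices here are illustrative rather than the paper's exact configuration:

```python
# Minimal main-content extraction with Resiliparse.
from resiliparse.extract.html2text import extract_plain_text

html = """
<html><body>
  <nav>Home | About | Contact</nav>
  <main><h1>A sample page</h1><p>Body text the pipeline should keep.</p></main>
  <footer>Copyright 2024</footer>
</body></html>
"""

# main_content=True asks the extractor to drop boilerplate such as
# navigation bars and footers, keeping only the article body.
text = extract_plain_text(html, main_content=True)
print(text)
```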
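For the deduplication step, the Bloom filter idea in miniature: hash each document (or shingle) into a fixed-size bit array and drop items whose bits are all already set. This is a toy sketch, not the paper's production pipeline, which sizes the filter from the expected item count and a target false-positive rate:

```python
# Toy Bloom-filter deduplication sketch; real systems choose num_bits and
# num_hashes from the expected number of items and an acceptable
# false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k hash positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_if_new(self, item: str) -> bool:
        """Set the item's bits; return True if it was (probably) unseen."""
        new = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                new = True
                self.bits[byte] |= 1 << bit
        return new

docs = ["the quick brown fox", "lorem ipsum", "the quick brown fox"]
bf = BloomFilter()
print([d for d in docs if bf.add_if_new(d)])  # exact repeat is dropped
```

Note the asymmetry: a Bloom filter never misses an exact repeat of something it has stored, but hash collisions can occasionally flag a genuinely new document as seen, a small over-deletion risk that is usually an acceptable trade at web scale.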
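For the filtering step, a sketch of fastText-based quality classification (via `pip install fasttext`). The training lines and corpus below are tiny stand-ins, and the paper reports keeping roughly the top 10% of documents by classifier score:

```python
# Sketch of fastText-based quality filtering.
# Training format: one "__label__pos <text>" line per instruction-style
# positive (e.g. OH-2.5, ELI5 answers) and "__label__neg <text>" per
# random web page.
import fasttext

with open("quality_train.txt", "w") as f:
    f.write("__label__pos Photosynthesis converts sunlight into chemical energy by ...\n")
    f.write("__label__pos Here is a step-by-step explanation of gradient descent ...\n")
    f.write("__label__neg click here buy now cheap deals limited offer\n")
    f.write("__label__neg WIN BIG casino bonus register today\n")

model = fasttext.train_supervised(input="quality_train.txt")

def positive_score(doc: str) -> float:
    """Classifier probability that a document is high quality."""
    labels, probs = model.predict(doc.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__pos", 0.0)

corpus = [
    "A clear explanation of how transformers compute attention ...",
    "cheap deals click now!!!",
    "Why is the sky blue? Shorter wavelengths scatter more strongly ...",
]
# Keep the top ~10% of documents by score (top 1 of this toy corpus).
kept = sorted(corpus, key=positive_score, reverse=True)[: max(1, len(corpus) // 10)]
print(kept)
```

Because fastText classifiers are linear models over n-gram embeddings, scoring is cheap enough to run over web-scale corpora, one reason they suit this kind of filtering.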
Evaluation Metrics
The evaluation suite encompasses 53 tasks spanning various domains like commonsense reasoning, factual recall, and problem-solving, with a specific focus on:
- MMLU Accuracy: Highlighted because its widespread adoption in recent model releases makes it a natural common ground for comparison.
- Core and Extended Metrics: The Core metric averages performance over a subset of 22 fundamental, low-variance tasks, while the Extended metric averages over all 53 tasks for a broader view (see the aggregation sketch below).
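Both aggregates rest on "centered" accuracy, which rescales raw accuracy so that random guessing maps to 0 and a perfect score to 1, making tasks with different numbers of answer choices comparable. A minimal sketch of the aggregation, with task baselines and scores made up for illustration:

```python
# Centered accuracy rescales raw accuracy so that random guessing maps
# to 0.0 and a perfect score to 1.0, making tasks with different numbers
# of answer choices comparable. Values below are illustrative only.
def centered_accuracy(acc: float, chance: float) -> float:
    return (acc - chance) / (1.0 - chance)

# task name -> (raw accuracy, random-guess baseline)
results = {
    "hellaswag": (0.70, 0.25),  # 4-way multiple choice
    "boolq": (0.75, 0.50),      # yes/no
    "arc_easy": (0.65, 0.25),
}

aggregate = sum(centered_accuracy(a, c) for a, c in results.values()) / len(results)
print(f"Mean centered accuracy: {aggregate:.3f}")
```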
Implications and Future Directions
The substantial gains observed through systematic data curation suggest that future efforts in LLM development should prioritize dataset quality and composition. The findings encourage further exploration of data-centric methodologies, potentially integrating novel sources and advanced filtering approaches.
The authors also call for extending DCLM to cover other crucial aspects such as fairness, multilinguality, and safety. Moreover, instruction tuning and domain-specific fine-tuning, especially in areas like code and math, have been proposed as immediate next steps to further enhance model capabilities.
Conclusion
"DataComp-LM: In search of the next generation of training sets for LLMs" provides a pivotal step towards understanding the impact of training data on LLMs. The presented DCLM benchmark, comprehensive evaluation suite, and encouraging experimental results underscore the importance of systematic data curation. Moving forward, DCLM offers a versatile testbed for researchers aimed at refining LLM datasets, promising more efficient and performant models in the evolving landscape of artificial intelligence.
Acknowledgements
This research was supported by various organizations including Open Philanthropy, Allen Institute for AI, and several academic institutions. The community involvement and open-source tool releases underscore a collaborative effort towards advancing LLM research. The benchmark and datasets are publicly available at https://datacomp.ai/dclm, providing a foundation for continued exploration and innovation.