
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (2402.00159v2)

Published 31 Jan 2024 in cs.CL

Abstract: Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.

Introduction

Language models (LMs) are critical for a wide array of natural language processing tasks, from question answering to summarization. However, the specifics of LM development, particularly the composition of pretraining data, are often obscured, either by proprietary secrecy or by a lack of comprehensive documentation. Detailing and releasing an extensive pretraining dataset can address this gap and propel open research. This is the role of Dolma: a publicly available, three-trillion-token English corpus built from a variety of sources, including web content, scientific literature, software code, public-domain books, and encyclopedic materials.

Dolma Design Goals

The dataset was built to specific design requirements intended to enhance transparency and reproducibility in LM research: consistency with existing language model pretraining practices, sufficient scale for training large models, open release of the data, and extensive efforts to minimize the risk of harm from sensitive content. Dolma matches the scale and diversity of known pretraining corpora while adding meticulous curation to limit potentially harmful material, such as personally identifiable information and derogatory content.

Data Curation and Toolkit

The authors developed a high-performance toolkit to efficiently process large volumes of text for language model pretraining. The toolkit serves multiple purposes: language filtering to retain English-only content, quality filtering to eliminate low-quality text, and deduplication at several granularities (e.g., document and paragraph level). In addition, further filtering masks or removes personal data and mitigates the spread of undesired content, including toxic text.
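The Dolma toolkit itself is released as open source; as a rough, illustrative sketch of the kind of operations described above (not the toolkit's actual API), the Python snippet below combines a toy quality heuristic with exact document- and paragraph-level deduplication. All function names and thresholds here are hypothetical choices for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text: str) -> str:
    """Stable hash used for exact deduplication."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def passes_quality_filter(text: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    """Toy quality heuristic: enough words, not dominated by non-alphanumeric symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(text), 1) <= max_symbol_ratio


def dedupe_paragraphs(text: str, seen_paras: set) -> str:
    """Drop paragraphs whose fingerprint has already appeared elsewhere in the corpus."""
    kept = []
    for para in text.split("\n"):
        if not para.strip():
            continue
        h = fingerprint(para)
        if h not in seen_paras:
            seen_paras.add(h)
            kept.append(para)
    return "\n".join(kept)


def curate(docs):
    """Apply quality filtering, then document- and paragraph-level exact dedup."""
    seen_docs, seen_paras = set(), set()
    for doc in docs:
        if not passes_quality_filter(doc):
            continue
        h = fingerprint(doc)
        if h in seen_docs:
            continue
        seen_docs.add(h)
        deduped = dedupe_paragraphs(doc, seen_paras)
        if deduped:
            yield deduped
```

At Dolma's scale, in-memory sets of hashes would not fit; large-scale pipelines typically rely on probabilistic structures such as Bloom filters, and on near-duplicate matching in addition to exact hashing.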

Experiments and Findings

The paper includes a range of experiments that measure domain fit and downstream-task performance for models trained on intermediate states of Dolma. Such ablation studies are essential for understanding how different data subsets affect LM capabilities. For instance, including source code was found to benefit reasoning-related tasks, underscoring the importance of mixing multiple sources when constructing a pretraining dataset. Other experiments informed how content filters were tuned to balance data quality against breadth of information.
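Domain fit in ablations like these is commonly measured as the perplexity of an intermediate checkpoint on held-out text from each source. The sketch below, using the Hugging Face transformers library, shows one illustrative way to compute that metric; the checkpoint path and evaluation texts are placeholders, and this is not the paper's actual evaluation code.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model, tokenizer, texts, max_length=1024, device="cpu"):
    """Average per-token perplexity of a causal LM over held-out documents."""
    model.eval().to(device)
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(
                text, return_tensors="pt", truncation=True, max_length=max_length
            ).to(device)
            out = model(**enc, labels=enc["input_ids"])
            # out.loss is the mean negative log-likelihood over the shifted tokens.
            n_tokens = max(enc["input_ids"].size(1) - 1, 1)
            total_nll += out.loss.item() * n_tokens
            total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)


# Hypothetical usage: compare an intermediate checkpoint's fit on two domains.
# tokenizer = AutoTokenizer.from_pretrained("path/to/intermediate-checkpoint")
# model = AutoModelForCausalLM.from_pretrained("path/to/intermediate-checkpoint")
# ppl_web = perplexity(model, tokenizer, held_out_web_docs)
# ppl_code = perplexity(model, tokenizer, held_out_code_docs)
```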

The release also marks the use of Dolma to train OLMo, a state-of-the-art open language model, showcasing the corpus's effectiveness in practice. Because both the dataset and the accompanying curation toolkit are open source, Dolma is readily usable across a wide range of research directions.

Conclusion

Dolma reflects a commitment to transparency and scrutiny in language model training. It sets a new benchmark for dataset scale, diversity, and curation quality, paving the way for more inclusive and less biased research in language modeling. The corpus invites the broader community into a collaborative effort to advance LM research, grounded in principles of openness and responsible AI.

Authors (36)
  1. Luca Soldaini (62 papers)
  2. Rodney Kinney (8 papers)
  3. Akshita Bhagia (12 papers)
  4. Dustin Schwenk (15 papers)
  5. David Atkinson (33 papers)
  6. Russell Authur (4 papers)
  7. Ben Bogin (22 papers)
  8. Khyathi Chandu (17 papers)
  9. Jennifer Dumas (2 papers)
  10. Yanai Elazar (44 papers)
  11. Valentin Hofmann (21 papers)
  12. Ananya Harsh Jha (8 papers)
  13. Sachin Kumar (68 papers)
  14. Li Lucy (12 papers)
  15. Xinxi Lyu (5 papers)
  16. Nathan Lambert (37 papers)
  17. Ian Magnusson (12 papers)
  18. Jacob Morrison (15 papers)
  19. Niklas Muennighoff (56 papers)
  20. Aakanksha Naik (23 papers)
Citations (156)