Zyda: A 1.3T Dataset for Open Language Modeling
The paper "Zyda: A 1.3T Dataset for Open LLMing" primarily focuses on addressing the growing need for extensive and high-quality datasets to train LLMs. The authors introduce Zyda, a dataset under a permissive license encompassing 1.3 trillion tokens. By integrating multiple major open-source datasets, the paper emphasizes the creation of a substantial corpus which enhances the quality of LLM pretraining. Rigorous filtering and deduplication protocols are applied to the dataset to ensure superior quality. The evaluation results indicate that Zyda significantly improves performance on various LLMing tasks and offers notable advantages over other open-source datasets.
Introduction
Over recent years, there has been an exponential increase in the scale, computational requirements, and capabilities of LLMs. While scaling laws, notably the Chinchilla scaling laws, provide guidance for allocating training compute optimally, attention has shifted towards inference-optimal models: smaller models trained on many more tokens, which heightens the need for large, high-quality datasets. Open-source datasets, however, have not kept pace with the rapid advancement of LLMs, prompting the need for larger, more robust datasets that allow open models to remain competitive.
Composition and Processing
The Zyda dataset amalgamates several established open-source datasets: The Pile, SlimPajama, RefinedWeb, C4, PeS2o, arxiv_s2orc_parsed, and StarCoder. Each dataset covers a different domain, ranging from general language modeling and scientific writing to code. A rigorous filtering and deduplication process is applied to ensure dataset quality.
Filtering
The filtering process is executed in two primary stages—substring replacement followed by document-level filtering. Key filters include:
- Removing documents with high proportions of non-alphabetic characters or of numbers
- Filtering out objectionable content based on explicit word lists
- Removing syntactically broken documents
- Special handling of StarCoder due to its code-specific content
Through these filters, documents with gibberish content, excessive numerical data, spam, and low-quality text are effectively removed.
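As a rough illustration of how such document-level heuristics might be implemented (the thresholds, banned-word list, and character-class checks below are hypothetical and not taken from the paper), a filter function could look like this:

```python
import re

# Hypothetical thresholds -- illustrative only, not the values used for Zyda.
MAX_NUMERIC_FRACTION = 0.3      # too many digits suggests tables or log dumps
MIN_ALPHA_FRACTION = 0.6        # too few letters suggests gibberish or markup debris
BANNED_WORDS = {"example_banned_word"}  # placeholder for an explicit word list


def keep_document(text: str) -> bool:
    """Return True if the document passes simple quality heuristics."""
    if not text.strip():
        return False

    chars = len(text)
    numeric_fraction = sum(c.isdigit() for c in text) / chars
    alpha_fraction = sum(c.isalpha() for c in text) / chars

    # Drop documents dominated by numbers or by non-alphabetic symbols.
    if numeric_fraction > MAX_NUMERIC_FRACTION or alpha_fraction < MIN_ALPHA_FRACTION:
        return False

    # Drop documents containing words from an explicit block list.
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & BANNED_WORDS:
        return False

    return True
```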
Deduplication
Deduplication employed LSH-based MinHash signatures to identify and remove duplicate documents both within and across datasets. Two Jaccard similarity thresholds (40% and 80%) were used to optimize deduplication rates. This step involved significant manual inspection to minimize false positives and ensure high precision in removing duplicate content.
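For illustration, here is a minimal sketch of MinHash/LSH near-duplicate removal using the datasketch library; the shingle size, number of permutations, and use of the 40% threshold as the LSH cut-off are assumptions made for the example rather than the paper's exact pipeline:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128    # number of hash permutations per signature (assumed)
THRESHOLD = 0.4   # approximate Jaccard similarity cut-off (one of the thresholds mentioned)


def minhash_of(text: str, num_perm: int = NUM_PERM) -> MinHash:
    """Build a MinHash signature from word-level shingles of a document."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(len(tokens) - 2):
        shingle = " ".join(tokens[i:i + 3])  # 3-word shingles (assumed granularity)
        m.update(shingle.encode("utf-8"))
    return m


def deduplicate(docs: dict[str, str]) -> list[str]:
    """Return IDs of documents kept after approximate near-duplicate removal."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_of(text)
        if lsh.query(sig):   # a near-duplicate was already kept -> drop this one
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```

At Zyda's scale this work would need to be sharded and run in a distributed fashion; the single-process loop above only conveys the core idea of signature-based near-duplicate detection.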
Performance Evaluation
The performance of Zyda was evaluated by training multiple transformer models and comparing them against models trained on established datasets, such as the Pythia suite. Key findings include:
- Zyda-trained models significantly outperform those trained on The Pile, especially in reasoning evaluations
- Performance advantages appear to increase with model scale
- Removal of the StarCoder subset from Zyda further enhances performance on non-code language modeling tasks
Empirical validation demonstrated Zyda's competitive edge over datasets like Dolma and individual subsets such as RefinedWeb and SlimPajama.
Implications
The introduction of Zyda has notable implications for the field:
- It provides an expansive, high-quality resource for pretraining LLMs, making it possible for open-source efforts to remain competitive with proprietary models
- The substantial improvements observed with meticulous filtering and deduplication practices underscore the importance of dataset quality over sheer volume
- Zyda’s availability can catalyze further research on dataset processing techniques, encouraging standardization and facilitating fairer comparisons between varying architectures and training methods
Future Developments
While Zyda represents a significant step forward, future work could expand its token count and refine its filtering methods. Techniques such as semantic clustering, LLM perplexity filtering, and data augmentation may further raise dataset quality. Moreover, ongoing integration with newly released datasets such as FineWeb could scale Zyda toward token counts approaching those used to train top-tier proprietary models.
In conclusion, Zyda serves as an essential resource fostering the continuous evolution of LLMs within the open-source community. Its structured development, characterized by comprehensive filtering and deduplication, sets a benchmark in dataset quality which will undoubtedly influence future methodologies and practices in AI research.