
What's In My Big Data? (2310.20707v2)

Published 31 Oct 2023 in cs.CL and cs.LG

Abstract: Large text corpora are the backbone of LLMs. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular LLMs, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them.

Analysis of the "What's In My Big Data?" Paper

This essay analyzes the research paper "What's In My Big Data?" (WIMBD), which examines the composition of the large-scale text corpora used to train LLMs. These pretraining corpora grow in size and importance as the field of AI progresses, yet they remain poorly understood with respect to data quality, potential biases, and ethical considerations.

Key Contributions and Findings

The authors developed WIMBD, a platform for comprehensively analyzing and documenting large text datasets. The platform builds on two primitives: search, for retrieving documents, and count, for computing corpus-wide statistics. This combination enables extensive evaluations of corpora spanning tens of terabytes on a standard compute node.
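The two primitives can be illustrated with a toy in-memory version. This is a simplified sketch, not WIMBD's actual implementation, which distributes counting across many workers and backs search with a full-text index; the corpus, function names, and n-gram tokenization here are illustrative assumptions.

```python
from collections import Counter

# Toy corpus standing in for a multi-terabyte text dataset.
corpus = [
    "the quick brown fox",
    "the lazy dog sleeps",
    "the quick brown fox",  # exact duplicate, as WIMBD often finds
]

def count_ngrams(docs, n=2):
    """Count word n-grams across all documents (the 'count' primitive)."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def search(docs, query):
    """Return indices of documents containing the query string (the 'search' primitive)."""
    return [i for i, doc in enumerate(docs) if query in doc]

bigrams = count_ngrams(corpus)
print(bigrams[("quick", "brown")])  # 2 -- the duplicate doubles the count
print(search(corpus, "lazy dog"))   # [1]
```

Many of the paper's sixteen analyses (duplicate rates, n-gram statistics, contamination checks) reduce to compositions of these two operations at scale.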

The authors apply WIMBD to ten prominent English corpora, including C4, The Pile, and RedPajama, which span both domain-specific and broad web-scraped data. The analysis surfaces numerous notable characteristics and inconsistencies within these datasets:

  1. Data Statistics and Quality: In the RedPajama and LAION-2B-en datasets, about 50% of documents are duplicates. Furthermore, many of the longest documents are either pure duplicates or documentation copied from open sources.
  2. Community and Societal Factors: A notable fraction of documents contain toxic language and personally identifiable information (PII), raising concerns about ethical use and privacy risks. For example, approximately 4 billion phone numbers were detected in mC4-en.
  3. Benchmark Contamination: A crucial finding is that these training corpora contain instances of popular evaluation benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE, which undermines the reliability of model evaluation.
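
Exact-duplicate statistics like the roughly 50% figure for RedPajama can be computed by hashing document contents. The sketch below is an illustrative simplification (no normalization, in-memory only); the hash choice and function name are assumptions, not the paper's pipeline.

```python
import hashlib
from collections import Counter

def duplicate_fraction(docs):
    """Fraction of documents whose exact content appears more than once."""
    hashes = Counter(hashlib.sha256(d.encode("utf-8")).hexdigest() for d in docs)
    # Count every document that belongs to a cluster of size > 1.
    dup_docs = sum(c for c in hashes.values() if c > 1)
    return dup_docs / len(docs)

docs = ["a b c", "d e f", "a b c", "g h i"]
print(duplicate_fraction(docs))  # 0.5 -- two of the four documents share content
```

At corpus scale the same idea works by streaming hashes to disk and counting collisions, so only fingerprints, not full documents, must be held at once.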

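Contamination checks like the GLUE and SuperGLUE findings reduce to searching for benchmark instances inside the training corpus. A minimal exact-match version is sketched below; the paper distinguishes finer-grained contamination types (e.g., inputs only versus inputs with labels), which this string-containment check, an assumption of the sketch, does not capture.

```python
def contaminated(benchmark_examples, corpus_docs):
    """Flag benchmark examples that appear verbatim in any training document."""
    corpus_text = "\n".join(corpus_docs)
    return [ex for ex in benchmark_examples if ex in corpus_text]

train = [
    "Some unrelated web page text.",
    "The trophy doesn't fit in the suitcase because it is too big.",
]
bench = [
    "The trophy doesn't fit in the suitcase because it is too big.",
    "A fresh, unseen example.",
]
print(contaminated(bench, train))  # only the first benchmark example is flagged
```

In practice an inverted index makes this lookup fast enough to sweep entire benchmark suites against tens of terabytes of text.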
Implications and Future Directions

The paper's results underscore a critical need for rigorous analysis and documentation of the data used to train LMs. First, the research shows that duplicated and low-quality data are prevalent, suggesting that model performance may improve noticeably with better data curation. The paper implicitly argues for refined subsets over raw, larger datasets, provided deduplication and cleaning are performed thoroughly.

Second, the presence of toxic language and PII in these datasets could propagate biases and lead to privacy breaches. Future dataset-curation efforts must therefore emphasize ethical considerations, with guidelines and methods for filtering harmful or sensitive content.
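One common filtering approach is pattern-based PII redaction. The regexes below are deliberately simplistic illustrations (real detectors need far more robust rules and handle many more formats); the patterns, placeholder tokens, and function name are assumptions of this sketch, not WIMBD's detectors.

```python
import re

# Illustrative patterns only; real-world PII comes in many more formats.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact_pii(text):
    """Replace phone numbers and email addresses with placeholder tokens."""
    text = PHONE_RE.sub("<PHONE>", text)
    return EMAIL_RE.sub("<EMAIL>", text)

print(redact_pii("Call 555-123-4567 or write to jane.doe@example.com."))
# Call <PHONE> or write to <EMAIL>.
```

Running such detectors at corpus scale is exactly where WIMBD's counting primitive pays off: the same scan that redacts can also tally how much PII a corpus contains before any model is trained on it.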

From a methodological standpoint, the scalable count and search mechanisms can be adapted to newer, growing corpora and to modalities beyond text. Extending this framework to multimodal datasets or to other domains would be key to achieving a comprehensive understanding and responsible use of large datasets.

Conclusion

The paper offers valuable insights into the current state of the datasets that form the backbone of LM development. By providing a framework and initial findings, it contributes significantly to the discourse on dataset quality, ethical use, and transparency in AI. While the results spotlight pressing issues, they also chart a path for the AI community toward building more reliable and fair models, grounded in well-understood and appropriately curated data. The open-source release of WIMBD facilitates such advancements, allowing practitioners to proactively audit newly developed datasets. This work is pivotal in prompting a paradigm shift in which data, like models, is subject to meticulous documentation and evaluation.

Authors (13)
  1. Yanai Elazar
  2. Akshita Bhagia
  3. Ian Magnusson
  4. Abhilasha Ravichander
  5. Dustin Schwenk
  6. Alane Suhr
  7. Pete Walsh
  8. Dirk Groeneveld
  9. Luca Soldaini
  10. Sameer Singh
  11. Hanna Hajishirzi
  12. Noah A. Smith
  13. Jesse Dodge
Citations (65)