
RedPajama: an Open Dataset for Training Large Language Models (2411.12372v1)

Published 19 Nov 2024 in cs.CL and cs.LG

Abstract: LLMs are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open LLMs. In this paper, we identify three core data-related challenges that must be addressed to advance open-source LLMs. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains and with their quality signals facilitate the filtering of data, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used in the training of strong LLMs used in production, such as Snowflake Arctic, Salesforce's XGen and AI2's OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only LLMs with up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing LLMs at scale.

Citations (13)

Summary

  • The paper introduces an open dataset for LLM training that details transparent data replication, preprocessing, and deduplication techniques.
  • It reports ablation studies with decoder-only models of up to 1.6 billion parameters and competitive benchmark results for RedPajama-INCITE at the 3-billion-parameter scale.
  • The work advances open-source AI research by standardizing dataset curation practices and promoting reproducibility in large language model development.

Overview of the RedPajama Dataset

The paper "RedPajama: An Open Dataset for Training Large Language Models" addresses critical challenges in LLM development, focusing on transparency, accessibility, and dataset curation. The research presents the RedPajama datasets as a comprehensive and transparent resource for developing open-source LLMs, aiming to fill the gap left by prominent models that disclose little about their dataset methodologies.

The RedPajama project comprises two primary releases: RedPajama-V1 and RedPajama-V2. RedPajama-V1 reproduces the training data used for the LLaMA models, drawing on subsets from diverse sources such as CommonCrawl, GitHub, C4, and arXiv. The paper highlights the uncertainties that arise when attempting to replicate these sources from the limited details in the original reports, and RedPajama-V1 addresses them by documenting the preprocessing and deduplication steps in detail so that the reconstruction is as reproducible as possible (an illustrative deduplication sketch follows).
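
As a rough illustration of what deduplication means in this context, the following minimal Python sketch removes exact duplicates by content hash. It is illustrative only: RedPajama-V1's actual pipeline applies the per-source preprocessing and deduplication procedures described in the paper, and the function and toy corpus here are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document once, dropping exact duplicates by content hash.

    Illustrative only: the paper's per-source preprocessing and
    deduplication are more involved than this sketch.
    """
    seen: set[bytes] = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# Toy usage on a three-document corpus.
corpus = ["the cat sat", "the cat sat", "a different document"]
print(list(exact_dedup(corpus)))  # ["the cat sat", "a different document"]
```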

RedPajama-V2 is a massive, web-only dataset accompanied by quality signals and metadata. Its purpose is to democratize dataset curation by releasing raw text together with quality signals that researchers can use to build subsets tailored to their needs. The corpus spans more than 100 trillion tokens across multiple languages, and the quality signals support filtering based on natural-language-processing heuristics; a hedged sketch of such filtering appears below.
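
To make the filtering workflow concrete, here is a minimal Python sketch of threshold-based filtering over per-document quality signals. The file layout, the "quality_signals" field, and the signal names and thresholds are illustrative assumptions rather than the dataset's documented schema; the released RedPajama-V2 metadata should be consulted for the actual keys and formats.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterator

# Hypothetical rules: signal names and thresholds are illustrative,
# not taken from the RedPajama-V2 documentation.
RULES: Dict[str, Callable[[float], bool]] = {
    "perplexity": lambda v: v < 400.0,  # keep lower-perplexity documents
    "word_count": lambda v: v >= 50,    # drop very short documents
}


def filter_documents(jsonl_path: str) -> Iterator[dict]:
    """Yield documents whose quality signals satisfy every rule.

    Assumes one JSON object per line with a flat "quality_signals"
    mapping of scalar values -- an assumed layout for illustration only.
    """
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            signals = record.get("quality_signals", {})
            if all(
                name in signals and check(signals[name])
                for name, check in RULES.items()
            ):
                yield record
```

In practice a curated subset would then be written back out or streamed into tokenization, with thresholds tuned per ablation rather than fixed in advance.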

Numerical Findings and Claims

The robustness of the RedPajama datasets is evidenced by their already broad adoption: they serve as foundational resources for production LLMs such as Snowflake Arctic and AI2's OLMo, indicating their practical utility. The paper presents empirical analyses using ablation studies with decoder-only models of up to 1.6 billion parameters, demonstrating that the quality signals are effective for selecting high-value data subsets and yield measurable improvements on downstream benchmarks. The benchmark results for RedPajama-INCITE, particularly at the 3-billion-parameter scale, are competitive with contemporary open models, suggesting that despite potential data mismatches from reconstruction ambiguities, RedPajama provides a strong baseline.

Implications and Future Directions

RedPajama's design principles of transparency, scale, and versatility significantly impact AI dataset curation practices, fostering a more open, understandable, and customizable framework for building LLMs. By releasing an extensive dataset with detailed documentation and metadata, the project advocates a standardized approach to dataset disclosure and management, a move likely to influence future work on AI openness and reproducibility.

The practical implications of RedPajama include wider accessibility for developers working at smaller scales, supporting diversified LLM development across industry sectors and research settings. Theoretically, the work pushes toward a deeper understanding of data-filtering heuristics in model training and of how quality signals can be used to select data subsets that improve model performance.

Future research may extend beyond the constraints discussed, such as evaluating RedPajama at larger scales or applying it to more complex benchmark scenarios. Additionally, addressing the absence of comprehensive decontamination and privacy safeguards within the dataset remains an open task. Collectively, this work provides foundational insights and rich resources that are well positioned to shape the trajectory of open-source LLM research and its ecosystem.
