Datasheet for the Pile (2201.07311v1)
Published 13 Jan 2022 in cs.CL
Abstract: This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile is comprised of 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.