The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset (2303.03915v1)

Published 7 Mar 2023 in cs.CL and cs.AI

Abstract: As LLMs grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training LLMs as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) LLM. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

The paper "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset" presents an extensive effort by the BigScience workshop to create a massive multilingual text dataset, ROOTS, intended for pre-training LLMs. This endeavor responds to the increasing demand for large-scale, high-quality text datasets, particularly in multilingual contexts. Besides the dataset itself, the project emphasizes ethics, governance, and participation.

ROOTS is a 1.6TB corpus encompassing 59 languages, designed to train the 176-billion-parameter BLOOM LLM. The creation process involved contributions from a diverse international assembly of researchers dedicated to ensuring not only linguistic diversity but also ethical and participatory curation practices. The corpus was developed by aggregating data from multiple sources, categorized into community-selected language resources, pseudo-crawled data from various domain-specific websites, and existing processed datasets like OSCAR.

The methodology employed in constructing ROOTS incorporates several layers of data curation and preprocessing, including language-specific filters to remove non-natural-language content, deduplication to improve data quality, and tools for removing personally identifiable information. More than 500 unique datasets contributed to the corpus, which spans 46 natural languages and 13 programming languages.
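
As a rough illustration of how such a curation pipeline fits together, the sketch below chains a quality filter, PII scrubbing, and exact deduplication over a list of documents. It is a minimal, hypothetical reconstruction rather than the BigScience tooling itself: the helper names, the email regex, and the alphabetic-ratio threshold are all illustrative.

```python
# Minimal sketch of a document-curation pipeline in the spirit of ROOTS:
# quality filtering, PII scrubbing, then exact deduplication.
# Helper names and thresholds are illustrative, not the BigScience code.
import hashlib
import re

def looks_like_natural_language(text: str) -> bool:
    """Crude quality check: require a majority of alphabetic characters."""
    letters = sum(ch.isalpha() for ch in text)
    return len(text) > 0 and letters / len(text) > 0.5

def scrub_pii(text: str) -> str:
    """Toy PII scrub: mask strings that look like email addresses."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)

def clean_corpus(documents):
    """Yield filtered, scrubbed, exactly-deduplicated documents."""
    seen = set()
    for doc in documents:
        if not looks_like_natural_language(doc):
            continue                                  # drop non-natural-language content
        doc = scrub_pii(doc)                          # remove personally identifiable information
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                                  # drop exact duplicates
        seen.add(digest)
        yield doc

if __name__ == "__main__":
    raw = [
        "Contact me at jane@example.org for details.",
        "Contact me at jane@example.org for details.",  # duplicate
        "1234567890 !!! ### $$$",                       # mostly non-alphabetic
    ]
    print(list(clean_corpus(raw)))
```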

An analysis of the corpus reveals substantial effort to mitigate the biases and quality issues inherent in web-crawled data. For instance, the paper details filtering based on perplexity scores, character repetition, special-character ratios, and language identification to refine the included text. Deduplication lowers privacy risks and improves model generalization, following findings that duplicated training data encourages memorization and can artificially inflate evaluation metrics through train-test overlap.
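
The snippet below gives hypothetical implementations of two of these heuristics, a character-repetition ratio and a special-character ratio; the thresholds are invented, and the actual ROOTS filters additionally rely on language identification and perplexity scoring, which are omitted here.

```python
# Illustrative versions of two web-text quality heuristics mentioned in the paper:
# character repetition and special-character ratio. Thresholds are made up.
from collections import Counter

def character_repetition_ratio(text: str, n: int = 10) -> float:
    """Fraction of character n-grams accounted for by the most frequent ~1% of n-grams."""
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    top = sum(c for _, c in counts.most_common(max(1, len(counts) // 100)))
    return top / len(ngrams)

def special_character_ratio(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)

def passes_quality_filters(text: str) -> bool:
    """Keep documents that are neither overly repetitive nor symbol-heavy."""
    return (character_repetition_ratio(text) < 0.2
            and special_character_ratio(text) < 0.3)

print(passes_quality_filters("A normal sentence about multilingual corpora."))  # True
print(passes_quality_filters("$$$$ @@@@ #### !!!! ++++ ^^^^ ~~~~"))             # False
```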

Additionally, the paper provides a comprehensive breakdown of the dataset's composition, detailing each language's relative share. For example, English constitutes approximately 30% of the dataset, while Simplified Chinese accounts for about 16%. This diversity aims to improve the ability of LLMs to handle a wider range of languages and linguistic constructs.
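
For illustration only, the following snippet shows how per-language shares like the ones above could be derived from byte counts in a corpus manifest; the sizes in the manifest are placeholders, not the published ROOTS statistics.

```python
# Hypothetical example: deriving per-language shares from a manifest of byte counts.
# The sizes below are placeholders, not the actual ROOTS numbers.
manifest_gb = {"en": 485.0, "zh-Hans": 261.0, "fr": 208.0, "es": 175.0, "code": 163.0}

total = sum(manifest_gb.values())
for lang, size in sorted(manifest_gb.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:8s} {size:7.1f} GB  {100 * size / total:5.1f}%")
```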

The implications of this research extend both practically and theoretically. Practically, the ROOTS corpus enables the development of multilingual LLMs capable of understanding and generating text across many languages, potentially empowering machine learning applications worldwide. Theoretically, it offers insights into best practices for corpus creation, emphasizing ethical curation, participatory approaches, and comprehensive documentation.

Future developments might include refined data-scrubbing techniques that better address ethical concerns, enhanced participatory methods for data selection, and more sophisticated models able to exploit the nuanced features embedded in large multilingual datasets. The paper marks a significant step toward datasets that can support robust, comprehensive LLMs, paving the way for further work on multilingual NLP systems.

Authors (54)
  1. Hugo Laurençon (11 papers)
  2. Lucile Saulnier (10 papers)
  3. Thomas Wang (17 papers)
  4. Christopher Akiki (15 papers)
  5. Albert Villanova del Moral (6 papers)
  6. Teven Le Scao (18 papers)
  7. Chenghao Mou (7 papers)
  8. Eduardo González Ponferrada (2 papers)
  9. Huu Nguyen (12 papers)
  10. Jörg Frohberg (5 papers)
  11. Mario Šaško (4 papers)
  12. Quentin Lhoest (9 papers)
  13. Angelina McMillan-Major (8 papers)
  14. Stella Biderman (55 papers)
  15. Anna Rogers (27 papers)
  16. Francesco De Toni (5 papers)
  17. Giada Pistilli (10 papers)
  18. Olivier Nguyen (2 papers)
  19. Somaieh Nikpoor (3 papers)
  20. Maraim Masoud (7 papers)
Citations (150)