CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages (2309.09400v1)

Published 17 Sep 2023 in cs.CL and cs.AI

Abstract: The driving factors behind the development of LLMs with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have been frequently made accessible to the public to foster deeper investigation and applications. However, when it comes to training datasets for these LLMs, especially the recent state-of-the-art models, they are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency for training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable dataset to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public in HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.

CulturaX: Enabling Transparent, Multilingual LLM Training

The paper "CulturaX: A Cleaned, Enormous, and Multilingual Dataset for LLMs in 167 Languages" introduces a significant contribution to the field of multilingual NLP, particularly in the context of training LLMs. As LLMs continue to revolutionize natural language processing applications, the paper recognizes the crucial role of large-scale, high-quality training datasets in optimizing model performance. However, it points out a prevalent issue in the field: the lack of transparency in the sources and compositions of datasets used in state-of-the-art LLMs, which stifles further research and understanding.

Dataset Composition and Methodology

CulturaX addresses these issues by introducing a new multilingual dataset comprising approximately 6.3 trillion tokens across 167 languages. This large-scale dataset is obtained by integrating and refining data from mC4 and OSCAR, two well-regarded web-sourced multilingual corpora, to cover a diverse range of languages at a significant scale. Notably, more than half of the dataset's tokens are dedicated to non-English languages, thereby emphasizing its multilingual nature.
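
Because CulturaX is released in full on HuggingFace (https://huggingface.co/datasets/uonlp/CulturaX), individual language portions can be pulled directly with the Hugging Face datasets library. The snippet below is a minimal sketch, assuming per-language configuration names and a "text" field as suggested by the dataset card; it is illustrative rather than code from the paper.

```python
# Minimal sketch: stream one language portion of CulturaX.
# The configuration name ("vi") and the "text" field are assumptions based on
# the dataset card, not details stated in the paper. Access may also require a
# Hub account and accepting the dataset's terms of use.
from datasets import load_dataset

culturax_vi = load_dataset(
    "uonlp/CulturaX",
    "vi",              # language code of the portion to load
    split="train",
    streaming=True,    # avoid downloading the full split up front
)

for i, example in enumerate(culturax_vi):
    print(example["text"][:200])
    if i >= 2:
        break
```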

The authors detail a meticulous pipeline for cleaning and deduplication to enhance data quality. This process involves several stages:

  • Language Identification: Replacing the cld3 tool used by mC4, which has significant language-detection limitations, with the FastText model, known for its superior accuracy.
  • URL-based Filtering: Utilizing a comprehensive URL blacklist to remove pages from toxic and unreliable sources.
  • Metric-based Cleaning: Filtering out outlier documents using metrics such as stopword ratio and perplexity, with thresholds set adaptively from the interquartile range of each metric's distribution.
  • Document Refinement: Eliminating residual document issues like footer lines and JavaScript code.
  • Data Deduplication: Implementing MinHashLSH for near-deduplication at the document level, which is critical for fostering generalization and reducing memorization in LLMs (illustrative sketches of key pipeline steps follow this list).
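
The summary does not include code, but the cleaning stages are straightforward to prototype. The sketch below illustrates, under stated assumptions, two of them: fastText-based language identification (using fastText's public lid.176.bin model) and metric-based cleaning with thresholds derived from the interquartile range. The stopword-ratio metric and the k = 1.5 multiplier are conventional illustrative choices, not values reported in the paper.

```python
# Illustrative sketch of language identification and IQR-based metric cleaning.
# Assumes fastText's public language-ID model (lid.176.bin) has been downloaded;
# the stopword list and the k = 1.5 multiplier are illustrative, not from the paper.
import fasttext
import numpy as np

lid_model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> tuple[str, float]:
    """Return (language code, confidence) for a document."""
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", ""), float(probs[0])

def stopword_ratio(text: str, stopwords: set[str]) -> float:
    tokens = text.lower().split()
    return sum(t in stopwords for t in tokens) / max(len(tokens), 1)

def iqr_bounds(values: list[float], k: float = 1.5) -> tuple[float, float]:
    """Adaptive lower/upper thresholds from the interquartile range."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def filter_by_metric(docs: list[str], stopwords: set[str]) -> list[str]:
    ratios = [stopword_ratio(d, stopwords) for d in docs]
    lo, hi = iqr_bounds(ratios)
    return [d for d, r in zip(docs, ratios) if lo <= r <= hi]
```

The final stage, document-level near-deduplication with MinHashLSH, can be sketched with the datasketch library. The 5-character shingles and the 0.8 similarity threshold below are example parameters, not the ones used to build CulturaX.

```python
# Illustrative near-deduplication with MinHashLSH (datasketch).
# Shingle size and similarity threshold are example values, not from the paper.
from datasketch import MinHash, MinHashLSH

def doc_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf-8"))
    return m

def near_dedup(docs: list[str]) -> list[str]:
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    kept = []
    for i, text in enumerate(docs):
        m = doc_minhash(text)
        if not lsh.query(m):      # no near-duplicate indexed yet
            lsh.insert(str(i), m)
            kept.append(text)
    return kept
```

Skipping any document whose MinHash already matches an indexed one keeps a single representative per near-duplicate cluster, which is the behavior the deduplication stage targets.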

Implications and Future Directions

CulturaX sets a new benchmark for transparency and accessibility in the training of multilingual LLMs. By making the full dataset publicly available, it helps democratize LLM development and supports research on attributing and mitigating model biases and hallucinations, as well as on improving cross-lingual model performance.

With CulturaX enabling more open examination of multilingual training data, the paper points to several avenues for future research: model architectures that better exploit such extensive multilingual data, and closer scrutiny of the ethical implications and biases inherent in large-scale data collection. Replicating the pipeline could also yield similar datasets for low-resource languages or specialized domains.

The paper’s meticulous approach raises the bar for dataset creation in NLP, encouraging more open and replicable research practices and benefiting both practical applications of NLP models and foundational work on LLM training.

In conclusion, CulturaX will likely serve as a cornerstone for future efforts to train comprehensive multilingual LLMs, offering a crucial resource to researchers seeking to push the boundaries of what LLMs can achieve in diverse linguistic contexts and applications.

Authors (8)
  1. Thuat Nguyen (2 papers)
  2. Chien Van Nguyen (6 papers)
  3. Viet Dac Lai (25 papers)
  4. Hieu Man (4 papers)
  5. Nghia Trung Ngo (8 papers)
  6. Franck Dernoncourt (161 papers)
  7. Ryan A. Rossi (124 papers)
  8. Thien Huu Nguyen (61 papers)
Citations (65)