Evaluating the Impact of Text Quality in Web-Crawled Corpora for LLM Training
Introduction to Corpus Quality and LLM Performance
The use of web-crawled corpora to train large language models (LLMs) is a cornerstone of recent advances in NLP. These corpora, often vast and unstructured, raise critical questions about the impact of data quality on model performance. Addressing this, the paper presents an extensive evaluation of four widely used web-crawled corpora, namely CC100, MaCoCu, mC4, and OSCAR, focusing on their qualitative characteristics and on the LLMs subsequently trained on them for eleven lower-resourced European languages.
Manual Evaluation: A Dive into Data Quality
The researchers examined the four corpora through a two-phase evaluation. In the manual phase, professional linguists assessed data quality using a multi-tiered annotation scheme, which revealed substantial qualitative differences among the corpora. MaCoCu and OSCAR emerged as the highest-quality sources, containing a greater share of publishable, coherent running text, whereas mC4 showed notable deficiencies, especially for Maltese, where a large fraction of the data was labeled with the wrong language or lacked coherence.
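To make the annotation output concrete, the sketch below aggregates per-document labels into a quality profile per corpus. The label names and counts are invented for illustration and do not reflect the paper's actual annotation tiers or figures.

```python
# Illustrative aggregation of manual quality annotations per corpus.
# Labels and counts below are made up; the paper's scheme and results differ.
from collections import Counter

# One quality label per sampled document, assigned by a linguist.
annotations = {
    "MaCoCu": ["publishable running text", "running text",
               "publishable running text", "non-running text"],
    "mC4":    ["wrong language", "non-running text",
               "running text", "wrong language"],
}

for corpus, labels in annotations.items():
    counts = Counter(labels)
    total = len(labels)
    # Share of documents per quality tier, most frequent first.
    profile = {label: f"{100 * n / total:.0f}%" for label, n in counts.most_common()}
    print(corpus, profile)
```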
Automatic Evaluation: Exploring LLM Performance
The paper then moved to an automatic evaluation, training LLMs on comparable segments of each corpus for a subset of five languages. Surprisingly, despite the quality disparities identified in the manual evaluation, models trained on CC100 achieved the best downstream-task performance, suggesting a complex relationship between raw data quality and LM efficacy. Data quality as judged by human annotators did not straightforwardly translate into training outcomes, underscoring the resilience of LMs to variation in data quality.
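A minimal sketch of such a controlled comparison follows: pretrain identically configured small masked LMs on an equal training budget drawn from each corpus, then fine-tune each on the same downstream tasks. The Hugging Face dataset identifiers, the XLM-R tokenizer, the model size, and the step budget are all illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of the controlled corpus comparison; not the paper's code.
# Dataset IDs/configs, tokenizer, model size, and step budget are assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # assumed shared tokenizer


def pretrain_on_corpus(name: str, load_kwargs: dict, max_steps: int = 10_000):
    """Pretrain a small masked LM on a streamed slice of one web-crawled corpus."""
    raw = load_dataset(split="train", streaming=True, **load_kwargs)

    # Drop corpus-specific metadata columns, keeping only tokenizer outputs.
    columns = list(next(iter(raw)).keys())
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=columns,
    )

    # Identical small architecture for every corpus, so the data is the only variable.
    model = RobertaForMaskedLM(
        RobertaConfig(
            vocab_size=tokenizer.vocab_size,
            num_hidden_layers=6,
            hidden_size=512,
            num_attention_heads=8,
            intermediate_size=2048,
        )
    )
    args = TrainingArguments(
        output_dir=f"lm-{name}",
        max_steps=max_steps,  # fixed training budget keeps corpora comparable
        per_device_train_batch_size=32,
        learning_rate=1e-4,
        logging_steps=500,
        report_to="none",
    )
    Trainer(
        model=model,
        args=args,
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer),
    ).train()
    return model


# One model per corpus for a single language (Maltese as an example). Each model
# would then be fine-tuned and scored on the same downstream tasks (e.g. POS, NER).
models = {
    "cc100": pretrain_on_corpus("cc100", {"path": "cc100", "lang": "mt"}),
    "mc4": pretrain_on_corpus("mc4", {"path": "mc4", "name": "mt"}),
    "oscar": pretrain_on_corpus(
        "oscar", {"path": "oscar", "name": "unshuffled_deduplicated_mt"}
    ),
}
```

Holding the tokenizer, architecture, and number of training steps constant leaves the corpus itself as the only varying factor, which is what makes the downstream comparison meaningful.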
Implications and Future Directions
The findings prompt a reevaluation of the criteria used to curate training corpora for LMs, since neither sheer data volume nor human-judged quality is a reliable proxy for downstream performance. The research shifts the discussion towards the nuanced dynamics between data quality and LM performance, challenging the assumption that higher-quality datasets invariably yield better models.
This work also sets the stage for future research into the mechanisms by which LMs adapt to, or even exploit, variation in data quality. It opens pathways for developing more robust models that can handle the intricacies of web-crawled corpora, particularly for lower-resourced languages that often suffer from data scarcity and quality issues.
In conclusion, the paper contributes a critical perspective to the ongoing discussion on optimizing data sources for LM training. It provides evidence-based insights that question established assumptions and pave the way for refining data curation practices and model training methodologies in natural language processing.