
Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages (2403.08693v1)

Published 13 Mar 2024 in cs.CL

Abstract: Large, curated, web-crawled corpora play a vital role in training LLMs (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European languages. Our approach is two-fold: first, we perform an intrinsic evaluation by performing a human evaluation of the quality of samples taken from different corpora; then, we assess the practical impact of the qualitative differences by training specific LMs on each of the corpora and evaluating their performance on downstream tasks. We find that there are clear differences in quality of the corpora, with MaCoCu and OSCAR obtaining the best results. However, during the extrinsic evaluation, we actually find that the CC100 corpus achieves the highest scores. We conclude that, in our experiments, the quality of the web-crawled corpora does not seem to play a significant role when training LMs.

Authors (7)
  1. Rik van Noord
  2. Taja Kuzman
  3. Peter Rupnik
  4. Nikola Ljubešić
  5. Miquel Esplà-Gomis
  6. Gema Ramírez-Sánchez
  7. Antonio Toral

Summary

Evaluating the Impact of Text Quality in Web-Crawled Corpora for LLM Training

Introduction to Corpus Quality and LLM Performance

The use of large web-crawled corpora to train LLMs (LMs) is a cornerstone of recent advances in NLP. These corpora, often vast and unstructured, raise critical questions about the role of data quality in LM performance. In response, this paper presents an extensive evaluation of four widely used web-crawled corpora, namely CC100, MaCoCu, mC4, and OSCAR, examining both their quality and the effect that quality has on LMs trained for eleven lower-resourced European languages.

Manual Evaluation: A Dive into Data Quality

The researchers examined the four corpora through a two-phase evaluation. In the manual (intrinsic) phase, professional linguists rated the quality of sampled documents using a multi-tiered annotation scheme. This phase revealed clear qualitative differences among the corpora: MaCoCu and OSCAR emerged as front-runners in quality, containing a higher share of publishable, coherent running text, whereas mC4 showed notable deficiencies, especially for Maltese, where a large fraction of the data was inaccurately labeled for language or lacked coherence.
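As a rough illustration of how such an intrinsic comparison can be aggregated, the sketch below tallies annotator judgements per corpus and reports the share of samples in each quality tier. This is a minimal sketch assuming a simple CSV of annotations; the tier names, file format, and column names are hypothetical and not taken from the paper.

```python
import csv
from collections import Counter, defaultdict

# Hypothetical quality tiers, loosely inspired by the kind of multi-tiered
# scheme described above; the actual labels used in the paper may differ.
TIERS = ["publishable", "understandable", "wrong_language", "incoherent"]

def tier_shares(path):
    """Read (corpus, language, label) rows and return, per corpus,
    the fraction of annotated samples falling into each quality tier."""
    counts = defaultdict(Counter)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["corpus"]][row["label"]] += 1
    shares = {}
    for corpus, c in counts.items():
        total = sum(c.values())
        shares[corpus] = {t: c[t] / total for t in TIERS}
    return shares

if __name__ == "__main__":
    # "annotations.csv" is a placeholder filename for the human annotations.
    for corpus, dist in tier_shares("annotations.csv").items():
        summary = ", ".join(f"{t}: {p:.1%}" for t, p in dist.items())
        print(f"{corpus}: {summary}")
```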

Automatic Evaluation: Exploring LLM Performance

The paper then moved to an automatic (extrinsic) evaluation, training dedicated LMs on comparable portions of each corpus for a subset of five languages and measuring their performance on downstream tasks. Surprisingly, despite the quality disparities found in the manual evaluation, models trained on CC100 achieved the highest downstream scores, pointing to a complex relationship between raw data quality and LM efficacy. Data quality as judged by human evaluators thus does not translate straightforwardly into LM performance, underscoring the resilience of LMs to variation in training-data quality.
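To make the extrinsic setup concrete, the following sketch continues masked-language-model training on raw text from one corpus with Hugging Face Transformers, producing a checkpoint that would then be fine-tuned and compared on downstream tasks, one model per corpus and language. The base model, hyperparameters, and file names are illustrative assumptions rather than the paper's exact configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL = "xlm-roberta-base"     # assumed multilingual encoder starting point
CORPUS = "corpus_sample.txt"   # placeholder: one corpus, one language, one doc per line

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)

# Tokenize the raw corpus text, truncating to the model's context length.
dataset = load_dataset("text", data_files={"train": CORPUS})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lm-one-corpus",      # one such run per corpus/language pair
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=1e-4,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("lm-one-corpus")      # later fine-tuned on downstream tasks
```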

Implications and Future Directions

The findings prompt a reevaluation of the criteria used to curate training corpora for LMs: sheer data volume is not a proxy for quality, yet in these experiments higher measured quality did not yield better models either. The research shifts the discussion towards the nuanced relationship between data quality and LM performance, challenging the prevailing notion that higher-quality datasets invariably lead to superior model performance.

This work also sets the stage for future explorations into the mechanisms through which LMs can adapt to or leverage variances in data quality. It opens up pathways for developing more robust models that can efficiently handle the intricacies presented by web-crawled corpora, particularly for lower-resourced languages that often suffer from data paucity and quality issues.

In conclusion, this paper contributes a critical perspective to the ongoing discussion on selecting and optimizing data sources for LM training. Its evidence-based findings question established assumptions and point the way towards refining data curation practices and model training methodologies in natural language processing.
