
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets (2103.12028v4)

Published 22 Mar 2021 in cs.CL and cs.AI

Abstract: With the success of large-scale pre-training and multilingual modeling in NLP, recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Auditing Web-Crawled Multilingual Datasets: An Overview

The paper "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets" by Kreutzer et al. investigates the quality and reliability of multilingual datasets that have become a central component in NLP. The focus is on web-mined resources that claim to provide language data across hundreds of languages, a promise pivotal to advancing NLP for low-resource languages. The paper critiques major datasets such as CCAligned, ParaCrawl, WikiMatrix, OSCAR, and mC4, accounting for their drawbacks and instances of mislabeling, wrong language usage, non-linguistic content, and varying degrees of quality.

Key Findings

  • Prevalence of Low-Quality Data: The audit reveals a concerning amount of low-quality content, especially in corpora purporting to represent low-resource languages. At least 15 of the audited corpora contained no usable text at all, and 87 contained less than 50% sentences of acceptable quality.
  • Labeling and Language Code Errors: Across the audited datasets, numerous errors were identified, including nonstandard language codes, incorrect language labels, and ambiguity between languages and their supersets. The JW300 dataset notably contained 48 codes purporting to be sign languages that in fact held unrelated high-resource-language text. (A code-validation sketch follows this list.)
  • Challenges in Automatic Filtering: The paper emphasizes that open-source language identification (LangID) models are inadequate for filtering these datasets; even a Transformer-based LangID model could not provide a quick fix because of steep precision-recall trade-offs (see the second sketch below).
  • Correlation with Downstream Applications: The paper connects dataset quality to the performance of downstream applications such as translation models, finding a modest correlation between audited data quality and translation quality (illustrated in the third sketch below).
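
To make the language-code issue concrete, here is a minimal sketch of validating and normalizing corpus labels with the Python `langcodes` library. The tool choice and the example tags are illustrative assumptions, not part of the paper's methodology.

```python
# Illustrative sketch (not the paper's tooling): validate corpus language
# labels against BCP-47 and normalize deprecated tags with `langcodes`.
from langcodes import standardize_tag, tag_is_valid

corpus_labels = ["en", "iw", "mo", "zz"]  # "iw"/"mo" are deprecated; "zz" is unassigned

for tag in corpus_labels:
    if not tag_is_valid(tag):
        print(f"{tag}: not a valid BCP-47 tag -- flag for manual review")
    else:
        # e.g. deprecated "iw" normalizes to "he", and "mo" to "ro"
        print(f"{tag}: normalized to {standardize_tag(tag)}")
```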
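The precision-recall trade-off in automatic filtering shows up even in the simplest setup. Below is a hedged sketch using the publicly released fastText LangID model (lid.176.bin) with a confidence threshold; the threshold and sentences are invented, and this is not the Transformer model studied in the paper.

```python
# Sketch of confidence-thresholded LangID filtering with the public
# fastText model (lid.176.bin). Threshold and data are illustrative.
import fasttext

model = fasttext.load_model("lid.176.bin")  # pretrained 176-language LangID

def keep(sentence: str, expected: str, threshold: float = 0.7) -> bool:
    """Keep a sentence only if LangID agrees with the corpus label above
    `threshold`. Raising the threshold discards more noise (precision)
    but also more genuine low-resource text (recall)."""
    labels, probs = model.predict(sentence.replace("\n", " "), k=1)
    return labels[0] == f"__label__{expected}" and probs[0] >= threshold

corpus = ["Dies ist ein deutscher Satz.", "404 not found !!!", "English noise."]
print([s for s in corpus if keep(s, expected="de")])
```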
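To illustrate the kind of downstream analysis the last point refers to, here is a toy example of rank-correlating per-corpus audit quality with translation scores. The numbers are invented, not the paper's results.

```python
# Toy illustration (invented numbers): rank correlation between the
# audited usable fraction of each corpus and downstream BLEU.
from scipy.stats import spearmanr

usable_fraction = [0.95, 0.80, 0.40, 0.10]  # hypothetical audit results
bleu_scores = [31.2, 27.5, 14.8, 3.1]       # hypothetical MT quality

rho, p = spearmanr(usable_fraction, bleu_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```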

Implications

  • Impact on NLP for Low-Resource Languages: The findings highlight the risk of inaccurate data propagating into NLP tools. Mislabeled data can lead to incorrect conclusions about the capabilities of language models for underrepresented languages and misdirect research and resources.
  • Assessing Trust and Factuality: The potential for models to generate unreliable output from erroneous parallel data, as documented in the audit, warns against unverified reliance on automatically generated text and translations. These problems underscore the necessity of better mechanisms for verifying machine-generated content.
  • Future Directions: The authors recommend richer dataset documentation and data filtering techniques tailored to each dataset's characteristic errors. Raising awareness about data quality and promoting thorough human-in-the-loop auditing (a sampling sketch follows this list) can mitigate the risks associated with poor-quality releases; better language-agnostic filtering systems and improved LangID methods will likewise be pivotal to unlocking progress for low-resource languages.
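
As a concrete starting point for the human-in-the-loop auditing mentioned above, here is a minimal sketch of sampling lines for manual annotation and scoring the result. The label names only follow the spirit of the paper's annotation taxonomy, not its exact codes.

```python
# Sketch of a human-in-the-loop corpus audit: sample ~100 lines, have an
# annotator label each, and report the usable fraction.
import random

def sample_for_audit(corpus_path: str, n: int = 100, seed: int = 0) -> list[str]:
    """Draw a uniform random sample of non-empty lines for inspection."""
    with open(corpus_path, encoding="utf-8") as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    rng = random.Random(seed)
    return rng.sample(lines, min(n, len(lines)))

def usable_fraction(annotations: list[str]) -> float:
    """Fraction of sampled lines judged usable text in the labeled language
    (other labels might be 'wrong_language' or 'non_linguistic')."""
    return sum(a == "correct" for a in annotations) / len(annotations)
```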

Conclusion

The work of Kreutzer et al. makes a critical contribution to the NLP field by assessing the current state of multilingual datasets. It offers a cautionary perspective on the rapid scaling and deployment of models trained on web-crawled data without sufficient quality checks, and it underscores the need for data quality frameworks and transparent documentation practices. The paper encourages researchers to scrutinize and help refine filtering methodologies, ensuring that progress in NLP is both inclusive and accurate. The audit serves as a reminder of the ongoing challenges in equitably democratizing NLP advancements across all languages.

Authors (52)
  1. Julia Kreutzer (44 papers)
  2. Isaac Caswell (19 papers)
  3. Lisa Wang (11 papers)
  4. Ahsan Wahab (3 papers)
  5. Daan van Esch (11 papers)
  6. Nasanbayar Ulzii-Orshikh (2 papers)
  7. Allahsera Tapo (2 papers)
  8. Nishant Subramani (16 papers)
  9. Artem Sokolov (22 papers)
  10. Claytone Sikasote (6 papers)
  11. Monang Setyawan (1 paper)
  12. Supheakmungkol Sarin (4 papers)
  13. Sokhar Samb (3 papers)
  14. BenoƮt Sagot (60 papers)
  15. Clara Rivera (8 papers)
  16. Annette Rios (10 papers)
  17. Isabel Papadimitriou (13 papers)
  18. Salomey Osei (21 papers)
  19. Pedro Ortiz Suarez (15 papers)
  20. Iroro Orife (20 papers)
Citations (250)