
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

Published 31 Oct 2024 in cs.CL and cs.AI (arXiv:2410.23825v2)

Abstract: The need for large text corpora has increased with the advent of pretrained LLMs and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community. Corpus v. 1.0 https://huggingface.co/datasets/cis-lmu/GlotCC-v1, Pipeline v. 3.0 https://github.com/cisnlp/GlotCC.

References (74)
  1. GlotSparse, 2023a. URL https://huggingface.co/datasets/cis-lmu/GlotSparse.
  2. GlotStoryBook, 2023b. URL https://huggingface.co/datasets/cis-lmu/GlotStoryBook.
  3. Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus. Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-9) 2021. Limerick, 12 July 2021 (Online-Event), pages 1–9, Mannheim, 2021. Leibniz-Institut für Deutsche Sprache. doi: 10.14618/ids-pub-10468. URL https://nbn-resolving.org/urn:nbn:de:bsz:mh39-104688.
  4. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. arXiv e-prints, art. arXiv:2201.06642, January 2022.
  5. AfroLID: A neural language identification tool for African languages. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1958–1981, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.128. URL https://aclanthology.org/2022.emnlp-main.128.
  6. SERENGETI: Massively multilingual language models for Africa. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 1498–1537, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.97. URL https://aclanthology.org/2023.findings-acl.97.
  7. MasakhaNEWS: news topic classification for African languages. ArXiv, 2023.
  8. MEGA: Multilingual evaluation of generative AI. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.258. URL https://aclanthology.org/2023.emnlp-main.258.
  9. MEGAVERSE: Benchmarking large language models across languages, modalities, models and tasks. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2598–2637, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.143. URL https://aclanthology.org/2024.naacl-long.143.
  10. Common voice: A massively-multilingual speech corpus. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France, May 2020. European Language Resources Association. URL https://aclanthology.org/2020.lrec-1.520.
  11. Building machine translation systems for the next thousand languages. arXiv preprint arXiv:2205.03983, 2022.
  12. Steven Bird. Decolonising speech and language technology. In Donia Scott, Nuria Bel, and Chengqing Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 3504–3519, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.313. URL https://aclanthology.org/2020.coling-main.313.
  13. Language contamination helps explain the cross-lingual capabilities of English pretrained models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3563–3574, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.233. URL https://aclanthology.org/2022.emnlp-main.233.
  14. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017. doi: 10.1162/tacl_a_00051. URL https://aclanthology.org/Q17-1010.
  15. Natural language processing with small feed-forward networks. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2879–2885, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1309. URL https://aclanthology.org/D17-1309.
  16. Ralf Brown. Language-aware string extractor, 2014. URL https://sourceforge.net/projects/la-strings/.
  17. Ralf D Brown. Finding and identifying text in 900+ languages. Digital Investigation, 9:S34–S43, 2012.
  18. An open dataset and model for language identification. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 865–879, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-short.75. URL https://aclanthology.org/2023.acl-short.75.
  19. No data to crawl? monolingual corpus creation from PDF files of truly low-resource languages in Peru. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2914–2923, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://www.aclweb.org/anthology/2020.lrec-1.356.
  20. Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus. In Donia Scott, Nuria Bel, and Chengqing Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 6588–6608, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.579. URL https://aclanthology.org/2020.coling-main.579.
  21. Together Computer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
  22. Unsupervised cross-lingual representation learning at scale. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747.
  23. Jonathan Dunn. Mapping languages: The corpus of global language use. Language Resources and Evaluation, 54:999–1018, 2020.
  24. Geographically-informed language identification. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7672–7682, Torino, Italia, May 2024. ELRA and ICCL. URL https://aclanthology.org/2024.lrec-main.678.
  25. IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=vfT4YuzAYA.
  26. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  27. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
  28. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
  29. Glottolog 5.0. Max Planck Institute for Evolutionary Anthropology, 2024. doi: 10.5281/zenodo.10804357. URL http://glottolog.org. Accessed 2024-06-01.
  30. Glot500: Scaling multilingual corpora and language models to 500 languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1082–1117, Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.61.
  31. Wanca in korp: Text corpora for underresourced Uralic languages. In Proceedings of the Research data and humanities (RDHUM) 2019 conference. University of Oulu, 2019a.
  32. Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65:675–782, 2019b.
  33. HeLI-OTS, off-the-shelf language identifier for text. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3912–3922, Marseille, France, June 2022. European Language Resources Association. URL https://aclanthology.org/2022.lrec-1.416.
  34. The state and fate of linguistic diversity and inclusion in the NLP world. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.560. URL https://aclanthology.org/2020.acl-main.560.
  35. FastText.zip: Compressing text classification models, 2016. URL https://arxiv.org/abs/1612.03651.
  36. Incorporating dialectal variability for socially equitable language identification. In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 51–57, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-2009. URL https://aclanthology.org/P17-2009.
  37. GlotLID: Language identification for low-resource languages. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6155–6218, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.410. URL https://aclanthology.org/2023.findings-emnlp.410.
  38. MEXA: Multilingual evaluation of english-centric LLMs via cross-lingual alignment, 2024a. URL https://arxiv.org/abs/2410.05873.
  39. MaskLID: Code-switching language identification through iterative masking. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 459–469, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-short.43. URL https://aclanthology.org/2024.acl-short.43.
  40. GlotScript: A resource and tool for low resource writing system identification. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7774–7784, Torino, Italia, May 2024c. ELRA and ICCL. URL https://aclanthology.org/2024.lrec-main.687.
  41. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10:50–72, 01 2022. ISSN 2307-387X. doi: 10.1162/tacl_a_00447. URL https://doi.org/10.1162/tacl_a_00447.
  42. MADLAD-400: A multilingual and document-level large audited dataset. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 67284–67296. Curran Associates, Inc., 2023. URL https://openreview.net/pdf?id=Y45ZCxslFx.
  43. Preparing the vuk’uzenzele and ZA-gov-multilingual South African multilingual corpora. In Rooweither Mabuya, Don Mthobela, Mmasibidi Setaka, and Menno Van Zaanen, editors, Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023), pages 18–25, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.rail-1.3. URL https://aclanthology.org/2023.rail-1.3.
  44. The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset. Advances in Neural Information Processing Systems, 35:31809–31826, 2022.
  45. Bloom library: Multimodal datasets in 300+ languages for a variety of downstream tasks. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8608–8621, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.590. URL https://aclanthology.org/2022.emnlp-main.590.
  46. DataComp-LM: In search of the next generation of training sets for language models. arXiv preprint arXiv:2406.11794, 2024.
  47. langid.py: An off-the-shelf language identification tool. In Min Zhang, editor, Proceedings of the ACL 2012 System Demonstrations, pages 25–30, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL https://aclanthology.org/P12-3005.
  48. Bhasha-abhijnaanam: Native-script and romanized language identification for 22 indic languages, 2023.
  49. Shervin Malmasi. Open-set language identification. arXiv preprint arXiv:1707.04817, 2017.
  50. Michael McCandless. Accuracy and performance of google’s compact language detector. Blog post, 2010.
  51. AI Meta. Introducing Meta Llama 3: The most capable openly available LLM to date. 2024. URL https://ai.meta.com/blog/meta-llama-3.
  52. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pages 220–229, 2019.
  53. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages, 2023.
  54. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France, May 2020. European Language Resources Association. URL https://aclanthology.org/2020.lrec-1.497.
  55. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672, 2022.
  56. Afriqa: Cross-lingual open-retrieval question answering for african languages, 2023.
  57. OpenAI. GPT-4o (May 13 version), 2024. URL https://openai.com/index/hello-gpt-4o/.
  58. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019, pages 9–16, Mannheim, 2019. Leibniz-Institut für Deutsche Sprache. doi: 10.14618/ids-pub-9021. URL http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215.
  59. Datatrove: large scale data processing, 2024a. URL https://github.com/huggingface/datatrove.
  60. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557, 2024b.
  61. Fabrice Prigent. UT1 Blacklists, 2024. URL https://github.com/olbat/ut1-blacklists.
  62. Scaling language models: Methods, analysis & insights from training gopher, 2021.
  63. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
  64. Compact language detector v3. 2018. URL https://chromium.googlesource.com/external/github.com/google/cld_3/.
  65. Nakatani Shuyo. Language detection library for java, 2010. URL https://github.com/shuyo/language-detection.
  66. AI4D–African Language Program. arXiv preprint arXiv:2104.02516, 2021.
  67. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024.
  68. Open-set recognition: A good closed-set classifier is all you need. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=5hLP5JY9S2d.
  69. Ccnet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4003–4012, 2020.
  70. NusaX: Multilingual parallel sentiment dataset for 10 Indonesian local languages, 2022.
  71. Caswell Wormer. Fun LangID, 2023. URL https://github.com/google-research/url-nlp/tree/main/fun_langid.
  72. Titus Wormer. Franc Library, 2014. URL https://github.com/wooorm/franc.
  73. mT5: A massively multilingual pre-trained text-to-text transformer. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41.
  74. Machine translation for low-resource Finno-Ugric languages. In Tanel Alumäe and Mark Fishel, editors, Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 762–771, Tórshavn, Faroe Islands, May 2023. University of Tartu Library. URL https://aclanthology.org/2023.nodalida-1.77.

Summary

  • The paper presents a novel corpus, GlotCC, that significantly enhances the availability and quality of minority language data extracted from CommonCrawl.
  • It introduces GlotLID v3.0, a robust language identification model that supports over 2000 language-script pairs and effectively manages web noise.
  • The comprehensive processing pipeline and rigorous self-audit yield high in-language accuracy, setting a new standard for multilingual NLP resources.

A Comprehensive Analysis of GlotCC: Advancing Corpus Resources for Minority Languages

The paper "GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages" addresses the pressing need for substantial, trustworthy text corpora for minority languages. It introduces GlotCC, a clean, document-level corpus derived from CommonCrawl, along with the system that produces it: the GlotLID language identification model and an open-source processing pipeline. Through these components and extensive evaluation, the work extends the accessibility and reliability of corpora to more than 1000 languages, a significant contribution to multilingual language technology research.

Core Contributions

  1. Corpus Development: GlotCC is a document-level, general-domain corpus covering an expansive range of languages, particularly minority ones, compiled through an improved language identification strategy and noise reduction. It addresses the scarcity of multilingual resources in low-resource contexts, which are often neglected in favor of a few dominant high-resource languages.
  2. Enhanced Language Identification: GlotLID v3.0 represents a substantial advance over existing language identification models such as FastText's LID and CLD3. By supporting more than 2000 language-script pairs and integrating dedicated noise-handling labels (e.g., "zxx" for non-linguistic content and "und" for undetermined text), GlotLID v3.0 offers higher accuracy and broader coverage. These improvements are pivotal for minimizing misidentification and ensuring cleaner data extraction from web sources.
  3. Pipeline and Filtering Innovations: The paper details a pipeline based on Ungoliant, extended to address the limitations of previous tools. New quality-control warnings and filters enforce content consistency and remove residual noise, yielding high-quality linguistic data free of prevalent web artifacts such as mojibake and mis-rendered PDFs.
  4. Self-Audit and Evaluation: The authors audit GlotCC by analyzing random samples from its language subcorpora. The audit confirms high in-language accuracy, with macro-average and median scores showing minimal misclassification, attesting to the robustness of the GlotLID model and the filtering process. Comparison with other LID models shows significant improvements in identifying minority languages.
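To make the filtering step concrete, the sketch below shows the kind of heuristics a web-corpus noise filter can apply: rejecting documents with mojibake artifacts (U+FFFD replacement characters, typical of bad decodings and mis-rendered PDFs) or with too few alphabetic characters. The specific rules and the 0.5 threshold are illustrative assumptions, not the paper's actual filter set.

```python
def looks_noisy(doc: str, min_alpha_ratio: float = 0.5) -> bool:
    """Heuristic noise check in the spirit of GlotCC-style filtering.

    Flags empty documents, documents containing U+FFFD (a common
    mojibake artifact), and documents whose share of alphabetic
    characters is too low. Thresholds here are illustrative only.
    """
    if not doc.strip():
        return True
    if "\ufffd" in doc:  # bytes that failed to decode become U+FFFD
        return True
    alpha = sum(ch.isalpha() for ch in doc)
    return alpha / len(doc) < min_alpha_ratio

print(looks_noisy("Grüße aus Köln, ein sauberer Satz."))  # clean text -> False
print(looks_noisy("Gr\ufffd\ufffde aus K\ufffdln"))       # mojibake -> True
print(looks_noisy("123 456 789 !!! ###"))                 # mostly symbols -> True
```

A real pipeline would combine many such signals (script consistency, line-length statistics, blocklists) and emit warnings rather than hard rejections in borderline cases.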

Quantitative Results

The GlotLID v3.0 model shows consistently robust performance across benchmarks, achieving an F1 score of 0.991 with a false positive rate of 0.000003 on the GlotTest evaluation set. Coverage of minority languages increases markedly, as evidenced by GlotCC's compilation statistics, which include more than 1275 LID labels. GlotLID v3.0 thus surpasses earlier LID models in coverage while retaining accuracy, making GlotCC more reliable as training data for multilingual LLMs.
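The two reported metrics follow the standard definitions over binary confusion counts, sketched below. The counts passed in are hypothetical, chosen only to illustrate the formulas; they are not taken from the paper.

```python
def f1_and_fpr(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Standard F1 and false-positive rate from confusion counts.

    F1  = 2*TP / (2*TP + FP + FN)
    FPR = FP / (FP + TN)
    """
    f1 = 2 * tp / (2 * tp + fp + fn)
    fpr = fp / (fp + tn)
    return f1, fpr

# Hypothetical counts for illustration only (not the paper's data).
f1, fpr = f1_and_fpr(tp=990, fp=3, fn=10, tn=999_997)
print(f"F1={f1:.3f}  FPR={fpr:.6f}")  # F1=0.993  FPR=0.000003
```

Note why a tiny FPR matters at CommonCrawl scale: with billions of candidate documents, even a per-document false positive rate of 1e-4 would flood a minority-language subcorpus with out-of-language text.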

Implications and Further Developments

The theoretical implications of GlotCC lie in its ability to unify language processing methodologies for low-resource languages by providing clean and labeled corpora. Practically, GlotCC broadens language inclusion in NLP tasks, facilitates the development of more sophisticated LLMs, and helps democratize AI applications across diverse linguistic landscapes.

The paper suggests several directions for future work. The authors aim to extend the corpus with additional CommonCrawl snapshots, continuously updating it and further enhancing coverage. Collaboration with linguistic communities could also help fine-tune language identification and data filtering, ensuring cultural accuracy and authenticity.

In conclusion, GlotCC and its underpinning systems mark a significant stride forward in corpus linguistics and language technology for minority languages. The research effectively bridges a critical resource gap, enabling broader LLM training while upholding ethical standards and focusing on inclusivity in the digital era.
