GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Abstract: The need for large text corpora has increased with the advent of pretrained LLMs and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned of noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general-domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it (including the pipeline, language identification model, and filters) available to the research community. Corpus v. 1.0: https://huggingface.co/datasets/cis-lmu/GlotCC-v1. Pipeline v. 3.0: https://github.com/cisnlp/GlotCC.