
Validating and Exploring Large Geographic Corpora (2403.08198v1)

Published 13 Mar 2024 in cs.CL

Abstract: This paper investigates the impact of corpus creation decisions on large multi-lingual geographic web corpora. Beginning with a 427 billion word corpus derived from the Common Crawl, three methods are used to improve the quality of sub-corpora representing specific language-country pairs like New Zealand English: (i) the agreement of independent language identification systems, (ii) hash-based deduplication, and (iii) location-specific outlier detection. The impact of each of these steps is then evaluated at the language level and the country level by using corpus similarity measures to compare each resulting corpus with baseline data sets. The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations. The evaluation shows that the validity of sub-corpora is improved with each stage of cleaning but that this improvement is unevenly distributed across languages and populations. This result shows how standard corpus creation techniques can accidentally exclude under-represented populations.
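
The first two cleaning steps named in the abstract, agreement between independent language identification systems and hash-based deduplication, can be illustrated with a minimal Python sketch. This is not the paper's implementation. The two identifiers (`toy_id_a`, `toy_id_b`) are hypothetical stand-ins for real language identification systems, and hashing whitespace-normalized, lowercased text with SHA-1 is an assumed choice that the abstract does not specify. The third step, location-specific outlier detection, is not covered here.

```python
import hashlib
from typing import Callable, Iterable, List, Optional

# A language identifier maps a text to a language label (assumed interface).
LangId = Callable[[str], str]


def agreed_language(text: str, identifiers: Iterable[LangId]) -> Optional[str]:
    """Return the label only if every identifier gives the same one."""
    labels = {identify(text) for identify in identifiers}
    return labels.pop() if len(labels) == 1 else None


def content_hash(text: str) -> str:
    """Hash a case-folded, whitespace-normalized copy of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def clean(docs: Iterable[str], identifiers: List[LangId], target: str) -> List[str]:
    """Keep documents that all identifiers label as `target`, dropping duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        if agreed_language(doc, identifiers) != target:
            continue  # the systems disagree, or agree on another language
        digest = content_hash(doc)
        if digest in seen:
            continue  # same normalized content was already kept
        seen.add(digest)
        kept.append(doc)
    return kept


# Hypothetical toy identifiers, for illustration only.
def toy_id_a(text: str) -> str:
    return "eng" if " the " in f" {text.lower()} " else "und"


def toy_id_b(text: str) -> str:
    return "eng" if text.isascii() else "und"


docs = [
    "The weather in Wellington is windy.",
    "the  weather in wellington is windy.",  # duplicate after normalization
    "Kia ora koutou katoa",                  # identifiers disagree
    "The café is open.",                     # identifiers disagree
]
cleaned = clean(docs, [toy_id_a, toy_id_b], target="eng")
```

Requiring agreement trades recall for precision: the last two documents are discarded because the toy systems disagree, which mirrors in miniature how a filtering step can remove text unevenly across languages and populations.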

Authors (1)
  1. Jonathan Dunn (28 papers)
