A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets (2403.03909v2)

Published 6 Mar 2024 in cs.CL

Abstract: Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP. Linguistic diversity of these data sets is typically measured as the number of languages or language families included in the sample, but such measures do not consider structural properties of the included languages. In this paper, we propose assessing linguistic diversity of a data set against a reference language sample as a means of maximising linguistic diversity in the long run. We represent languages as sets of features and apply a version of the Jaccard index suitable for comparing sets of measures. In addition to the features extracted from typological data bases, we propose an automatic text-based measure, which can be used as a means of overcoming the well-known problem of data sparsity in manually collected features. Our diversity score is interpretable in terms of linguistic features and can identify the types of languages that are not represented in a data set. Using our method, we analyse a range of popular multilingual data sets (UD, Bible100, mBERT, XTREME, XGLUE, XNLI, XCOPA, TyDiQA, XQuAD). In addition to ranking these data sets, we find, for example, that (poly)synthetic languages are missing in almost all of them.
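
As a rough illustration of the idea in the abstract, the sketch below compares the distribution of one feature in a data set against a reference sample. It assumes that the "version of the Jaccard index suitable for comparing sets of measures" is the weighted (min/max) form; the abstract does not spell out the formula, so this is an assumption rather than the authors' exact method. The language names, the feature values and the helper functions are all hypothetical.

```python
from collections import Counter


def weighted_jaccard(a, b):
    """Weighted (min/max) Jaccard similarity between two count mappings.

    Assumption: this stands in for the Jaccard variant mentioned in the
    abstract; the paper's exact definition may differ.
    """
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Two empty mappings are treated as identical.
    return numerator / denominator if denominator else 1.0


def missing_types(reference, dataset):
    """Feature values present in the reference but absent from the data set."""
    return sorted(k for k in reference if dataset.get(k, 0) == 0)


# Hypothetical toy samples: each language is mapped to one morphological type.
reference_sample = {
    "lang_a": "analytic",
    "lang_b": "synthetic",
    "lang_c": "polysynthetic",
    "lang_d": "synthetic",
}
dataset_sample = {
    "lang_x": "analytic",
    "lang_y": "analytic",
    "lang_z": "synthetic",
}

# Count how many languages fall into each feature value.
reference_counts = Counter(reference_sample.values())
dataset_counts = Counter(dataset_sample.values())

score = weighted_jaccard(reference_counts, dataset_counts)
missing = missing_types(reference_counts, dataset_counts)

print(f"diversity score against reference: {score:.2f}")
print(f"types missing from the data set: {missing}")
```

A score near 1 would mean the data set covers the reference distribution closely, and the list of missing values is what makes the score interpretable: here the toy data set contains no polysynthetic language, which mirrors the kind of gap the abstract reports.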
