
What is "Typological Diversity" in NLP? (2402.04222v4)

Published 6 Feb 2024 in cs.CL

Abstract: The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world's languages. Aiming to extend this, an increasing number of papers aspires to enhance generalizable multilingual performance across languages. To this end, linguistic typology is commonly used to motivate language selection, on the basis that a broad typological sample ought to imply generalization across a broad range of languages. These selections are often described as being 'typologically diverse'. In this work, we systematically investigate NLP research that includes claims regarding 'typological diversity'. We find there are no set definitions or criteria for such claims. We introduce metrics to approximate the diversity of language selection along several axes and find that the results vary considerably across papers. Crucially, we show that skewed language selection can lead to overestimated multilingual performance. We recommend future work to include an operationalization of 'typological diversity' that empirically justifies the diversity of language samples.

Summary

  • The paper critically evaluates typological diversity claims in NLP literature and introduces metrics like mean pairwise syntactic distance (MPSD) to assess language selection.
  • It systematically reviews multilingual studies to reveal discrepancies between language sampling practices and actual linguistic diversity.
  • The authors recommend standardizing language selection reporting to prevent skewed generalizations and improve the generalizability of multilingual models.

Introduction

The NLP research community has long focused predominantly on English, with multilingual NLP a secondary concern. Recently, attention has shifted toward more languages, with an emphasis on evaluating multilingual models on a suitably diverse sample of the world's languages. A typologically broad sample is assumed to imply that results generalize across many languages, yet NLP research lacks a clear definition of 'typological diversity'. This paper critiques how 'typological diversity' claims are substantiated in the NLP literature and offers metrics for assessing the diversity of a language selection along several axes.

Survey Methodology

The paper presents a systematic analysis of how 'typological diversity' is used in NLP research. The authors surveyed the NLP literature for claims of typological diversity and examined the justification, if any, that each study gave for its language sample. They also formulated a set of metrics to approximate the diversity of the language selections in these studies. Inter-annotator agreement was used to check that the scope of each diversity claim and the introduction of datasets were annotated consistently. The review covered well-known conferences and journals, and where a paper gave multiple justifications for a typological diversity claim, each was annotated and discussed.

Analysis of Language Diversity

In assessing language diversity, the authors found considerable variation in what papers labelled 'typologically diverse'. They proposed mean pairwise syntactic distance (MPSD) and typological feature inclusion as approximate metrics. The data suggested a gap between the languages used in multilingual model evaluation and real-world linguistic diversity: language selections are skewed, and a skewed selection can lead to overestimated multilingual performance. The analysis also showed that simply adding more languages to a study does not necessarily increase its typological diversity; rather, researchers should select languages more carefully to improve the generalizability of their findings.
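To make the MPSD idea concrete, here is a minimal sketch of a mean pairwise distance over a language sample. It assumes each language is represented by a vector of syntactic features and that distance is cosine distance; the function names, the language labels, and the toy vectors are hypothetical and are not taken from the paper or from any typological database.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average the distance over every unordered pair of languages."""
    pairs = list(combinations(sorted(vectors), 2))
    if not pairs:
        raise ValueError("need at least two languages")
    total = sum(cosine_distance(vectors[a], vectors[b]) for a, b in pairs)
    return total / len(pairs)


# Toy binary syntactic feature vectors (invented for illustration only).
toy_vectors = {
    "lang_a": [1, 0, 1, 0],
    "lang_b": [1, 0, 1, 0],  # syntactically identical to lang_a
    "lang_c": [0, 1, 0, 1],  # shares no features with lang_a
}

two_languages = {k: toy_vectors[k] for k in ("lang_a", "lang_c")}
print(mean_pairwise_distance(two_languages))  # sample of two dissimilar languages
print(mean_pairwise_distance(toy_vectors))    # same sample plus a near-duplicate
```

In this toy setup, adding `lang_b` enlarges the sample but lowers the mean pairwise distance, which mirrors the observation that more languages do not automatically mean more typological diversity.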

Recommendations and Implications

The authors advocate that future research include an explicit operationalization of 'typological diversity' to avoid skewed generalizations. They recommend documenting language selection and reporting measures such as MPSD or typological feature inclusion, which would also improve our understanding of the linguistic challenges in multilingual NLP modelling. They note that the proposed metrics and tools are approximations, given that linguistic resources and linguistic understanding remain incomplete. The authors close with an ethical consideration: expanding NLP applications to many languages is not an inherently positive aim unless the sociocultural impact is taken into account.
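As an illustration of what reporting typological feature inclusion might look like, the sketch below computes, for each feature, the share of its possible values that appear in a language sample. This is one plausible reading of the metric, not the paper's exact definition; the feature names, value inventories, and language labels are hypothetical.

```python
def feature_value_inclusion(sample, inventory):
    """Return per-feature coverage and its unweighted mean.

    sample:    {language: {feature: value}}
    inventory: {feature: set of all possible values for that feature}
    """
    if not inventory:
        raise ValueError("inventory must contain at least one feature")
    per_feature = {}
    for feature, possible in inventory.items():
        if not possible:
            raise ValueError(f"feature {feature!r} has no possible values")
        # Values of this feature that occur anywhere in the sample.
        observed = {
            feats[feature] for feats in sample.values() if feature in feats
        }
        per_feature[feature] = len(observed & possible) / len(possible)
    overall = sum(per_feature.values()) / len(per_feature)
    return per_feature, overall


# Toy inventory and sample (invented for illustration only).
toy_inventory = {
    "word_order": {"SOV", "SVO", "VSO"},
    "adposition": {"preposition", "postposition"},
}
toy_sample = {
    "lang_x": {"word_order": "SVO", "adposition": "preposition"},
    "lang_y": {"word_order": "SVO", "adposition": "preposition"},
    "lang_z": {"word_order": "SOV", "adposition": "postposition"},
}

per_feature, overall = feature_value_inclusion(toy_sample, toy_inventory)
print(per_feature)
print(overall)
```

A sample can cover every value of one feature while missing values of another, so reporting the per-feature breakdown alongside the mean shows where a language selection is skewed.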

In conclusion, the paper highlights the problem of unsubstantiated 'typological diversity' claims in multilingual NLP literature and the pitfalls of assuming generalizability when that diversity has not been empirically justified. It emphasizes the importance of principled reporting on linguistic diversity and the need to refine the methods used to support claims of typological diversity. This matters both for how accurately multilingual model performance is estimated and for our understanding of the actual diversity present in the data on which such models are tested.
