The Past, Present, and Future of Typological Databases in NLP (2310.13440v1)
Abstract: Typological information has the potential to be beneficial in the development of NLP models, particularly for low-resource languages. Unfortunately, current large-scale typological databases, notably WALS and Grambank, are inconsistent both with each other and with other sources of typological information, such as linguistic grammars. Some of these inconsistencies stem from coding errors or linguistic variation, but many of the disagreements are due to the discrete categorical nature of these databases. We shed light on this issue by systematically exploring disagreements across typological databases and resources, and their uses in NLP, covering the past and present. We next investigate the future of such work, offering an argument that a continuous view of typological features is clearly beneficial, echoing recommendations from linguistics. We propose that such a view of typology has significant potential in the future, including in language modeling in low-resource scenarios.
Authors: Emi Baylor, Esther Ploeger, Johannes Bjerva