GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text (2403.06399v3)
Abstract: Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, few existing resources provide large amounts of standardized, easily accessible IGT data, which limits its applicability to linguistic research and makes it difficult to use such data in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of labels across languages. Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%. Our pretrained model and dataset are available on Hugging Face.
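To make the IGT format described above concrete, here is a minimal Python sketch of a single interlinear example and the word-level alignment the format assumes between the segmented line and the gloss line. The field names and the Spanish example are illustrative assumptions, not the corpus's actual schema.

```python
# One IGT record: surface text, morpheme segmentation, Leipzig-style
# glosses, and a free translation. (Illustrative example, not corpus data.)
record = {
    "transcription": "los ninos corren",         # surface text
    "segmentation":  "los niño-s corr-en",       # morpheme-segmented line
    "glosses":       "DET.PL child-PL run-3PL",  # one gloss per morpheme
    "translation":   "the children run",         # free translation
}

def aligned(segmentation: str, glosses: str) -> bool:
    """Check the alignment IGT assumes: the segmented line and the gloss
    line have the same number of words, and each word pair has the same
    number of hyphen-separated morphemes."""
    seg_words = segmentation.split()
    gloss_words = glosses.split()
    if len(seg_words) != len(gloss_words):
        return False
    return all(w.count("-") == g.count("-")
               for w, g in zip(seg_words, gloss_words))

print(aligned(record["segmentation"], record["glosses"]))  # True
```

Automatic IGT generation, as explored in the paper, amounts to predicting the gloss line given the other fields, so alignment checks like this one are a natural sanity check on model output.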