Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions (2402.18025v2)
Abstract: How can LLMs process and translate endangered languages? Many languages lack a corpus large enough to train a decent LLM, so existing LLMs rarely perform well on unseen, endangered languages. In contrast, we observe that 2000 endangered languages, though lacking a large corpus, have a grammar book or a dictionary. We propose LINGOLLM, a training-free approach that enables an LLM to process unseen languages that hardly occur in its pre-training data. Our key insight is to demonstrate linguistic knowledge of an unseen language in the LLM's prompt, including a dictionary, a grammar book, and morphologically analyzed input text. We implement LINGOLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks across 8 endangered or low-resource languages. Our results show that LINGOLLM raises GPT-4's translation performance from 0 to 10.5 BLEU across 10 language directions. Our findings demonstrate the tremendous value of linguistic knowledge in the age of LLMs for endangered languages. Our data, code, and model generations can be found at https://github.com/LLiLab/LLM4endangeredlang.
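As a rough illustration of the key insight stated in the abstract, the sketch below assembles a single prompt from a grammar excerpt, dictionary entries, and a morphological analysis of the input. It is not the authors' implementation: the function name `build_prompt`, its parameters, the prompt layout, and the placeholder words are all assumptions made for this example.

```python
from typing import Dict, List


def build_prompt(
    source_tokens: List[str],
    morph_glosses: List[str],
    dictionary: Dict[str, str],
    grammar_excerpt: str,
    target_language: str = "English",
) -> str:
    """Assemble one prompt string from three kinds of linguistic description."""
    if len(source_tokens) != len(morph_glosses):
        raise ValueError("each source token needs exactly one gloss")

    # One line per word: the token, its morphological analysis, and a
    # dictionary entry when one exists (unknown words are flagged, not guessed).
    lines = []
    for token, gloss in zip(source_tokens, morph_glosses):
        entry = dictionary.get(token.lower(), "(no dictionary entry)")
        lines.append(f"{token}\t{gloss}\t{entry}")

    return "\n".join([
        "Grammar notes:",
        grammar_excerpt.strip(),
        "",
        "Source sentence: " + " ".join(source_tokens),
        "Word\tMorphological analysis\tDictionary entry",
        *lines,
        "",
        f"Translate the source sentence into {target_language}.",
    ])


# Placeholder data only; these are not words of any real language.
prompt = build_prompt(
    source_tokens=["foo", "barka"],
    morph_glosses=["foo[NOUN]", "bar[VERB]-ka[PAST]"],
    dictionary={"foo": "dog", "barka": "ran"},
    grammar_excerpt="Verbs follow their subject. The suffix -ka marks past tense.",
)
print(prompt)
```

The resulting string would then be sent to an LLM such as GPT-4 or Mixtral; no model call is shown here because the exact interface is outside what the abstract describes.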