LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages (2405.08997v2)
Abstract: We propose a new paradigm for machine translation that is particularly useful for no-resource languages (those without any publicly available bilingual or monolingual corpora): LLM-RBMT (LLM-Assisted Rule Based Machine Translation). Using the LLM-RBMT paradigm, we design the first language education/revitalization-oriented machine translator for Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data. We present a detailed evaluation of the translator's components: a rule-based sentence builder, an OVP to English translator, and an English to OVP translator. We also discuss the potential of the paradigm, its limitations, and the many avenues for future research that it opens up.