A Tulu Resource for Machine Translation (2403.19142v1)
Abstract: We present the first parallel dataset for English-Tulu translation. Tulu, a member of the South Dravidian branch of the Dravidian language family, is spoken by approximately 2.5 million people, predominantly in southwestern India. We construct the dataset by integrating human translations into the multilingual machine translation resource FLORES-200, and we use it for evaluation while developing our English-Tulu machine translation model. To train the model, we leverage resources available for related South Dravidian languages, adopting a transfer learning approach that exploits similarities between high-resource and low-resource languages. This method makes it possible to train a machine translation system even when no parallel data exists between the source and target languages, thereby overcoming a significant obstacle in machine translation development for low-resource languages. Our English-Tulu system, trained without any parallel English-Tulu data, outperformed Google Translate by 19 BLEU points as of September 2023. The dataset and code are available at https://github.com/manunarayanan/Tulu-NMT.