A Tulu Resource for Machine Translation
Abstract: We present the first parallel dataset for English-Tulu translation. Tulu, a South Dravidian language, is spoken by approximately 2.5 million people, predominantly in southwestern India. We construct the dataset by adding human Tulu translations to the multilingual machine translation resource FLORES-200, and we use it to evaluate our English-Tulu machine translation model. To train the model, we leverage resources available for related South Dravidian languages, adopting a transfer learning approach that exploits similarities between high-resource and low-resource languages. This method enables training a machine translation system even when no parallel data exists between the source and target languages, thereby overcoming a significant obstacle in machine translation development for low-resource languages. As of September 2023, our English-Tulu system, trained without any parallel English-Tulu data, outperforms Google Translate by 19 BLEU points. The dataset and code are available at https://github.com/manunarayanan/Tulu-NMT.
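To make the reported BLEU gap concrete, the following minimal sketch shows how a corpus-level BLEU score can be computed and how two systems can be compared on the same references. It is an illustration under stated assumptions, not the evaluation setup used in the paper: it assumes whitespace tokenization, a single reference per sentence, and no smoothing, and the names `ngram_counts` and `corpus_bleu` as well as the English toy sentences are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale.

    Assumptions (not taken from the paper): whitespace tokenization,
    exactly one reference per hypothesis, and no smoothing.
    """
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-grams per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Toy comparison of two hypothetical systems against the same reference.
references = ["the cat sat on the mat"]
system_a = ["the cat sat on the mat"]
system_b = ["the cat sat on a mat"]

score_a = corpus_bleu(system_a, references)
score_b = corpus_bleu(system_b, references)
print(f"System A: {score_a:.2f}  System B: {score_b:.2f}  gap: {score_a - score_b:.2f}")
```

A difference "in BLEU points" is simply the difference between two such scores computed on the same test set. For reported results, a standardized tool is usually preferable to a hand-rolled scorer, since tokenization and smoothing choices change the numbers.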