First Attempt at Building Parallel Corpora for Machine Translation of Northeast India's Very Low-Resource Languages (2312.04764v1)
Published 8 Dec 2023 in cs.CL
Abstract: This paper presents the creation of initial bilingual corpora for thirteen very low-resource languages of India, all from Northeast India. It also presents the results of initial translation efforts in these languages. It creates the first-ever parallel corpora for these languages and provides initial benchmark neural machine translation results for these languages. We intend to extend these corpora to include a large number of low-resource Indian languages and integrate the effort with our prior work with African and American-Indian languages to create corpora covering a large number of languages from across the world.
Authors: Atnafu Lambebo Tonja, Melkamu Mersha, Ananya Kalita, Olga Kolesnikova, Jugal Kalita