
First Attempt at Building Parallel Corpora for Machine Translation of Northeast India's Very Low-Resource Languages (2312.04764v1)

Published 8 Dec 2023 in cs.CL

Abstract: This paper presents the first-ever bilingual parallel corpora for thirteen very low-resource languages of India, all from Northeast India, along with initial benchmark neural machine translation results for these languages. We intend to extend these corpora to include a large number of low-resource Indian languages and to integrate the effort with our prior work on African and American-Indian languages, creating corpora that cover a large number of languages from across the world.

Authors (5)
  1. Atnafu Lambebo Tonja
  2. Melkamu Mersha
  3. Ananya Kalita
  4. Olga Kolesnikova
  5. Jugal Kalita