
Sheffield's Submission to the AmericasNLP Shared Task on Machine Translation into Indigenous Languages (2306.09830v1)

Published 16 Jun 2023 in cs.CL

Abstract: In this paper, we describe the University of Sheffield's submission to the AmericasNLP 2023 Shared Task on Machine Translation into Indigenous Languages, which comprises translation from Spanish into eleven indigenous languages. Our approach consists of extending, training, and ensembling different variations of NLLB-200. We use data provided by the organizers and data from various other sources, such as constitutions, handbooks, news articles, and backtranslations generated from monolingual data. On the dev set, our best submission outperforms the baseline by 11% average chrF across all languages, with substantial improvements particularly for Aymara, Guarani and Quechua. On the test set, we achieve the highest average chrF of all the submissions and rank first in four of the eleven languages, and at least one of our submissions ranks in the top 3 for all languages.
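Since the abstract reports results as average chrF across languages, a rough illustration of that metric may help. The following is a simplified sketch of a character n-gram F-score, not the official shared-task scorer and not the authors' evaluation code. It assumes whitespace is ignored, n-gram orders 1 to 6, and beta = 2; the function names `char_ngrams`, `chrf`, and `average_chrf` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces (an assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip orders longer than either string.
            continue
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def average_chrf(scores_by_language):
    """Unweighted mean of per-language scores (hypothetical helper)."""
    return sum(scores_by_language.values()) / len(scores_by_language)
```

An identical hypothesis and reference score 100, and strings with no shared characters score 0. For reported numbers, an established scorer should be used instead of this sketch.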
