Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages (2405.05376v2)

Published 8 May 2024 in cs.CL

Abstract: A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, has long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations -- 11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages -- the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 26 of 34 translation directions.
