CANTONMT: Investigating Back-Translation and Model-Switch Mechanisms for Cantonese-English Neural Machine Translation (2405.08172v1)

Published 13 May 2024 in cs.CL and cs.AI

Abstract: This paper investigates the development and evaluation of Cantonese-to-English machine translation models and proposes a novel approach to low-resource language translation. The main objectives of the study are to develop a model that can effectively translate Cantonese to English and to evaluate it against state-of-the-art commercial models. To this end, a new parallel corpus was created by combining, preprocessing, and cleaning several corpora available online. In addition, a monolingual Cantonese dataset was collected through web scraping to support synthetic parallel corpus generation. Following data collection, several approaches were applied, including model fine-tuning, back-translation, and model switch. Translation quality was evaluated with multiple metrics, including lexicon-based metrics (SacreBLEU and hLEPOR) and embedding-space metrics (COMET and BERTScore). Based on the automatic metrics, the best model was selected and compared against the two best commercial translators using the human evaluation framework HOPES. The best model proposed in this investigation (NLLB-mBART) with the model-switch mechanism achieves automatic evaluation scores comparable to, and in some cases better than, state-of-the-art commercial models (Bing and Baidu Translators), with a SacreBLEU score of 16.8 on our test set. Furthermore, an open-source web application has been developed that lets users translate between Cantonese and English and compare the different trained models from this investigation. CANTONMT is available at https://github.com/kenrickkung/CantoneseTranslation
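
To make the back-translation step concrete, below is a minimal Python sketch of synthetic parallel-corpus generation, assuming a generic Hugging Face NLLB checkpoint and placeholder file names; it illustrates the general technique, not the authors' exact pipeline.

```python
# Hedged sketch: generating synthetic Cantonese-English pairs by translating
# scraped monolingual Cantonese with a baseline model (back-translation-style
# data augmentation). Checkpoint and file names are illustrative placeholders.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # stand-in for the paper's baseline model
    src_lang="yue_Hant",                       # Cantonese (Traditional script) in NLLB codes
    tgt_lang="eng_Latn",
)

with open("monolingual_yue.txt", encoding="utf-8") as f:
    cantonese_sentences = [line.strip() for line in f if line.strip()]

# Pair each monolingual source sentence with its machine translation to form
# synthetic parallel data.
synthetic_pairs = [
    (src, translator(src, max_length=256)[0]["translation_text"])
    for src in cantonese_sentences
]
```

Such synthetic pairs are then combined with the human-translated parallel corpus for fine-tuning, which is the standard use of back-translated data.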
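
For the lexicon-based evaluation, a small sketch of corpus-level scoring with the sacrebleu package (the hypothesis and reference strings here are invented examples, not paper data):

```python
# Minimal sketch of corpus-level SacreBLEU scoring with the sacrebleu package.
import sacrebleu

hypotheses = ["The weather is very hot today."]  # system outputs, one per segment
references = [["It is very hot today."]]         # each inner list is one full reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"SacreBLEU: {bleu.score:.1f}")            # the paper reports 16.8 on its test set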

