Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
97 tokens/sec
GPT-4o
53 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Improving Vietnamese-English Medical Machine Translation (2403.19161v1)

Published 28 Mar 2024 in cs.CL

Abstract: Machine translation for Vietnamese-English in the medical domain is still an under-explored research area. In this paper, we introduce MedEV -- a high-quality Vietnamese-English parallel dataset constructed specifically for the medical domain, comprising approximately 360K sentence pairs. We conduct extensive experiments comparing Google Translate, ChatGPT (gpt-3.5-turbo), state-of-the-art Vietnamese-English neural machine translation models and pre-trained bilingual/multilingual sequence-to-sequence models on our new MedEV dataset. Experimental results show that the best performance is achieved by fine-tuning "vinai-translate" for each translation direction. We publicly release our dataset to promote further research.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (24)
  1. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  2. The IWSLT 2015 Evaluation Campaign. In Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 2–14.
  3. SeamlessM4T: Massively Multilingual & Multimodal Machine Translation. arXiv preprint, arXiv:2308.11596.
  4. PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4495–4503.
  5. CCAligned: A massive collection of cross-lingual web-document pairs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 5960–5969.
  6. Beyond english-centric multilingual machine translation. J. Mach. Learn. Res., 22(1).
  7. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
  8. Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations.
  9. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.
  10. MTet: Multi-domain Translation for English and Vietnamese. arXiv preprint, arXiv:2210.05610.
  11. A Fast and Accurate Vietnamese Word Segmenter. In Proceedings of the 11th International Conference on Language Resources and Evaluation, pages 2582–2587.
  12. A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association, pages 1726–1730.
  13. A Vietnamese-English Neural Machine Translation System. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association: Show and Tell, pages 5543–5544.
  14. KC4MT: A high-quality corpus for multilingual machine translation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5494–5502.
  15. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  16. Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191.
  17. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).
  18. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1351–1361.
  19. Rico Sennrich and Martin Volk. 2011. Iterative, MT-based sentence alignment of parallel texts. In Proceedings of the 18th Nordic Conference of Computational Linguistics, pages 175–182.
  20. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231.
  21. Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eight International Conference on Language Resources and Evaluation.
  22. Parallel corpora for medium density languages. In Proceedings of the International Conference Recent Advances in Natural Language Processing.
  23. VnCoreNLP: A Vietnamese natural language processing toolkit. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 56–60.
  24. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (5)
  1. Nhu Vo (1 paper)
  2. Dat Quoc Nguyen (55 papers)
  3. Dung D. Le (20 papers)
  4. Massimo Piccardi (21 papers)
  5. Wray Buntine (56 papers)

Summary

We haven't generated a summary for this paper yet.