
Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021 (2403.01196v1)

Published 2 Mar 2024 in cs.CL and cs.AI

Abstract: Translation models for the specific domain of translating Covid data from English to Irish were developed for the LoResMT 2021 shared task. Domain adaptation techniques, using a Covid-adapted generic 55k corpus from the Directorate General of Translation, were applied. Fine-tuning, mixed fine-tuning and combined dataset approaches were compared with models trained on an extended in-domain dataset. As part of this study, an English-Irish dataset of Covid-related data, from the Health and Education domains, was developed. The highest-performing model used a Transformer architecture trained with an extended in-domain Covid dataset. In the context of this study, we have demonstrated that extending an 8k in-domain baseline dataset by just 5k lines improved the BLEU score by 27 points.
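The abstract contrasts three adaptation strategies: plain fine-tuning, mixed fine-tuning, and a combined dataset. As a rough illustration of how mixed fine-tuning differs from simple concatenation, here is a minimal sketch of the data-preparation step, assuming the standard recipe of oversampling the small in-domain corpus when mixing it with out-of-domain data. The file names, corpus sizes, and oversampling factor are hypothetical placeholders, not the authors' actual setup.

import random
from pathlib import Path

def read_parallel(src_path, tgt_path):
    """Read a parallel corpus as (source, target) sentence pairs."""
    src = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    assert len(src) == len(tgt), "source/target line counts must match"
    return list(zip(src, tgt))

# Hypothetical file names: a large generic EN-GA corpus and a small
# in-domain Covid corpus.
generic = read_parallel("dgt_generic.en", "dgt_generic.ga")   # ~55k pairs
covid = read_parallel("covid_train.en", "covid_train.ga")     # ~8k-13k pairs

# Combined dataset approach: plain concatenation, no oversampling.
combined = generic + covid

# Mixed fine-tuning: oversample the small in-domain corpus so it carries
# roughly the same weight as the generic corpus during training.
factor = max(1, len(generic) // len(covid))
mixed = generic + covid * factor

random.seed(42)
random.shuffle(mixed)  # shuffle so batches interleave both domains

for name, pairs in [("combined", combined), ("mixed", mixed)]:
    Path(f"{name}_train.en").write_text(
        "\n".join(s for s, _ in pairs) + "\n", encoding="utf-8")
    Path(f"{name}_train.ga").write_text(
        "\n".join(t for _, t in pairs) + "\n", encoding="utf-8")
    print(f"{name}: {len(pairs)} sentence pairs")

Plain fine-tuning, by contrast, would continue training the generic model on the Covid pairs alone, which risks overfitting when the in-domain set is this small; mixing in the generic data is the usual guard against that.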

Authors (3)
  1. Séamus Lankford (17 papers)
  2. Haithem Afli (13 papers)
  3. Andy Way (46 papers)
Citations (12)
