Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021 (2403.01196v1)
Abstract: Translation models for the specific domain of translating Covid data from English to Irish were developed for the LoResMT 2021 shared task. Domain adaptation techniques, using a Covid-adapted generic 55k corpus from the Directorate General of Translation, were applied. Fine-tuning, mixed fine-tuning and combined dataset approaches were compared with models trained on an extended in-domain dataset. As part of this study, an English-Irish dataset of Covid-related data, drawn from the Health and Education domains, was developed. The highest-performing model used a Transformer architecture trained with an extended in-domain Covid dataset. In the context of this study, we have demonstrated that extending an 8k in-domain baseline dataset by just 5k lines improved the BLEU score by 27 points.
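The three adaptation strategies compared in the abstract differ mainly in how the training corpora are assembled before training or continued training. The sketch below is a minimal, illustrative Python data-preparation step, not the authors' actual pipeline; the file names (`covid.en`/`covid.ga` for the in-domain Covid data, `dgt.en`/`dgt.ga` for the generic DGT corpus), the oversampling ratio, and the output prefixes are all assumptions made for illustration.

```python
import random
from pathlib import Path


def read_parallel(prefix: str) -> list[tuple[str, str]]:
    """Read a sentence-aligned parallel corpus stored as <prefix>.en / <prefix>.ga."""
    src = Path(f"{prefix}.en").read_text(encoding="utf-8").splitlines()
    tgt = Path(f"{prefix}.ga").read_text(encoding="utf-8").splitlines()
    assert len(src) == len(tgt), "source and target must be sentence-aligned"
    return list(zip(src, tgt))


def write_parallel(pairs: list[tuple[str, str]], prefix: str) -> None:
    """Write sentence pairs back out as <prefix>.en / <prefix>.ga."""
    Path(f"{prefix}.en").write_text("\n".join(s for s, _ in pairs) + "\n", encoding="utf-8")
    Path(f"{prefix}.ga").write_text("\n".join(t for _, t in pairs) + "\n", encoding="utf-8")


in_domain = read_parallel("covid")  # in-domain Covid lines (hypothetical file names)
generic = read_parallel("dgt")      # out-of-domain generic DGT lines

# 1) Fine-tuning: continue training an out-of-domain model on in-domain data only.
write_parallel(in_domain, "train.finetune")

# 2) Mixed fine-tuning: continue training on the out-of-domain data mixed with the
#    in-domain data, oversampled so the two corpora are roughly balanced.
oversample = max(1, len(generic) // len(in_domain))
mixed = generic + in_domain * oversample
random.shuffle(mixed)
write_parallel(mixed, "train.mixed")

# 3) Combined dataset: train a single model from scratch on the plain concatenation.
combined = generic + in_domain
random.shuffle(combined)
write_parallel(combined, "train.combined")
```

In a setup like the one described, each of these training sets would typically then be subword-encoded (e.g. with SentencePiece) before training a Transformer model, with translation quality compared on a held-out Covid test set using metrics such as BLEU.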
Authors: Séamus Lankford, Haithem Afli, Andy Way