gaHealth: An English-Irish Bilingual Corpus of Health Data (2403.03575v1)

Published 6 Mar 2024 in cs.CL and cs.AI

Abstract: Machine Translation is a mature technology for many high-resource language pairs. However, in the context of low-resource languages, there is a paucity of parallel datasets available for developing translation models. Furthermore, the development of datasets for low-resource languages often focuses on simply creating the largest possible dataset for generic translation; the benefits of developing smaller in-domain datasets can easily be overlooked. To assess the merits of using in-domain data, a dataset for the specific domain of health was developed for the low-resource English to Irish language pair. Our study outlines the process used in developing the corpus and empirically demonstrates the benefits of using an in-domain dataset for the health domain. In the context of translating health-related data, models developed using the gaHealth corpus demonstrated a maximum BLEU score improvement of 22.2 points (40%) when compared with top-performing models from the LoResMT2021 Shared Task. Furthermore, we define linguistic guidelines for developing gaHealth, the first bilingual corpus of health data for the Irish language, which we hope will be of use to other creators of low-resource datasets. gaHealth is now freely available online and is ready to be explored for further research.
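
As a quick illustration of the evaluation metric behind the reported 22.2-point gain, the minimal Python sketch below computes a corpus-level BLEU score with the sacrebleu library. The Irish hypothesis and reference sentences are hypothetical placeholders, not drawn from gaHealth or the LoResMT2021 systems.

import sacrebleu

# Hypothetical system outputs and reference translations in Irish;
# placeholder data only, not from the gaHealth corpus.
hypotheses = [
    "Tá an vacsaín ar fáil saor in aisce.",
    "Ba chóir duit do lámha a ní go rialta.",
]
references = [
    "Tá an vacsaín ar fáil saor in aisce.",
    "Ba cheart duit do lámha a ní go rialta.",
]

# corpus_bleu takes the list of hypotheses and a list of reference
# streams (one inner list per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")  # scored on a 0-100 scale

BLEU scores are corpus-level and scale-sensitive, so a "22.2-point (40%) improvement" means the best gaHealth-trained model scored 22.2 BLEU points higher on this 0-100 scale than the top LoResMT2021 systems on the same health-domain test data.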

Authors (4)
  1. Séamus Lankford (17 papers)
  2. Haithem Afli (13 papers)
  3. Órla Ní Loinsigh (1 paper)
  4. Andy Way (46 papers)
Citations (8)