Charles Translator: A Machine Translation System between Ukrainian and Czech (2404.06964v1)

Published 10 Apr 2024 in cs.CL

Abstract: We present Charles Translator, a machine translation system between Ukrainian and Czech, developed as part of a society-wide effort to mitigate the impact of the Russian-Ukrainian war on individuals and society. The system was built in the spring of 2022 with the help of many language data providers in order to quickly meet the demand for such a service, which was not available at the time in the required quality. The translator was later implemented as an online web interface and as an Android app with speech input, both featuring Cyrillic-Latin script transliteration. Unlike other available systems, which use English as a pivot, the system translates directly and thus takes advantage of the typological similarity of the two languages. It uses the block back-translation method, which allows for efficient use of monolingual training data. The paper describes the development process, including data collection and implementation; presents the evaluation; mentions several use cases; and outlines possibilities for the further development of the system for educational purposes.
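The block back-translation method mentioned in the abstract makes monolingual data usable for training by translating it with a reverse-direction model and then mixing the resulting synthetic sentence pairs with authentic parallel data in alternating, homogeneous blocks. The sketch below is a minimal illustration of such block-wise mixing, not the authors' implementation; the file names, tab-separated format, and block size are assumptions made only for this example.

```python
# Illustrative sketch of block back-translation data preparation (not the
# paper's code): authentic parallel data and synthetic data obtained by
# translating monolingual text with a reverse-direction model are written
# out in alternating homogeneous blocks, so the model sees each data type
# in contiguous chunks during training.
from itertools import islice

BLOCK_SIZE = 1_000_000  # hypothetical number of sentence pairs per block


def read_pairs(path):
    """Yield (source, target) sentence pairs from a tab-separated file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            src, tgt = line.rstrip("\n").split("\t")
            yield src, tgt


def blocks(pairs, size):
    """Group an iterator of pairs into lists of at most `size` pairs."""
    it = iter(pairs)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def interleave_blocks(authentic_path, synthetic_path, out_path):
    """Alternate blocks of authentic and back-translated (synthetic) data."""
    auth = blocks(read_pairs(authentic_path), BLOCK_SIZE)
    synth = blocks(read_pairs(synthetic_path), BLOCK_SIZE)
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            wrote_any = False
            for source in (auth, synth):
                block = next(source, None)
                if block:
                    wrote_any = True
                    for src, tgt in block:
                        out.write(f"{src}\t{tgt}\n")
            if not wrote_any:
                break


# Hypothetical usage for a Ukrainian->Czech setup:
# interleave_blocks("authentic.uk-cs.tsv", "backtranslated.uk-cs.tsv",
#                   "train.blocks.uk-cs.tsv")
```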

Authors (10)
  1. Martin Popel (14 papers)
  2. Lucie Poláková (2 papers)
  3. Michal Novák (8 papers)
  4. Jindřich Helcl (21 papers)
  5. Jindřich Libovický (36 papers)
  6. Pavel Straňák (1 paper)
  7. Tomáš Krabač (1 paper)
  8. Jaroslava Hlaváčová (2 papers)
  9. Mariia Anisimova (1 paper)
  10. Tereza Chlaňová (1 paper)
