
Cyber Risks of Machine Translation Critical Errors: Arabic Mental Health Tweets as a Case Study (2405.11668v1)

Published 19 May 2024 in cs.CL

Abstract: With the advent of Neural Machine Translation (NMT) systems, MT output has reached unprecedented accuracy levels, which has made MT tools ubiquitous on almost all online platforms with multilingual content. However, NMT systems, like other state-of-the-art generative AI systems, are prone to errors that are deemed machine hallucinations. The problem with NMT hallucinations is that they are remarkably fluent. Because NMT systems are trained to produce grammatically correct utterances, they can produce mistranslations that are too fluent to be recognised either by users of the MT tool or by the automatic quality metrics used to gauge their performance. In this paper, we introduce an authentic dataset of machine translation critical errors to highlight the ethical and safety issues involved in the common use of MT. The dataset comprises mistranslations of Arabic mental health postings manually annotated with critical error types. We also show that commonly used quality metrics do not penalise critical errors, and we highlight this as a critical issue that merits further attention from researchers.
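
The claim that overlap-based metrics do not penalise critical errors can be illustrated with a small sketch. The code below is not the paper's evaluation setup and does not use its dataset: `unigram_f1` is a hypothetical, simplified stand-in for word-overlap metrics, and the English sentences are invented for illustration. It shows that a fluent output that drops a negation can score higher than a paraphrase that keeps the meaning.

```python
from collections import Counter


def unigram_f1(reference: str, hypothesis: str) -> float:
    """Simplified word-overlap score (an assumption, not a real MT metric)."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((ref_counts & hyp_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences, not taken from the paper's dataset.
reference = "i am not feeling safe"
critical_error = "i am feeling safe"   # negation dropped: meaning reversed
paraphrase = "i do not feel safe"      # meaning preserved, different wording

print(f"critical error: {unigram_f1(reference, critical_error):.2f}")  # 0.89
print(f"paraphrase:     {unigram_f1(reference, paraphrase):.2f}")      # 0.60
```

In this toy case the meaning-reversing output receives the higher score because it shares more surface words with the reference, which is the kind of blind spot the abstract describes.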

Authors (3)
  1. Hadeel Saadany (14 papers)
  2. Ashraf Tantawy (8 papers)
  3. Constantin Orasan (33 papers)
