
Word Closure-Based Metamorphic Testing for Machine Translation (2312.12056v2)

Published 19 Dec 2023 in cs.SE

Abstract: With the wide application of machine translation, the testing of Machine Translation Systems (MTSs) has attracted much attention. Recent works apply Metamorphic Testing (MT) to address the oracle problem in MTS testing. Existing MT methods for MTS generally follow a two-step workflow of input transformation and output relation comparison: they generate a follow-up input sentence by mutating the source input, then compare the source and follow-up output translations to detect translation errors. These methods use various input transformations to generate test case pairs and have successfully triggered numerous translation errors. However, they are limited in their ability to perform fine-grained and rigorous output relation comparison, and thus may report many false alarms and miss many true errors. In this paper, we propose a word closure-based output comparison method to address these limitations. We first propose the word closure as a new comparison unit, where each closure comprises a group of correlated input and output words in the test case pair. Word closures link each fragment of the source output translation to its counterpart in the follow-up output for comparison. Next, we compare semantics at the level of word closures to identify translation errors. In this way, we perform a fine-grained and rigorous semantic comparison of the outputs and thus identify violations more effectively. We evaluate our method with test cases generated by five existing input transformations and the translation outputs of three popular MTSs. Results show that our method significantly outperforms existing works in violation identification, improving both precision and recall and achieving an average increase of 29.9% in F1 score. It also increases the F1 score of translation error localization by 35.9%.
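As a rough illustration of the "word closure" idea, the sketch below groups correlated words from a test case pair into closures by taking connected components of a link graph. This is an assumption made for illustration, not the procedure from the paper: the abstract says only that a closure contains a group of correlated input and output words. The function name `word_closures`, the side labels (`src_in`, `fol_in`, `src_out`, `fol_out`), and the hand-written links are all hypothetical; in practice the links would presumably come from the input transformation and a word aligner.

```python
from collections import defaultdict


def word_closures(links):
    """Group linked words into closures (connected components).

    `links` is an iterable of pairs of nodes; each node is a
    (side, word) tuple. This grouping rule is an illustrative
    assumption, not the paper's definition.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical test case pair: "I like cats" mutated to "I like dogs",
# with made-up French outputs. Links are written by hand here.
links = [
    (("src_in", "cats"), ("fol_in", "dogs")),     # mutated word pair
    (("src_in", "cats"), ("src_out", "chats")),   # input-output alignment
    (("fol_in", "dogs"), ("fol_out", "chiens")),  # input-output alignment
    (("src_in", "like"), ("fol_in", "like")),     # unchanged word
    (("src_in", "like"), ("src_out", "aime")),
    (("fol_in", "like"), ("fol_out", "aime")),
]

closures = word_closures(links)
for closure in closures:
    print(closure)
```

Each resulting closure pairs a fragment of the source output with its counterpart in the follow-up output, so a later step could compare semantics closure by closure instead of sentence by sentence.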

Authors (4)
  1. Xiaoyuan Xie (10 papers)
  2. Shuo Jin (12 papers)
  3. Songqiang Chen (10 papers)
  4. Shing-Chi Cheung (54 papers)
