Word Closure-Based Metamorphic Testing for Machine Translation (2312.12056v2)
Abstract: With the wide application of machine translation, the testing of Machine Translation Systems (MTSs) has attracted much attention. Recent works apply Metamorphic Testing (MT) to address the oracle problem in MTS testing. Existing MT methods for MTS generally follow a two-step workflow: input transformation, which generates a follow-up input sentence by mutating the source input, and output relation comparison, which compares the source and follow-up output translations to detect translation errors. These methods use various input transformations to generate test case pairs and have successfully triggered numerous translation errors. However, their output relation comparison is neither fine-grained nor rigorous, so they may report many false alarms and miss many true errors. In this paper, we propose a word closure-based output comparison method to address these limitations. We first propose the word closure as a new comparison unit, where each closure includes a group of correlated input and output words in the test case pair. Word closures link each fragment of the source output translation to its counterpart in the follow-up output, so that the two fragments can be compared. Next, we compare semantics at the level of word closures to identify translation errors. In this way, we perform a fine-grained and rigorous semantic comparison of the outputs and thus identify violations more effectively. We evaluate our method with the test cases generated by five existing input transformations and the translation outputs from three popular MTSs. Results show that our method significantly outperforms existing works in violation identification, improving both precision and recall and achieving an average increase of 29.9% in F1 score. It also increases the F1 score of translation error localization by 35.9%.
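The idea of grouping correlated input and output words can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that word-level links (for example from a word aligner) are already available and treats a closure simply as a connected component of those links. The node labels (`src_in`, `src_out`, `fol_in`, `fol_out`), the function name `word_closures`, and the toy links are all hypothetical.

```python
from collections import defaultdict


def word_closures(links):
    """Group linked words into closures (connected components).

    Assumption: `links` is an iterable of node pairs, where each node is a
    (sentence_role, word_index) tuple such as ("src_in", 2). How the links
    are obtained (e.g. by word alignment) is outside this sketch.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Toy test case pair: three-word source and follow-up sentences, where each
# input word is linked to one output word and to its follow-up counterpart.
links = []
for i in range(3):
    links.append((("src_in", i), ("src_out", i)))  # source input -> source output
    links.append((("fol_in", i), ("fol_out", i)))  # follow-up input -> follow-up output
    links.append((("src_in", i), ("fol_in", i)))   # source input -> follow-up input

closures = word_closures(links)
for closure in closures:
    print(closure)
```

Each resulting closure pairs a fragment of the source output with a fragment of the follow-up output; a semantic comparison of those two fragments would then decide whether the pair violates the expected relation. The comparison itself is omitted here because the abstract does not specify it.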
Authors: Xiaoyuan Xie, Shuo Jin, Songqiang Chen, Shing-Chi Cheung