
Machine Translation Testing via Syntactic Tree Pruning (2401.00751v1)

Published 1 Jan 2024 in cs.CL and cs.SE

Abstract: Machine translation systems have been widely adopted in daily life, making life easier and more convenient. Unfortunately, erroneous translations may result in severe consequences, such as financial losses. This calls for improving the accuracy and reliability of machine translation systems. However, testing machine translation systems is challenging because of the complexity and intractability of the underlying neural models. To tackle these challenges, we propose a novel metamorphic testing approach based on syntactic tree pruning (STP) to validate machine translation systems. Our key insight is that a pruned sentence should preserve the crucial semantics of the original sentence. Specifically, STP (1) proposes a core semantics-preserving pruning strategy based on basic sentence structure and dependency relations at the level of the syntactic tree representation; (2) generates source sentence pairs based on the metamorphic relation; and (3) reports suspicious issues whose translations break the consistency property, as measured by a bag-of-words model. We further evaluate STP on two state-of-the-art machine translation systems (i.e., Google Translate and Bing Microsoft Translator) with 1,200 source sentences as inputs. The results show that STP can accurately find 5,073 unique erroneous translations in Google Translate and 5,100 unique erroneous translations in Bing Microsoft Translator (400% more than state-of-the-art techniques), with 64.5% and 65.4% precision, respectively. The reported erroneous translations vary in type, and more than 90% of them cannot be found by state-of-the-art techniques. There are 9,393 erroneous translations unique to STP, which is 711.9% more than state-of-the-art techniques. Moreover, STP is quite effective at detecting translation errors in the original sentences, with recall reaching 74.0%, improving on state-of-the-art techniques by 55.1% on average.
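The consistency check in step (3) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace-style token lists, and the `threshold` parameter are all assumptions made for illustration. The idea shown is that a pruned sentence carries a subset of the original's meaning, so its translation should introduce few words that are absent from the original sentence's translation.

```python
from collections import Counter


def inconsistency(original_translation, pruned_translation):
    """Count tokens in the pruned translation that the original translation lacks.

    Both arguments are lists of already-tokenized target-language words
    (tokenization is assumed to happen elsewhere).
    """
    original_bag = Counter(original_translation)
    pruned_bag = Counter(pruned_translation)
    # Counter subtraction keeps only positive counts, i.e. the surplus tokens.
    surplus = pruned_bag - original_bag
    return sum(surplus.values())


def is_suspicious(original_translation, pruned_translation, threshold=0):
    """Flag the pair when the surplus exceeds a hypothetical threshold."""
    return inconsistency(original_translation, pruned_translation) > threshold


if __name__ == "__main__":
    original = ["the", "old", "man", "bought", "a", "red", "car"]
    consistent = ["the", "man", "bought", "a", "car"]
    inconsistent = ["the", "man", "sold", "a", "car"]
    print(is_suspicious(original, consistent))    # no surplus tokens
    print(is_suspicious(original, inconsistent))  # "sold" is a surplus token
```

A flagged pair is only a candidate issue; the reported precision figures indicate that some flagged pairs are false positives, so the threshold would need tuning in practice.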

Authors (7)
  1. Quanjun Zhang
  2. Juan Zhai
  3. Chunrong Fang
  4. Jiawei Liu
  5. Weisong Sun
  6. Haichuan Hu
  7. Qingyu Wang
