METAL: Metamorphic Testing Framework for Analyzing Large-Language Model Qualities (2312.06056v1)

Published 11 Dec 2023 in cs.SE, cs.AI, and cs.CL

Abstract: Large Language Models (LLMs) have shifted the paradigm of natural language data processing. However, their black-box and probabilistic characteristics pose potential risks to output quality across diverse LLM applications. Recent studies have tested Quality Attributes (QAs) of LLMs, such as robustness or fairness, by generating adversarial input texts. However, existing studies cover only a limited range of QAs and tasks and are difficult to extend. Additionally, these studies have relied on a single evaluation metric, Attack Success Rate (ASR), to assess the effectiveness of their approaches. We propose the MEtamorphic Testing for Analyzing LLMs (METAL) framework to address these issues by applying Metamorphic Testing (MT) techniques. This approach enables systematic testing of LLM qualities by defining Metamorphic Relations (MRs), which serve as modularized evaluation metrics. The METAL framework can automatically generate hundreds of MRs from templates that cover various QAs and tasks. In addition, we introduce novel metrics that integrate the ASR method with the semantic qualities of text to assess the effectiveness of MRs accurately. Through experiments conducted with three prominent LLMs, we confirm that the METAL framework effectively evaluates essential QAs on primary LLM tasks and reveals quality risks in LLMs. Moreover, the newly proposed metrics can guide the selection of optimal MRs for testing each task and suggest the most effective method for generating MRs.
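The core idea of an MR as a modularized evaluation metric can be illustrated with a minimal sketch. The function names below (`classify_sentiment`, `mr_synonym_swap`, `run_mr`) are hypothetical stand-ins, not the paper's API: a real METAL-style test would query an actual LLM and use a semantic-similarity model rather than a toy classifier, but the shape of the check, perturb the input, re-query, and count relation violations as an ASR-style score, is the same.

```python
def classify_sentiment(text: str) -> str:
    """Toy stand-in for an LLM sentiment task; a real test would call the model under test."""
    return "positive" if "good" in text.lower() or "great" in text.lower() else "negative"

def mr_synonym_swap(text: str) -> str:
    """Metamorphic perturbation: replace a word with a near-synonym.

    The MR asserts the predicted label should be invariant under this change.
    """
    return text.replace("good", "great")

def run_mr(texts: list[str]) -> float:
    """Apply the MR to each input and return the violation rate (an ASR-style score)."""
    violations = 0
    for t in texts:
        source_out = classify_sentiment(t)               # output on the original input
        followup_out = classify_sentiment(mr_synonym_swap(t))  # output on the perturbed input
        if source_out != followup_out:                   # MR violated: label changed
            violations += 1
    return violations / len(texts)

score = run_mr(["The food was good.", "Service was bad."])
print(score)  # 0.0: the toy classifier satisfies this MR on both inputs
```

The paper's proposed metrics go further than a raw violation rate by weighting violations with the semantic quality of the perturbed text; that component is omitted here for brevity.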

