RITFIS: Robust input testing framework for LLMs-based intelligent software (2402.13518v1)
Abstract: NLP intelligent software increasingly depends on LLMs, underscoring the necessity of robustness testing. Current testing methods focus solely on the robustness of LLM-based software to prompts. Given the complexity and diversity of real-world inputs, studying the robustness of LLM-based software to comprehensive inputs (including both prompts and examples) is crucial for a thorough understanding of its performance. To this end, this paper introduces RITFIS, a Robust Input Testing Framework for LLM-based Intelligent Software. To our knowledge, RITFIS is the first framework designed to assess the robustness of LLM-based intelligent software against natural language inputs. Given a threat model and a prompt, the framework formulates the testing process as a combinatorial optimization problem: a goal function determines whether a test case is successful, perturbation operators create a transformation space around the original examples, and a series of search methods filters candidates that satisfy both the testing objective and the language constraints. With its modular design, RITFIS offers a comprehensive method for evaluating the robustness of LLM-based intelligent software, adapting 17 automated testing methods originally designed for Deep Neural Network (DNN)-based intelligent software to the LLM-based software testing scenario. Empirical validation demonstrates the effectiveness of RITFIS in evaluating LLM-based intelligent software. However, existing methods generally have limitations, especially when dealing with lengthy texts and structurally complex threat models. We therefore conducted a comprehensive analysis based on five metrics and provide insightful optimization strategies for testing methods, benefiting both researchers and everyday users.
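The testing process sketched in the abstract (goal function, perturbation-based transformation space, constrained search) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the actual RITFIS API: it assumes a black-box `predict(prompt, example)` interface returning a label, a synonym-substitution perturbation operator, and a simple single-substitution search; all names are hypothetical.

```python
# Hypothetical sketch of a robustness-testing loop for LLM-based software.
# None of these names come from RITFIS itself; they only illustrate the
# goal-function / transformation-space / search decomposition described above.
from typing import Callable, List, Optional


def goal_reached(predict: Callable[[str, str], str],
                 prompt: str, original: str, candidate: str) -> bool:
    """A test case succeeds if the perturbed example flips the software's output."""
    return predict(prompt, candidate) != predict(prompt, original)


def single_word_candidates(example: str, position: int,
                           substitutes: Callable[[str], List[str]]) -> List[str]:
    """Build the transformation space for one word position (e.g., synonym swaps)."""
    words = example.split()
    return [" ".join(words[:position] + [s] + words[position + 1:])
            for s in substitutes(words[position])]


def search(predict: Callable[[str, str], str],
           prompt: str,
           example: str,
           substitutes: Callable[[str], List[str]],
           constraint: Callable[[str, str], bool]) -> Optional[str]:
    """Enumerate single-word substitutions and return the first candidate that
    satisfies the language constraint (e.g., a semantic-similarity threshold)
    and meets the goal function -- a stand-in for the richer search methods
    (greedy, beam, population-based) that such frameworks adapt."""
    words = example.split()
    for pos in range(len(words)):
        for cand in single_word_candidates(example, pos, substitutes):
            if constraint(example, cand) and goal_reached(predict, prompt, example, cand):
                return cand  # successful test case found
    return None  # no successful test case within this (very small) search budget
```

In practice the search would be guided by a score (such as the drop in the model's confidence for the original label) rather than exhaustive enumeration, which is where the 17 adapted testing methods differ from one another.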