RITFIS: Robust input testing framework for LLMs-based intelligent software (2402.13518v1)

Published 21 Feb 2024 in cs.SE and cs.CL

Abstract: The dependence of NLP intelligent software on LLMs is increasingly prominent, underscoring the necessity for robustness testing. Current testing methods focus solely on the robustness of LLM-based software to prompts. Given the complexity and diversity of real-world inputs, studying the robustness of LLM-based software in handling comprehensive inputs (including prompts and examples) is crucial for a thorough understanding of its performance. To this end, this paper introduces RITFIS, a Robust Input Testing Framework for LLM-based Intelligent Software. To our knowledge, RITFIS is the first framework designed to assess the robustness of LLM-based intelligent software against natural language inputs. Given a threat model and a prompt, the framework formulates the testing process as a combinatorial optimization problem: a goal function determines successful test cases, perturbation operations create a transformation space around the original examples, and a series of search methods filter candidates that satisfy both the testing objectives and the language constraints. With its modular design, RITFIS offers a comprehensive method for evaluating the robustness of LLM-based intelligent software, adapting 17 automated testing methods originally designed for Deep Neural Network (DNN)-based intelligent software to the LLM-based software testing scenario. Empirical validation demonstrates the effectiveness of RITFIS in evaluating LLM-based intelligent software. However, existing methods generally show limitations, especially when dealing with lengthy texts and structurally complex threat models. We therefore conducted a comprehensive analysis based on five metrics and provide testing method optimization strategies that benefit both researchers and everyday users.
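
The optimization loop the abstract sketches maps naturally onto code. Below is a minimal illustrative sketch, not RITFIS's actual API: the victim model, the `synonym_swap` perturbation, the label-flip `goal_reached` goal function, the `length_constraint` stand-in for language constraints, and the `iterative_search` loop are all hypothetical names invented here for illustration.

```python
import random
from typing import Callable

# Hypothetical victim: an LLM-based classifier wrapped as text -> label.
Classifier = Callable[[str], str]


def synonym_swap(text: str, vocab: dict[str, list[str]]) -> list[str]:
    """Perturbation operator: build the transformation space by swapping
    one word at a time for a toy synonym; each swap yields one candidate."""
    words = text.split()
    candidates = []
    for i, word in enumerate(words):
        for syn in vocab.get(word.lower(), []):
            candidates.append(" ".join(words[:i] + [syn] + words[i + 1:]))
    return candidates


def goal_reached(model: Classifier, original: str, candidate: str) -> bool:
    """Goal function: a candidate is a successful test case if it flips
    the model's prediction relative to the original example."""
    return model(candidate) != model(original)


def length_constraint(original: str, candidate: str, tol: int = 3) -> bool:
    """Toy stand-in for the language constraints (fluency, semantics):
    keep candidates within a few words of the original length."""
    return abs(len(candidate.split()) - len(original.split())) <= tol


def iterative_search(model: Classifier, example: str,
                     vocab: dict[str, list[str]],
                     max_iters: int = 50) -> str | None:
    """Search method: repeatedly expand and filter the transformation
    space until the goal function accepts a candidate."""
    current = example
    for _ in range(max_iters):
        candidates = [c for c in synonym_swap(current, vocab)
                      if length_constraint(example, c)]
        if not candidates:
            return None
        for candidate in candidates:
            if goal_reached(model, example, candidate):
                return candidate  # successful test case
        # Real search methods rank candidates (e.g., by confidence drop);
        # a random step keeps this sketch short.
        current = random.choice(candidates)
    return None


if __name__ == "__main__":
    # Toy victim model: predicts "positive" iff the input mentions "good".
    model = lambda text: "positive" if "good" in text.lower() else "negative"
    vocab = {"good": ["decent", "fine"]}
    print(iterative_search(model, "a good movie overall", vocab))
```

In the abstract's modular framing, each of these pieces is a swappable component; the 17 adapted DNN-era testing methods presumably differ mainly in which perturbation operators and search strategies they supply.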
