Evaluation and Improvement of Fault Detection for Large Language Models (2404.14419v2)
Abstract: LLMs have recently achieved significant success across various application domains, garnering substantial attention from different communities. Unfortunately, even the best LLMs still exhibit many *faults*, i.e., inputs they cannot predict correctly. Such faults harm the usability of LLMs in general and could introduce safety issues in reliability-critical systems such as autonomous driving systems. Quickly revealing these faults in the real-world datasets that LLMs may face is important but challenging, mainly because ground-truth labels are required and data labeling is costly in terms of time and human effort. To address this problem, the conventional deep learning testing field has proposed test selection methods that efficiently evaluate deep learning models by prioritizing faults. However, despite their importance, the usefulness of these methods on LLMs is unclear and underexplored. In this paper, we conduct the first empirical study to investigate the effectiveness of existing fault detection methods for LLMs. Experimental results on four different tasks (including both code tasks and natural language processing tasks) and four LLMs (e.g., LLaMA3 and GPT-4) demonstrate that simple methods such as Margin perform well on LLMs, but there is still considerable room for improvement. Based on this study, we further propose **MuCS**, a prompt **Mu**tation-based prediction **C**onfidence **S**moothing framework that boosts the fault detection capability of existing methods. Concretely, we propose multiple prompt mutation techniques to collect more diverse outputs for confidence smoothing. The results show that our framework significantly enhances existing methods, improving the test relative coverage by up to 70.53%.
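The abstract only outlines MuCS at a high level, but the pipeline it describes (mutate a prompt, query the LLM on each variant, smooth the resulting confidence vectors, then rank inputs with an uncertainty score such as Margin) can be illustrated with a minimal sketch. The mutation operator, the simple averaging used for smoothing, and all names below (`predict_proba`, `mutate`, `margin_score`, `prioritize`) are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
# Minimal sketch of mutation-based confidence smoothing plus Margin-based
# prioritization, under the assumptions stated above.
from typing import Callable, List
import numpy as np


def margin_score(probs: np.ndarray) -> float:
    """Margin = top-1 probability minus top-2 probability (lower = more uncertain)."""
    top2 = np.sort(probs)[-2:]
    return float(top2[1] - top2[0])


def smoothed_confidence(
    prompt: str,
    predict_proba: Callable[[str], np.ndarray],  # hypothetical: LLM wrapper returning class probabilities
    mutate: Callable[[str], List[str]],          # hypothetical: returns mutated prompt variants
) -> np.ndarray:
    """Average class-probability vectors over the original prompt and its mutations."""
    variants = [prompt] + mutate(prompt)
    probs = np.stack([predict_proba(p) for p in variants])
    return probs.mean(axis=0)  # assumed smoothing: simple averaging


def prioritize(
    prompts: List[str],
    predict_proba: Callable[[str], np.ndarray],
    mutate: Callable[[str], List[str]],
) -> List[int]:
    """Return test indices ordered from most to least likely to reveal a fault."""
    scores = [margin_score(smoothed_confidence(p, predict_proba, mutate)) for p in prompts]
    return list(np.argsort(scores))  # ascending margin: most uncertain inputs first
```

Under these assumptions, labeling only the top-ranked inputs returned by `prioritize` would concentrate the labeling budget on the inputs most likely to be faults, which is the goal of test selection.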
- GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
- Jehad Al Dallal and Anas Abdin. 2017. Empirical evaluation of the impact of object-oriented code refactoring on quality attributes: A systematic literature review. IEEE Transactions on Software Engineering 44, 1 (2017), 44–69.
- DeepAbstraction++: Enhancing Test Prioritization Performance via Combined Parameterized Boxes. In International Conference on Bridging the Gap between AI and Reality. Springer, 77–93.
- A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology (2023).
- Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208 (2023).
- How Robust is Google’s Bard to Adversarial Image Attacks? arXiv preprint arXiv:2309.11751 (2023).
- Mixcode: Enhancing code classification by mixup-based data augmentation. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 379–390.
- Evaluating Large Language Models in Class-Level Code Generation. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, 865–865.
- DeepGini: prioritizing massive tests to enhance the robustness of deep neural networks (ISSTA 2020). Association for Computing Machinery, New York, NY, USA, 177–188. https://doi.org/10.1145/3395363.3397357
- Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning. PMLR, 1050–1059.
- Adaptive test selection for deep neural networks (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 73–85. https://doi.org/10.1145/3510003.3510232
- On calibration of modern neural networks. In International Conference on Machine Learning. PMLR, 1321–1330.
- Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745 (2011).
- Test Optimization in DNN Testing: A Survey. ACM Trans. Softw. Eng. Methodol. (jan 2024). https://doi.org/10.1145/3643678
- Evaluating the robustness of test selection methods for deep neural networks. arXiv preprint arXiv:2308.01314 (2023).
- Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236 (2023).
- Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664 (2023).
- StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
- TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 20874–20886. https://proceedings.neurips.cc/paper_files/paper/2021/file/ae78510109d46b0a6eef9820a4ca95d6-Paper.pdf
- CCTest: Testing and Repairing Code Completion Systems. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 1238–1250. https://doi.org/10.1109/ICSE48619.2023.00110
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2024).
- The Scope of ChatGPT in Software Engineering: A Thorough Investigation. arXiv preprint arXiv:2305.12138 (2023).
- Test selection for deep learning systems. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 2 (2021), 1–22.
- Vukosi Marivate and Tshephisho Sefara. 2020. Improving short text classification through global augmentation methods. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction. Springer, 385–399.
- George A. Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (nov 1995), 39–41. https://doi.org/10.1145/219717.219748
- HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization. arXiv preprint arXiv:2402.16694 (2024).
- CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, Joaquin Vanschoren and Sai-Kit Yeung (Eds.). https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/a5bfc9e07964f8dddeb95fc584cd965d-Abstract-round2.html
- Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
- Towards a Big Data Curated Benchmark of Inter-project Code Clones. In 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, September 29 - October 3, 2014. IEEE Computer Society, 476–480. https://doi.org/10.1109/ICSME.2014.77
- Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html (2023).
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
- Classification of Short Texts by Deploying Topical Annotations. In Advances in Information Retrieval - 34th European Conference on IR Research, ECIR 2012, Barcelona, Spain, April 1-5, 2012. Proceedings (Lecture Notes in Computer Science, Vol. 7224), Ricardo Baeza-Yates, Arjen P. de Vries, Hugo Zaragoza, Berkant Barla Cambazoglu, Vanessa Murdock, Ronny Lempel, and Fabrizio Silvestri (Eds.). Springer, 376–387. https://doi.org/10.1007/978-3-642-28997-2_32
- Prioritizing Test Inputs for Deep Neural Networks via Mutation Analysis. In Proceedings of the 43rd International Conference on Software Engineering (Madrid, Spain) (ICSE ’21). IEEE Press, 397–409. https://doi.org/10.1109/ICSE43902.2021.00046
- A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 1–10.
- Dynamic Data Fault Localization for Deep Neural Networks. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1345–1357.
- Evaluating instruction-tuned large language models on code comprehension and generation. arXiv preprint arXiv:2308.01240 (2023).
- Sentiment Analysis in the Era of Large Language Models: A Reality Check. arXiv preprint arXiv:2305.15005 (2023).
- Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219 (2023).
- CertPri: certifiable prioritization for deep neural networks via movement cost in feature space. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1–13.
- Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022).
- Openmix: Exploring outlier samples for misclassification detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12074–12083.
- Rethinking confidence calibration for failure prediction. arXiv preprint arXiv:2303.02970 (2023).