DeepSample: DNN sampling-based testing for operational accuracy assessment (2403.19271v1)
Abstract: Deep Neural Networks (DNNs) are core components for classification and regression tasks in many software systems. Companies incur high costs for testing DNNs with datasets representative of the inputs expected in operation, as these need to be manually labelled. The challenge is to select a representative set of test inputs that is as small as possible, to reduce the labelling cost, while still sufficing to yield unbiased, high-confidence estimates of the expected DNN accuracy. At the same time, testers are interested in exposing as many DNN mispredictions as possible to improve the DNN. This results in a need for techniques pursuing a threefold aim: small dataset size, trustworthy estimates, and misprediction exposure. This study presents DeepSample, a family of DNN testing techniques for cost-effective accuracy assessment based on probabilistic sampling. We investigate whether, to what extent, and under which conditions probabilistic sampling can help to tackle the outlined challenge. We implement five new sampling-based testing techniques, and perform a comprehensive comparison of these techniques and of three further state-of-the-art techniques for both DNN classification and regression tasks. The results serve as guidance for the best use of sampling-based testing to obtain faithful and high-confidence estimates of DNN accuracy in operation at low cost.
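To make the idea of an unbiased estimate from a small, non-uniform sample concrete, the following is a minimal sketch of a generic Hansen-Hurwitz estimator for unequal-probability sampling with replacement. It is an illustration only and is not claimed to be any of the DeepSample techniques. The function names, the use of prediction confidence as the auxiliary variable, and the `eps` smoothing term are all assumptions introduced for this example.

```python
import random


def sampling_probs(confidences, eps=0.05):
    """Hypothetical helper: give low-confidence inputs a higher chance of selection.

    `eps` (an assumed smoothing term) keeps every probability strictly positive,
    which the estimator below needs in order to stay unbiased.
    """
    weights = [(1.0 - c) + eps for c in confidences]
    total = sum(weights)
    return [w / total for w in weights]


def hansen_hurwitz_accuracy(correct, probs, n, rng):
    """Estimate accuracy from n draws made with replacement.

    correct[i] is 1 if the DNN prediction on input i is right, else 0.
    In practice only the sampled inputs would need a manual label.
    probs[i] is the per-draw selection probability of input i (sums to 1).
    """
    population = len(correct)
    drawn = rng.choices(range(population), weights=probs, k=n)
    # Each drawn outcome is reweighted by 1 / probs[i], which offsets the
    # over-sampling of inputs that are likely to be mispredicted.
    estimated_total = sum(correct[i] / probs[i] for i in drawn) / n
    return estimated_total / population, drawn


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic stand-in for an operational dataset (assumed, not from the paper).
    confidences = [rng.uniform(0.5, 1.0) for _ in range(1000)]
    correct = [1 if rng.random() < c else 0 for c in confidences]
    probs = sampling_probs(confidences)
    estimate, drawn = hansen_hurwitz_accuracy(correct, probs, 100, rng)
    print("true accuracy:     ", sum(correct) / len(correct))
    print("estimated accuracy:", estimate)
    print("failures exposed:  ", sum(1 - correct[i] for i in set(drawn)))
```

The estimator is unbiased in expectation, but a single estimate can fall outside [0, 1] when the sample is small, so in practice it would be reported together with a confidence interval.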