DeepGD: A Multi-Objective Black-Box Test Selection Approach for Deep Neural Networks

Published 8 Mar 2023 in cs.LG, cs.PF, and cs.SE | arXiv:2303.04878v5

Abstract: Deep neural networks (DNNs) are widely used in various application domains such as image processing, speech recognition, and natural language processing. However, testing DNN models may be challenging due to the complexity and size of their input domain. In particular, testing DNN models often requires generating or exploring large unlabeled datasets. In practice, DNN test oracles, which identify the correct outputs for given inputs, often require expensive manual effort to label test data, possibly involving multiple experts to ensure labeling correctness. In this paper, we propose DeepGD, a black-box multi-objective test selection approach for DNN models. It reduces the cost of labeling by prioritizing the selection of test inputs with high fault-revealing power from large unlabeled datasets. DeepGD not only selects test inputs with high uncertainty scores to trigger as many mispredictions as possible but also maximizes the probability of revealing distinct faults in the DNN model by selecting diverse mispredicted inputs. Experiments conducted on four widely used datasets and five DNN models show that, in terms of fault-revealing ability: (1) white-box, coverage-based approaches fare poorly, (2) DeepGD outperforms existing black-box test selection approaches in fault detection, and (3) DeepGD also provides better guidance for DNN model retraining when the selected inputs are used to augment the training set.
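The two objectives described above (uncertainty and diversity) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: DeepGD searches the trade-off with a genetic algorithm (NSGA-II), whereas the sketch below uses a simple greedy stand-in with equal weighting. The names `softmax_probs`, `features`, and `select_tests` are hypothetical; the Gini-based uncertainty score and nearest-neighbor diversity are common choices in this line of work, assumed here for illustration.

```python
def gini_uncertainty(probs):
    """Gini impurity of a softmax vector: high when the model is unsure."""
    return 1.0 - sum(p * p for p in probs)

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def select_tests(softmax_probs, features, budget):
    """Greedily pick `budget` inputs, trading off uncertainty and diversity.

    A greedy stand-in for the multi-objective (NSGA-II) search in the paper:
    each step picks the input maximizing its uncertainty plus its distance
    to the nearest already-selected input.
    """
    selected = []
    remaining = list(range(len(softmax_probs)))
    while remaining and len(selected) < budget:
        def score(i):
            unc = gini_uncertainty(softmax_probs[i])
            # Diversity: distance to the closest input chosen so far.
            div = min((euclidean(features[i], features[j]) for j in selected),
                      default=1.0)
            return unc + div  # equal weighting, assumed for illustration
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, with three pool inputs, the input whose softmax is closest to uniform (highest Gini score) is selected first, and subsequent picks favor inputs that are both uncertain and far from it in feature space.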
