DeepKnowledge: Generalisation-Driven Deep Learning Testing (2403.16768v1)
Abstract: Despite their unprecedented success, DNNs are notoriously fragile to small shifts in data distribution, demanding effective testing techniques that can assess their dependability. Although DNN testing has advanced recently, there is a lack of systematic testing approaches that assess a DNN's capability to generalise and operate comparably on data beyond its training distribution. We address this gap with DeepKnowledge, a systematic testing methodology for DNN-based systems founded on the theory of knowledge generalisation, which aims to enhance DNN robustness and reduce the residual risk of 'black box' models. Conforming to this theory, DeepKnowledge posits that core computational DNN units, termed Transfer Knowledge neurons, can generalise under domain shift. DeepKnowledge provides an objective confidence measurement on the testing activities of a DNN under data distribution shifts, and uses this information to instrument a generalisation-informed test adequacy criterion that checks the transfer knowledge capacity of a test set. Our empirical evaluation of several DNNs, across multiple datasets and state-of-the-art adversarial generation techniques, demonstrates the usefulness and effectiveness of DeepKnowledge and its ability to support the engineering of more dependable DNNs. We report improvements of up to 10 percentage points over state-of-the-art coverage criteria for detecting adversarial attacks on several benchmarks, including MNIST, SVHN, and CIFAR.