
Safety Verification of Deep Neural Networks (1610.06940v3)

Published 21 Oct 2016 in cs.AI, cs.LG, and stat.ML

Abstract: Deep neural networks have achieved impressive experimental results in image classification, but can surprisingly be unstable with respect to adversarial perturbations, that is, minimal changes to the input image that cause the network to misclassify it. With potential applications including perception modules and end-to-end controllers for self-driving cars, this raises concerns about their safety. We develop a novel automated verification framework for feed-forward multi-layer neural networks based on Satisfiability Modulo Theory (SMT). We focus on safety of image classification decisions with respect to image manipulations, such as scratches or changes to camera angle or lighting conditions that would result in the same class being assigned by a human, and define safety for an individual decision in terms of invariance of the classification within a small neighbourhood of the original image. We enable exhaustive search of the region by employing discretisation, and propagate the analysis layer by layer. Our method works directly with the network code and, in contrast to existing methods, can guarantee that adversarial examples, if they exist, are found for the given region and family of manipulations. If found, adversarial examples can be shown to human testers and/or used to fine-tune the network. We implement the techniques using Z3 and evaluate them on state-of-the-art networks, including regularised and deep learning networks. We also compare against existing techniques to search for adversarial examples and estimate network robustness.

Citations (903)

Summary

  • The paper introduces a region-based safety definition and manipulation ladder strategy to verify network stability under adversarial perturbations.
  • It integrates SMT solvers with layer-by-layer analysis, enabling scalable verification across models from MNIST to ImageNet.
  • Empirical results show the framework detects adversarial examples with minimal perturbations and lower overhead compared to methods like FGSM and JSMA.

Safety Verification of Deep Neural Networks

Introduction

The foundational problem addressed in this paper is the surprising instability of deep neural networks (DNNs) with respect to adversarial perturbations—small, often imperceptible modifications to input images that can cause a network to misclassify them. With applications in safety-critical domains like autonomous driving, this instability raises safety concerns that necessitate robust verification techniques.

Framework for Automated Verification

The authors propose a novel framework for verifying the safety of neural network classification decisions. This framework is grounded in Satisfiability Modulo Theory (SMT) and aims to ensure that the classification of a given image remains invariant to certain predefined image manipulations.
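Stated a little more formally (this is a hedged restatement of the abstract, not the paper's exact notation; the symbols for the neighbourhood and the family of manipulations are introduced here for illustration), the property being verified is invariance of the decision over the region:

```latex
% Safety of the decision for input x: no point reachable from x inside the
% region eta(x) via manipulations from Delta receives a different class.
% (Notation assumed for illustration; the paper's definitions are per-layer.)
\text{safe}(N, x, \eta, \Delta)
  \;\Longleftrightarrow\;
  \forall\, x' \in \eta(x) \ \text{reachable from } x \text{ via } \Delta:\quad
  \arg\max_{j} N_j(x') \;=\; \arg\max_{j} N_j(x)
```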

The key contributions can be summarized as follows:

  1. Region-based Safety Definition: Safety is defined with respect to a neighborhood around a given image. The region is specified to capture all reasonable perturbations that do not alter the human-perceived class of the image.
  2. Manipulations and Ladder Concept: The framework employs a notion of 'manipulations', which are predefined atomic perturbations of the input (or of a hidden-layer activation). Safety verification is an exhaustive search of the defined region using these manipulations, operationalized via 'ladders': sequences of activations obtained by applying manipulations one after another, starting from the activation of the original image, until the discretized region is covered (a sketch of this search follows the list).
  3. Layer-by-layer Analysis: To cope with the high dimensionality of DNNs, the analysis is propagated layer by layer, with regions and manipulations mapped between consecutive layers. This is what keeps the verification process scalable and manageable.
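The following is a minimal sketch, not the authors' implementation, of how a ladder-style exhaustive search over a discretized neighbourhood can be organised: starting from the original input, predefined manipulations are applied one after another, and the classification is checked at every reachable grid point. The `classify` interface, the step size, and the set of manipulable dimensions are assumptions made for illustration.

```python
import itertools
import numpy as np

def ladder_search(classify, x0, dims, step, radius):
    """Exhaustively explore a discretized neighbourhood of x0.

    classify : callable mapping an input vector to a class label (assumed)
    x0       : original input (numpy array)
    dims     : indices of the dimensions that manipulations may change
    step     : discretization step (the size of a minimal manipulation)
    radius   : half-width of the region per dimension, in number of steps
    Returns a counterexample (adversarial point) or None if the grid is safe.
    """
    original_class = classify(x0)
    # Each rung of a ladder applies one more manipulation; enumerating all
    # combinations of per-dimension offsets covers the whole discretized region.
    offsets = range(-radius, radius + 1)
    for combo in itertools.product(offsets, repeat=len(dims)):
        x = x0.copy()
        for d, k in zip(dims, combo):
            x[d] += k * step          # apply the accumulated manipulations
        if classify(x) != original_class:
            return x                   # adversarial example found
    return None                        # no decision change on this grid

if __name__ == "__main__":
    # Trivial linear "classifier", assumed purely for demonstration.
    classify = lambda v: int(v.sum() > 1.0)
    x0 = np.array([0.4, 0.4, 0.3])
    print(ladder_search(classify, x0, dims=[0, 1], step=0.05, radius=2))
```

The grid step plays the role of the paper's minimal manipulation: if manipulations are fine enough relative to how quickly the network's output can change within the region, checking the grid points suffices to certify the whole region, which is what underpins the guarantee that adversarial examples are found whenever they exist.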

Methodology and SMT Integration

The verification approach integrates with SMT solvers, particularly Z3, to handle the discretization and manipulation checks across high-dimensional spaces. This integration allows for an efficient and thorough exploration of possible adversarial perturbations within the specified region.
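As a rough illustration of how such a check can be phrased for an SMT solver (a toy sketch, not the paper's encoding; the two-input, one-hidden-layer network and its weights are invented for the example), one can encode the network's arithmetic and ask Z3 whether any point in the region flips the decision:

```python
from z3 import Real, RealVal, If, Solver, And, sat

def relu(e):
    return If(e > 0, e, RealVal(0))

# Toy network with hand-picked weights (illustrative only).
x1, x2 = Real('x1'), Real('x2')
h1 = relu(1.0 * x1 - 0.5 * x2 + 0.1)
h2 = relu(-0.3 * x1 + 0.8 * x2)
y0 = 0.7 * h1 - 0.2 * h2          # score of the original class
y1 = -0.4 * h1 + 0.9 * h2         # score of a competing class

# Region: an L_inf box of radius 0.05 around the point (0.6, 0.2),
# where this toy network assigns class 0 (y0 > y1).
cx, cy, r = 0.6, 0.2, 0.05
s = Solver()
s.add(And(x1 >= cx - r, x1 <= cx + r, x2 >= cy - r, x2 <= cy + r))
s.add(y1 >= y0)                   # ask for a point where the decision flips

if s.check() == sat:
    print("adversarial point in region:", s.model())
else:
    print("no decision change within the region (unsat)")
```

In the paper the solver is applied to discretized, layer-local queries rather than to a monolithic encoding of the whole network, which is what keeps the approach tractable on networks far larger than this toy example.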

Empirical Evaluation

The proposed framework was validated on several state-of-the-art image classification networks:

  1. Two-Dimensional Point Classification Network: Demonstrated exhaustive verification on a small, fully connected network trained to classify points relative to a curve.
  2. MNIST Handwriting Recognition: Applied the framework to a medium-sized convolutional network trained on the MNIST dataset.
  3. CIFAR-10: Evaluated a deeper, more complex network trained on the CIFAR-10 image dataset.
  4. ImageNet: Tackled safety verification on a large-scale network (e.g., VGG16), showcasing the ability to manage highly complex models and real-world image classifications.

Key Results and Comparisons

The verification framework proved effective in identifying adversarial examples across different datasets and network architectures. Notably, the results showed significant promise in:

  • Detecting adversarial examples by manipulating minimal dimensions in hidden layers.
  • Maintaining a high success rate in finding these examples with acceptable computational overhead.

Compared to contemporary methods such as FGSM and JSMA, the proposed framework was able to find adversarial instances with smaller perturbations, yielding tighter robustness estimates and more interpretable findings. The authors highlight the potential of their framework to reveal the adversarial landscape in a more granular and exhaustive manner.
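For context, a minimal sketch of the FGSM baseline (the standard formulation from Goodfellow et al., not taken from this paper) is shown below. FGSM perturbs every input dimension at once by the sign of the loss gradient, whereas the verification framework searches over a small number of manipulated dimensions, which helps explain why its counterexamples tend to involve smaller, more localized changes. The `model`, tensor shapes, and value range here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, label, eps):
    """Fast Gradient Sign Method: one gradient step in the sign direction
    of the loss, applied to every input dimension.

    model : a differentiable classifier returning logits (assumed)
    x     : batched input image tensor with values in [0, 1]
    label : ground-truth class index tensor
    eps   : perturbation magnitude
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # Every pixel moves by +/- eps, in contrast to the paper's manipulations,
    # which change only a small number of selected dimensions.
    x_adv = x + eps * x.grad.sign()
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```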

Implications and Future Directions

The framework's implications are far-reaching for the practical deployment of DNNs in safety-critical applications. By guaranteeing that adversarial examples are found whenever they exist within the specified region and family of manipulations, the approach aids both in fine-tuning network parameters and in providing a transparent mechanism for stakeholders to evaluate network behavior under perturbations.

Theoretically, the work provides a foundational basis for further exploration into more scalable verification techniques. Future developments might include enhancing the efficiency of the SMT-based searches, better heuristics for manipulation selection, and extending the framework to other types of neural network architectures beyond feed-forward models.

Conclusion

This paper presents a significant step forward in the automated verification of deep neural networks' safety. By integrating SMT solving with region-based analysis and manipulation strategies, the proposed framework offers a robust method for identifying and mitigating adversarial vulnerabilities, thereby contributing substantially to the reliability and safety of AI systems deployed in critical environments.