Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks
The paper "Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks" by Nicolas Papernot et al. explores the vulnerabilities of Deep Neural Networks (DNNs) to adversarial samples and introduces a novel defense mechanism called defensive distillation. This work aims to improve the robustness and generalizability of DNNs, particularly in adversarial settings.
Background and Motivation
Recent studies have demonstrated the efficacy of DNNs on various machine learning problems, notably input classification. Despite their success, DNNs exhibit inherent vulnerabilities to adversarial samples: inputs intentionally crafted to induce incorrect outputs. Such attacks can severely compromise systems relying on DNNs, such as autonomous vehicles, biometric authentication, and content filters.
Conventional defenses against adversarial samples have included architectural modifications and regularization techniques, yet these methods often fall short of providing comprehensive protection. This paper presents defensive distillation, a procedure that leverages knowledge distilled from a neural network to enhance its resistance to adversarial perturbations. Originally proposed as a technique for compressing DNNs, distillation is adapted here as a security measure: it enforces smoother input-output mappings, reducing the network's sensitivity to small adversarial perturbations.
Methodology
The core concept of distillation involves training a neural network on soft targets (the probability vectors produced by a previously trained DNN) rather than on hard class labels. Soft targets retain additional knowledge about the relationships between classes, which can improve the training of the distilled DNN model.
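Concretely, both networks use a softmax layer parameterized by a temperature T, which converts the logits z_i(X) into class probabilities:

    F_i(X) = exp(z_i(X) / T) / Σ_j exp(z_j(X) / T)

Higher temperatures spread probability mass across classes, so the resulting soft labels expose the relative likelihoods of all classes instead of a single confident prediction.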
Key steps in the defensive distillation method (sketched in code after this list) are:
- Training an initial DNN F at an elevated temperature T in the softmax layer to produce smooth class probability vectors.
- Using these probability vectors as soft labels to train a distilled DNN Fd, again at temperature T.
- Setting the temperature back to 1 at test time, which sharpens the class probabilities into a near-discrete form.
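A minimal PyTorch sketch of this training procedure is shown below. The `teacher`, `student`, `loader`, and `optimizer` objects, the temperature value, and the numerical details are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as nnF

T = 20  # training temperature; illustrative -- the paper evaluates a range of values

def soft_probs(logits, temperature):
    # Temperature-scaled softmax: higher temperatures yield smoother probability vectors.
    return nnF.softmax(logits / temperature, dim=1)

def train_teacher(teacher, loader, optimizer, epochs):
    # Step 1: train the initial network F at temperature T on the hard labels.
    teacher.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            log_probs = torch.log(soft_probs(teacher(x), T) + 1e-12)
            loss = nnF.nll_loss(log_probs, y)
            loss.backward()
            optimizer.step()

def train_distilled(teacher, student, loader, optimizer, epochs):
    # Steps 2-3: train the distilled network Fd at the same temperature T,
    # using the teacher's soft probability vectors as training labels.
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                soft_labels = soft_probs(teacher(x), T)
            optimizer.zero_grad()
            student_log_probs = torch.log(soft_probs(student(x), T) + 1e-12)
            # Cross-entropy between the teacher's soft labels and the student's prediction.
            loss = -(soft_labels * student_log_probs).sum(dim=1).mean()
            loss.backward()
            optimizer.step()

# At test time the distilled network is used at temperature 1, i.e. a standard softmax:
#   predictions = nnF.softmax(student(x), dim=1)
```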
Results
The empirical evaluation of defensive distillation covers two datasets: MNIST and CIFAR10. For both, DNN architectures were trained with and without defensive distillation, and adversarial samples were crafted using the Jacobian-based saliency map attack introduced in the authors' prior work.
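The headline metric is the adversarial success rate: the fraction of crafting attempts that produce a misclassified sample within the attack's distortion budget. A hedged sketch of such a measurement, where `craft_adversarial` and `test_loader` are hypothetical placeholders for the attack implementation and the evaluation data, might look like this:

```python
import torch

def adversarial_success_rate(model, test_loader, craft_adversarial, num_samples=100):
    """Estimate how often the crafting algorithm fools `model`.

    `craft_adversarial(model, x, label)` is a placeholder for the attack; it is
    assumed to return a perturbed input, or None if no adversarial sample is
    found within the attack's distortion budget.
    """
    model.eval()
    successes, total = 0, 0
    for batch_x, batch_y in test_loader:
        for x, y in zip(batch_x, batch_y):
            if total == num_samples:
                return successes / total
            x_adv = craft_adversarial(model, x.unsqueeze(0), int(y))
            if x_adv is not None and model(x_adv).argmax(dim=1).item() != int(y):
                successes += 1
            total += 1
    return successes / max(total, 1)
```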
Adversarial Sample Resistance
Defensive distillation significantly reduces the success rate of adversarial sample crafting:
- For MNIST, the success rate drops from 95.89% to 0.45%
- For CIFAR10, it decreases from 87.89% to 5.11%
Sensitivity Reduction
Adversarial gradients, which capture the model's sensitivity to input perturbations, are reduced by factors of up to 10^30 when defensive distillation is applied. This reduction highlights the effectiveness of distillation in smoothing the DNN's learned function.
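One way to probe this sensitivity empirically is to measure the magnitude of the gradient of the predicted class probability with respect to the input. The snippet below is a small illustrative probe for a generic PyTorch classifier, not the paper's measurement code:

```python
import torch

def input_gradient_magnitude(model, x):
    """Mean absolute gradient of the top class probability w.r.t. the input x.

    Smaller values indicate a smoother learned function, meaning larger input
    perturbations are needed to change the predicted class.
    """
    x = x.clone().detach().requires_grad_(True)
    probs = torch.softmax(model(x), dim=1)
    # Back-propagate the probability of the predicted class to the input.
    probs.max(dim=1).values.sum().backward()
    return x.grad.abs().mean().item()
```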
Robustness
Robustness, measured as the minimum perturbation an adversary must introduce to cause a misclassification (the paper's formal definition is given after this list), increased substantially:
- For MNIST, the percentage of features needing alteration rose from 1.55% to 14.08%
- For CIFAR10, it increased from 0.39% to 2.57%
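Formally, the paper defines the robustness of a trained network F as the expected minimum perturbation needed to change its output, taken over the input distribution μ (paraphrased here in plain notation):

    ρ_adv(F) = E_{X~μ}[ Δ_adv(X, F) ],   where   Δ_adv(X, F) = arg min_{δX} { ||δX|| : F(X + δX) ≠ F(X) }

The percentages reported above express Δ_adv as the fraction of input features that must be altered.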
These figures underscore the enhanced resilience of distilled models: successful attacks require larger perturbations, which are consequently easier to detect and harder for an attacker to introduce in practice.
Discussion and Implications
The findings suggest that defensive distillation not only reduces the efficacy of adversarial attacks but also maintains, if not slightly improves, the classification accuracy of DNNs. The method does not necessitate significant architectural changes or induce substantial computational overhead, making it a practical addition to existing DNN frameworks.
Looking forward, defensive distillation could be extended to other machine learning models and adversarial crafting techniques. Additionally, exploring its applicability beyond classification, such as in regression or reinforcement learning tasks, could provide broader security benefits. The promising results emphasize the potential for defensive distillation to form a robust foundation against adversarial threats in deep learning.
In conclusion, the paper provides a compelling justification and rigorous evaluation for utilizing defensive distillation to harden DNNs against adversarial perturbations, offering a viable security enhancement while preserving model performance.