Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks
The paper "Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks" by Nicolas Papernot et al. explores the vulnerabilities of Deep Neural Networks (DNNs) to adversarial samples and introduces a novel defense mechanism called defensive distillation. This work aims to improve the robustness and generalizability of DNNs, particularly in adversarial settings.
Background and Motivation
Recent studies have demonstrated the efficacy of DNNs on various machine learning problems, notably input classification. Despite their success, DNNs exhibit inherent vulnerabilities to adversarial samples: inputs intentionally crafted to induce incorrect outputs. Such attacks can severely compromise systems relying on DNNs, such as autonomous vehicles, biometric authentication, and content filters.
Conventional defenses against adversarial samples have included architectural modifications and regularization techniques, yet these methods often fall short of providing comprehensive protection. This paper presents defensive distillation, a procedure that leverages knowledge distilled from a neural network to enhance its resistance to adversarial perturbations. Originally proposed as a technique for compressing DNNs, distillation is adapted here as a security measure: it enforces smoother input-output mappings, reducing the network's sensitivity to small adversarial perturbations.
Methodology
The core concept of distillation involves training a neural network on soft targets (the probability vectors produced by a previously trained DNN) rather than on hard class labels. Soft targets retain additional knowledge about the relationships between classes, which can improve the training of the distilled DNN model.
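Concretely, both networks use a softmax layer parameterized by a temperature T, which converts the logits z_i(X) into class probabilities:

    F_i(X) = exp(z_i(X) / T) / Σ_j exp(z_j(X) / T)

Higher temperatures spread probability mass across classes, so the resulting soft labels expose the relative likelihoods of all classes instead of a single confident prediction.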
Key steps in the defensive distillation method (sketched in code after this list) are:
- Training an initial DNN F at an elevated temperature T in the softmax layer to produce smooth class probability vectors.
- Using these probability vectors as soft labels to train a distilled DNN Fd, again at temperature T.
- Setting the temperature back to 1 at test time, which sharpens the class probabilities into a near-discrete form.
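A minimal PyTorch sketch of this training procedure is shown below. The `teacher`, `student`, `loader`, and `optimizer` objects, the temperature value, and the numerical details are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as nnF

T = 20  # training temperature; illustrative -- the paper evaluates a range of values

def soft_probs(logits, temperature):
    # Temperature-scaled softmax: higher temperatures yield smoother probability vectors.
    return nnF.softmax(logits / temperature, dim=1)

def train_teacher(teacher, loader, optimizer, epochs):
    # Step 1: train the initial network F at temperature T on the hard labels.
    teacher.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            log_probs = torch.log(soft_probs(teacher(x), T) + 1e-12)
            loss = nnF.nll_loss(log_probs, y)
            loss.backward()
            optimizer.step()

def train_distilled(teacher, student, loader, optimizer, epochs):
    # Steps 2-3: train the distilled network Fd at the same temperature T,
    # using the teacher's soft probability vectors as training labels.
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                soft_labels = soft_probs(teacher(x), T)
            optimizer.zero_grad()
            student_log_probs = torch.log(soft_probs(student(x), T) + 1e-12)
            # Cross-entropy between the teacher's soft labels and the student's prediction.
            loss = -(soft_labels * student_log_probs).sum(dim=1).mean()
            loss.backward()
            optimizer.step()

# At test time the distilled network is used at temperature 1, i.e. a standard softmax:
#   predictions = nnF.softmax(student(x), dim=1)
```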
Results
The empirical evaluation of defensive distillation covers two datasets: MNIST and CIFAR10. For both, DNN architectures were trained with and without defensive distillation, and adversarial samples were crafted using the Jacobian-based saliency map attack introduced in the authors' prior work.
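The headline metric is the adversarial success rate: the fraction of crafting attempts that produce a misclassified sample within the attack's distortion budget. A hedged sketch of such a measurement, where `craft_adversarial` and `test_loader` are hypothetical placeholders for the attack implementation and the evaluation data, might look like this:

```python
import torch

def adversarial_success_rate(model, test_loader, craft_adversarial, num_samples=100):
    """Estimate how often the crafting algorithm fools `model`.

    `craft_adversarial(model, x, label)` is a placeholder for the attack; it is
    assumed to return a perturbed input, or None if no adversarial sample is
    found within the attack's distortion budget.
    """
    model.eval()
    successes, total = 0, 0
    for batch_x, batch_y in test_loader:
        for x, y in zip(batch_x, batch_y):
            if total == num_samples:
                return successes / total
            x_adv = craft_adversarial(model, x.unsqueeze(0), int(y))
            if x_adv is not None and model(x_adv).argmax(dim=1).item() != int(y):
                successes += 1
            total += 1
    return successes / max(total, 1)
```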
Adversarial Sample Resistance
Defensive distillation significantly reduces the success rate of adversarial sample crafting:
- For MNIST, the success rate drops from 95.89% to 0.45%
- For CIFAR10, it decreases from 87.89% to 5.11%
Sensitivity Reduction
Adversarial gradients, which capture the model's sensitivity to input perturbations, are reduced by factors of up to 10^30 when defensive distillation is applied. This reduction highlights the effectiveness of distillation in smoothing the DNN's learned function.
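One way to probe this sensitivity empirically is to measure the magnitude of the gradient of the predicted class probability with respect to the input. The snippet below is a small illustrative probe for a generic PyTorch classifier, not the paper's measurement code:

```python
import torch

def input_gradient_magnitude(model, x):
    """Mean absolute gradient of the top class probability w.r.t. the input x.

    Smaller values indicate a smoother learned function, meaning larger input
    perturbations are needed to change the predicted class.
    """
    x = x.clone().detach().requires_grad_(True)
    probs = torch.softmax(model(x), dim=1)
    # Back-propagate the probability of the predicted class to the input.
    probs.max(dim=1).values.sum().backward()
    return x.grad.abs().mean().item()
```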
Robustness
Robustness, measured as the minimum perturbation an adversary must introduce to cause a misclassification (the paper's formal definition is given after this list), increased substantially:
- For MNIST, the percentage of features needing alteration rose from 1.55% to 14.08%
- For CIFAR10, it increased from 0.39% to 2.57%
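Formally, the paper defines the robustness of a trained network F as the expected minimum perturbation needed to change its output, taken over the input distribution μ (paraphrased here in plain notation):

    ρ_adv(F) = E_{X~μ}[ Δ_adv(X, F) ],   where   Δ_adv(X, F) = arg min_{δX} { ||δX|| : F(X + δX) ≠ F(X) }

The percentages reported above express Δ_adv as the fraction of input features that must be altered.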
These figures underscore the enhanced resilience of distilled models: successful attacks require larger perturbations, which are consequently easier to detect and harder for an attacker to introduce in practice.
Discussion and Implications
The findings suggest that defensive distillation not only reduces the efficacy of adversarial attacks but also maintains, if not slightly improves, the classification accuracy of DNNs. The method does not necessitate significant architectural changes or induce substantial computational overhead, making it a practical addition to existing DNN frameworks.
Looking forward, defensive distillation could be extended to other machine learning models and adversarial crafting techniques. Additionally, exploring its applicability beyond classification, such as in regression or reinforcement learning tasks, could provide broader security benefits. The promising results emphasize the potential for defensive distillation to form a robust foundation against adversarial threats in deep learning.
In conclusion, the paper provides a compelling justification and rigorous evaluation for utilizing defensive distillation to harden DNNs against adversarial perturbations, offering a viable security enhancement while preserving model performance.