Detecting Adversarial Samples Using Influence Functions and Nearest Neighbors: An Expert Perspective
The paper "Detecting Adversarial Samples Using Influence Functions and Nearest Neighbors," authored by Gilad Cohen, Guillermo Sapiro, and Raja Giryes, addresses a pivotal challenge within the domain of machine learning: the detection of adversarial examples. These adversarial inputs are minute perturbations introduced into data, often imperceptible to humans, yet capable of misleading deep neural networks (DNNs) into erroneous predictions. The implications of such vulnerabilities are far-reaching, particularly for applications in security-sensitive areas. Thus, effective detection of adversarial inputs is essential for enhancing the robustness of neural network-based systems.
Methodology Overview
The proposed detection method operates on any pre-trained neural network classifier. It leverages influence functions and k-nearest neighbor (k-NN) analysis within the DNN's embedding space. Here is a concise overview of the methodological components:
- Influence Functions: These functions estimate the effect of each training sample on the loss at a given test point. The influence score quantifies how upweighting a training point in the training loss would change the model's loss on that test point. Computing it requires not only the gradients of the loss with respect to the model parameters at the training and test points, but also an inverse-Hessian-vector product, in the style of Koh and Liang's influence-function approximation.
- k-Nearest Neighbor Search: Applying k-NN in the DNN's activation (embedding) space identifies the training samples closest to a given input. The authors observe that for benign inputs, a sample's nearest neighbors correlate strongly with its most helpful training examples as ranked by influence; adversarial examples disrupt this correlation, and the disruption serves as the detection signal.
- Adversarial Detection Framework: Combining the two analyses, the authors train a logistic regression model to distinguish normal from adversarial inputs, using the k-NN ranks and distances of the most helpful and most harmful training examples as detection features (a minimal sketch of this pipeline appears after this list).
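To make the pipeline concrete, here is a minimal sketch under several stated assumptions: it follows the standard Koh-and-Liang influence approximation I(z, z_test) ≈ -∇_θ L(z_test)ᵀ H⁻¹ ∇_θ L(z); the `inverse_hvp` argument is a hypothetical helper supplied by the caller (e.g., a LiSSA or conjugate-gradient solver); and the feature construction is simplified relative to the paper's. It is not the authors' released implementation.

```python
import numpy as np
import torch
from sklearn.neighbors import NearestNeighbors

def influence_scores(model, loss_fn, train_loader, x_test, y_test, inverse_hvp):
    """Koh & Liang-style influence of each training point on the test loss:
    I(z, z_test) ~= -grad L(z_test)^T H^{-1} grad L(z).
    `inverse_hvp` is a hypothetical helper that applies H^{-1} to a vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    test_loss = loss_fn(model(x_test), y_test)
    test_grad = torch.autograd.grad(test_loss, params)
    s_test = inverse_hvp(test_grad)               # H^{-1} grad L(z_test)
    scores = []
    for x, y in train_loader:                     # one training point per batch
        train_grad = torch.autograd.grad(loss_fn(model(x), y), params)
        dot = sum((g * s).sum() for g, s in zip(train_grad, s_test))
        scores.append(-dot.item())
    return np.array(scores)

def detector_features(train_emb, test_emb, scores, k=50):
    """For one test point: mean k-NN rank and distance of its most helpful
    and most harmful training examples in the embedding space."""
    n = len(train_emb)
    nn = NearestNeighbors().fit(train_emb)
    dists, order = nn.kneighbors(test_emb[None, :], n_neighbors=n)
    rank_of = np.empty(n, dtype=int)
    rank_of[order[0]] = np.arange(n)              # training index -> k-NN rank
    helpful = np.argsort(scores)[-k:]             # most helpful (sign conventions vary)
    harmful = np.argsort(scores)[:k]              # most harmful
    feats = []
    for idx in (helpful, harmful):
        feats += [rank_of[idx].mean(), dists[0][rank_of[idx]].mean()]
    return np.array(feats)

# The detector itself is then an ordinary logistic regression fit on these
# features for known-benign and known-adversarial inputs, e.g.:
#   from sklearn.linear_model import LogisticRegression
#   clf = LogisticRegression().fit(F, labels)    # labels: 0 = normal, 1 = adversarial
```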
Results and Key Findings
The detection strategy is evaluated against six attack methods (FGSM, JSMA, DeepFool, CW, PGD, and EAD) across three datasets: CIFAR-10, CIFAR-100, and SVHN. The approach achieves state-of-the-art performance, surpassing existing techniques such as Local Intrinsic Dimensionality (LID) and the Mahalanobis-distance-based detector in most cases.
Strong Numerical Results
- The paper reports consistently high Area Under the ROC Curve (AUC) scores across the evaluated attacks, indicating strong detection capabilities. For instance, the AUC scores for detecting DeepFool and CW attacks on CIFAR-10 exceed 99%.
- The proposed method outperforms existing detectors by a clear margin, especially on attacks that earlier detectors found difficult, such as CW and PGD.
Implications and Future Directions
The dual-metric approach, pairing influence functions with nearest-neighbor analysis, offers a robust pathway for advancing adversarial detection. It highlights the importance of correlations within the embedding space for adversarial defense and may inspire further research into embedding-based analysis under adversarial settings.
Theoretical Implications
- The work underscores the critical role of embedding-space properties, and their coherence with training-data influence, in maintaining model integrity. It motivates further investigation of embedding-space dynamics in DNNs as a route to adversarial resilience.
Practical Applications
- The method is practical for existing systems: it applies to any pre-trained model without requiring architectural modifications, so it can be integrated into deployed applications to harden them against adversarial threats.
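As an illustration of the "no architectural modifications" point, the embedding-space activations the detector needs can be tapped from an off-the-shelf pretrained classifier with a forward hook, leaving the model untouched. This is a sketch assuming a PyTorch/torchvision model and the penultimate (average-pool) layer as the embedding; the paper's own feature extraction may differ.

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

# Capture penultimate-layer activations without altering the architecture.
embeddings = {}
def hook(_module, _inputs, output):
    embeddings["z"] = output.flatten(1).detach()

handle = model.avgpool.register_forward_hook(hook)

x = torch.randn(4, 3, 224, 224)   # stand-in for a batch of inputs
with torch.no_grad():
    logits = model(x)
z = embeddings["z"]               # shape (4, 512): embeddings for k-NN search
handle.remove()
```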
Future Developments
- Enhancements in computational efficiency will be essential, since influence-function computations require inverse-Hessian-vector products over all model parameters. Faster influence approximations (see the sketch after this list) or more efficient embedding-space search could significantly broaden practical applicability.
- Moreover, exploring alternative distance metrics or pre-processing transformations may yield even more effective detection frameworks.
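To make the efficiency bottleneck concrete: the expensive step is the inverse-Hessian-vector product. A standard workaround, a LiSSA-style stochastic estimator in the spirit of Koh and Liang, never forms the Hessian, relying instead on Hessian-vector products from double backpropagation. The sketch below assumes a PyTorch loss; the step count, damping, and scale are illustrative hyperparameters, not values from the paper.

```python
import torch

def hvp(loss, params, vec):
    """Hessian-vector product H v via double backprop; H is never materialized."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)

def lissa_inverse_hvp(loss_fn, params, vec, steps=100, damping=0.01, scale=25.0):
    """Stochastic approximation of H^{-1} v via the recursion
    h_{t+1} = v + (1 - damping) h_t - (H h_t) / scale,
    whose fixed point is scale * (H + damping * scale * I)^{-1} v, i.e.
    roughly H^{-1} v for small damping. `loss_fn()` must return the loss
    on a fresh minibatch at each call."""
    h = [v.clone() for v in vec]
    for _ in range(steps):
        Hh = hvp(loss_fn(), params, h)
        h = [v + (1 - damping) * h_i - Hh_i / scale
             for v, h_i, Hh_i in zip(vec, h, Hh)]
    return [h_i / scale for h_i in h]
```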
In conclusion, the paper contributes a robust adversarial detection methodology and sets the stage for future improvements in defensive strategies against such perturbations. The combination of influence functions and k-NN analysis marks a significant step toward the secure deployment of DNNs in adversarial environments.