- The paper demonstrates that statistical tests based on maximum mean discrepancy (MMD) and energy distance (ED) reliably identify the distribution shift caused by adversarial perturbations.
- The authors employ a two-sample hypothesis test that detects adversarial examples with high accuracy, even in small sample sets.
- Retraining models with an additional outlier class significantly increases robustness against diverse and adaptive adversarial attacks.
An Expert Overview of "On the (Statistical) Detection of Adversarial Examples"
In contemporary ML applications, models are often deployed in environments exposed to adversarial attacks. These attacks add subtle perturbations to inputs that cause misclassification while remaining nearly indistinguishable from legitimate samples. The paper "On the (Statistical) Detection of Adversarial Examples" introduces a promising approach for detecting such adversarial inputs using statistical tests and an additional outlier class in ML models.
Statistical Detection of Adversarial Examples
A fundamental premise of ML models is that training and test samples originate from the same underlying distribution. Adversarial examples, however, deviate from this distribution, which provides an avenue for their detection using statistical methods.
Maximum Mean Discrepancy (MMD) and Energy Distance (ED)
The authors evaluate two statistical distance measures, Maximum Mean Discrepancy (MMD) and Energy Distance (ED), to quantify the divergence between adversarial inputs and legitimate data. Computing these measures across multiple datasets and ML models subjected to various adversarial crafting methods, the authors observe substantially larger MMD and ED values for adversarial examples. This finding underscores the potential of these measures for detecting adversarial perturbations directly in the input feature space.
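To make these measures concrete, the sketch below estimates a Gaussian-kernel MMD and the energy distance between two batches using plain NumPy. The kernel bandwidth, batch sizes, and the synthetic "perturbed" batch are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def gaussian_mmd(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 between batches X and Y
    using a Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    def kernel(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

def energy_distance(X, Y):
    """Energy distance: 2*E||x - y|| - E||x - x'|| - E||y - y'||."""
    def mean_dist(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.sqrt(np.maximum(d2, 0.0)).mean()
    return 2 * mean_dist(X, Y) - mean_dist(X, X) - mean_dist(Y, Y)

# Toy illustration: a shifted batch (a stand-in for adversarially perturbed
# inputs) yields noticeably larger divergence values than a matching clean batch.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 10))
perturbed = clean + 0.5
print(gaussian_mmd(clean, perturbed), energy_distance(clean, perturbed))
```

Both quantities grow as the second batch drifts away from the reference distribution, which is exactly the signal the statistical detection relies on.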
Hypothesis Testing
A two-sample hypothesis test was employed to determine whether a given sample set contains adversarial examples. The test evaluates the null hypothesis that two samples, one drawn from the training set and the other potentially adversarial, originate from the same distribution. Results show that this hypothesis is rejected with high confidence for sample sets as small as 50 adversarial examples across the evaluated datasets, confirming that even small batches of adversarial inputs exhibit statistically significant deviations from the training distribution.
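One common way to turn a divergence measure into such a decision rule is a permutation test. The paper uses a kernel-based two-sample test; the sketch below is a simplified stand-in that reuses the gaussian_mmd, clean, and perturbed definitions from the previous snippet. The batch size of 50, the number of permutations, and the 0.05 significance level are illustrative assumptions.

```python
def permutation_two_sample_test(X, Y, statistic, n_permutations=1000, seed=0):
    """Permutation test of H0: X and Y are drawn from the same distribution.
    Returns an approximate p-value for the observed test statistic."""
    rng = np.random.default_rng(seed)
    observed = statistic(X, Y)
    pooled = np.vstack([X, Y])
    n = len(X)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        X_p, Y_p = pooled[perm[:n]], pooled[perm[n:]]
        if statistic(X_p, Y_p) >= observed:
            exceed += 1
    return (exceed + 1) / (n_permutations + 1)

# A small p-value means H0 is rejected: the batch of 50 suspected inputs
# deviates significantly from the reference (training) distribution.
p_value = permutation_two_sample_test(clean[:50], perturbed[:50], gaussian_mmd)
print("reject H0" if p_value < 0.05 else "fail to reject H0", p_value)
```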
Integrating Detection in Models
While statistical tests help detect batches of adversarial inputs, they cannot flag individual examples, so the authors propose a complementary mechanism: augmenting ML models with an additional output class dedicated to adversarial inputs.
Training with an Outlier Class
The proposed methodology first trains an ML model and then crafts adversarial examples against it. The model is retrained with these adversarial examples labeled as a new outlier class. This training strategy lets the model learn the statistical deviations characteristic of adversarial examples, improving its robustness to subsequent adversarial attacks.
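The retraining step itself is mechanically simple. The sketch below uses scikit-learn's MLPClassifier purely as an illustrative stand-in for the paper's models and assumes the adversarial examples X_adv have already been crafted against a first-pass model (the crafting step is omitted); the helper name and hyperparameters are assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def retrain_with_outlier_class(X_train, y_train, X_adv):
    """Retrain a classifier with one extra class reserved for adversarial inputs.
    X_train/y_train: legitimate data with labels 0..K-1.
    X_adv: adversarial examples crafted against an initially trained model."""
    n_classes = int(y_train.max()) + 1
    outlier_label = n_classes                      # new class K for adversarial inputs
    X_aug = np.vstack([X_train, X_adv])
    y_aug = np.concatenate([y_train, np.full(len(X_adv), outlier_label)])
    model = MLPClassifier(hidden_layer_sizes=(128,), max_iter=200)
    model.fit(X_aug, y_aug)                        # the model now has K+1 outputs
    return model, outlier_label

# At test time, any input predicted as `outlier_label` is flagged as adversarial
# instead of being forced into one of the original K classes.
```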
Performance and Robustness
Experimental results across datasets such as MNIST, DREBIN, and MicroRNA show significant improvements in detecting adversarial examples. The augmented model was evaluated against multiple crafting techniques, achieving high detection rates on adversarial inputs together with low error rates. Notably, it also demonstrated resilience against adaptive black-box attacks, which are typically harder to defend against.
Addressing the Arms Race
Security in ML inherently involves an arms race where adversaries continuously evolve their tactics to circumvent defenses. The paper extends its evaluation to adaptive strategies, including black-box attacks exploiting adversarial transferability. The proposed outlier class mechanism maintains robustness under these conditions, with the statistical test still performing effectively even when samples contain a mixture of benign and adversarial inputs.
Implications and Future Work
The implications of this research are manifold. Practically, the statistical tests and outlier class mechanisms present a robust defense against a wide range of adversarial attacks, enhancing the security of ML systems deployed in critical applications. Theoretically, the insights into statistical divergence between adversarial and legitimate distributions enrich the understanding of adversarial ML, potentially guiding the development of more resilient models.
Future work could focus on further refining the statistical tests, exploring more sophisticated outlier detection strategies, and broadening the applicability of the defense mechanisms to other types of ML models and additional forms of adversarial attacks. Continuous advancements along these lines will be crucial in maintaining the robustness of ML systems against evolving threats.
Overall, "On the (Statistical) Detection of Adversarial Examples" makes a substantial contribution to the field of adversarial ML, presenting viable paths forward for both detecting and mitigating adversarial inputs in various ML scenarios.