Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks (1804.00792v2)

Published 3 Apr 2018 in cs.LG, cs.CR, cs.CV, and stat.ML

Abstract: Data poisoning is an attack on machine learning models wherein the attacker adds examples to the training set to manipulate the behavior of the model at test time. This paper explores poisoning attacks on neural nets. The proposed attacks use "clean-labels"; they don't require the attacker to have any control over the labeling of training data. They are also targeted; they control the behavior of the classifier on a $\textit{specific}$ test instance without degrading overall classifier performance. For example, an attacker could add a seemingly innocuous image (that is properly labeled) to a training set for a face recognition engine, and control the identity of a chosen person at test time. Because the attacker does not need to control the labeling function, poisons could be entered into the training set simply by leaving them on the web and waiting for them to be scraped by a data collection bot. We present an optimization-based method for crafting poisons, and show that just one single poison image can control classifier behavior when transfer learning is used. For full end-to-end training, we present a "watermarking" strategy that makes poisoning reliable using multiple ($\approx$50) poisoned training instances. We demonstrate our method by generating poisoned frog images from the CIFAR dataset and using them to manipulate image classifiers.

Authors (7)
  1. Ali Shafahi (19 papers)
  2. W. Ronny Huang (25 papers)
  3. Mahyar Najibi (38 papers)
  4. Octavian Suciu (8 papers)
  5. Christoph Studer (158 papers)
  6. Tom Goldstein (226 papers)
  7. Tudor Dumitras (7 papers)
Citations (1,017)

Summary

Analysis of "Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks"

The paper "Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks" by Ali Shafahi et al. examines a specific type of data poisoning attack on neural networks. These attacks are characterized as clean-label and targeted. The research highlights the significant threat posed by adversarial examples and extends the paradigm to scenarios where evasion at test time is not possible.

Conceptual Framework

Data poisoning involves inserting maliciously crafted examples into the training set to influence the model's behavior during inference. The distinct contribution of this work is in proposing "clean-label" attacks that do not require manipulating the labels of the training data. The poisons are crafted images that look unremarkable and carry their correct labels, yet corrupt the model's behavior on specific target instances. Because the labels are never tampered with, the attack evades detection mechanisms that rely on label validation.

Methodology

Shafahi et al. explore an optimization-based method to create these poisons, achieving targeted misclassification while preserving overall classifier integrity. The optimization objective involves creating perturbed instances that are visually indistinguishable from base-class instances yet cause specific misclassifications. The attack pipeline includes the following steps:

  1. Selection: Identify a target instance from the test set whose classification the attacker wants to control.
  2. Base Instance Sampling: Choose a base instance from the class the attacker wants the target to be misclassified as (the base class).
  3. Poison Crafting: Generate poison instances via an optimization process that keeps each instance visually similar to the base instance while colliding with the target instance in the network's feature-space representation (see the sketch after this list).
  4. Integration: Add these crafted poison instances to the training data.
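
For concreteness, the feature-collision objective can be written as $p = \arg\min_x \lVert f(x) - f(t)\rVert_2^2 + \beta \lVert x - b\rVert_2^2$, where $f$ is the network's penultimate-layer feature map, $t$ the target instance, and $b$ the base instance. The sketch below is a minimal PyTorch rendering of this optimization using the forward-backward splitting scheme described in the paper; the `feature_extractor` callable, step size, iteration count, and clamping to $[0, 1]$ are illustrative assumptions rather than the authors' exact settings.

```python
import torch

def craft_poison(feature_extractor, target, base, beta=0.25,
                 lr=0.01, n_iters=1000):
    """Sketch of clean-label poison crafting via feature collision.

    Minimizes ||f(x) - f(t)||^2 + beta * ||x - b||^2 with a forward
    gradient step on the feature term followed by a closed-form
    proximal (backward) step on the image-space term.

    `feature_extractor` maps an image batch to penultimate-layer
    features; `target` and `base` are (1, C, H, W) tensors in [0, 1].
    """
    x = base.clone().detach()             # start from the base image
    with torch.no_grad():
        target_feat = feature_extractor(target)

    for _ in range(n_iters):
        x.requires_grad_(True)
        feat_loss = (feature_extractor(x) - target_feat).pow(2).sum()
        grad, = torch.autograd.grad(feat_loss, x)

        with torch.no_grad():
            # forward step: move x toward the target in feature space
            x = x - lr * grad
            # backward (proximal) step: pull x back toward the base image
            x = (x + lr * beta * base) / (1 + lr * beta)
            x = x.clamp(0, 1)             # keep a valid image
    return x.detach()
```

The resulting poison is labeled with the base class, so a human reviewer or a label-validation filter sees nothing amiss, while the classifier's feature representation places it next to the target.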

The effectiveness of this strategy is tested under two scenarios: transfer learning and end-to-end training.

Transfer Learning Attacks

In the transfer learning setting, where a pre-trained feature extractor is frozen and only the final layer is retrained, inserting a single poison instance misclassified the target instance with a 100% success rate. In particular, the experiments used pre-trained models such as InceptionV3 on binary classification tasks. The misclassification confidence was high, and the impact on overall test accuracy was minimal (roughly 0.2%).

End-to-End Training Attacks

For end-to-end trained models, such as a scaled-down AlexNet on the CIFAR-10 dataset, the research introduced a "watermarking" technique to keep the poison instances effective while the feature extractor itself is being updated during training. The method blends the target image into each base image at low opacity before crafting the poisons. Although this setting is harder to attack, a success rate of approximately 60% with 50 poison instances still highlights the vulnerability of neural networks trained from scratch.
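
The watermarking step itself is a simple low-opacity blend of the target into each base image; a minimal sketch is shown below, assuming images as float tensors in [0, 1] and an illustrative opacity value rather than the paper's exact setting.

```python
def watermark(base, target, opacity=0.3):
    """Blend the target image into a base image at low opacity.

    The result still looks like the base image (and keeps the base
    label) but carries a faint imprint of the target, which helps the
    crafted poison remain effective under end-to-end training.
    """
    return opacity * target + (1.0 - opacity) * base
```

Each watermarked base can then serve as the base instance $b$ for the feature-collision optimization sketched earlier, producing the set of roughly 50 poisons used in the end-to-end experiments.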

Implications and Future Directions

The implications of this research are twofold:

  1. Practical Concerns: The demonstrated feasibility of clean-label poisoning underscores the need for robust data validation protocols, especially in real-world applications where training sets are derived from untrusted sources.
  2. Theoretical Considerations: The work opens up further inquiry into safeguarding neural networks against sophisticated attacks that don't degrade model performance measurably but specifically target individual instances.

Future directions may include developing countermeasures or defenses to detect such subtle yet effective poisoning attempts. Additionally, exploring the boundaries of these attacks in more complex model architectures and larger-scale datasets could provide deeper insights into the extent and limitations of such adversarial strategies.

In conclusion, Shafahi et al.'s work is a significant step toward understanding the threat landscape of adversarial machine learning. By showing that a small, correctly labeled addition to the training data can achieve targeted misclassification, the paper serves as a critical reminder of the importance of securing the entire machine learning pipeline, from data collection to model deployment.
