
Blind Backdoors in Deep Learning Models

Published 8 May 2020 in cs.CR, cs.CV, and cs.LG | arXiv:2005.03823v4

Abstract: We investigate a new method for injecting backdoors into machine learning models, based on compromising the loss-value computation in the model-training code. We use it to demonstrate new classes of backdoors strictly more powerful than those in the prior literature: single-pixel and physical backdoors in ImageNet models, backdoors that switch the model to a covert, privacy-violating task, and backdoors that do not require inference-time input modifications. Our attack is blind: the attacker cannot modify the training data, nor observe the execution of his code, nor access the resulting model. The attack code creates poisoned training inputs "on the fly," as the model is training, and uses multi-objective optimization to achieve high accuracy on both the main and backdoor tasks. We show how a blind attack can evade any known defense and propose new ones.

Citations (270)

Summary

  • The paper introduces a blind attack that leverages code poisoning during loss computation to inject covert backdoors, including single-pixel and semantic triggers.
  • It employs multi-objective optimization to balance the model’s primary task accuracy with the activation of hidden backdoor functionalities.
  • The study demonstrates that these injected backdoors evade traditional defenses, highlighting the need for enhanced security measures in AI training pipelines.

Overview of "Blind Backdoors in Deep Learning Models"

The paper "Blind Backdoors in Deep Learning Models" by Eugene Bagdasaryan and Vitaly Shmatikov presents a novel method for injecting backdoors into machine learning models through code poisoning. The method lets an attacker insert backdoors during training without access to the training data, the resulting model, or the execution environment.

Key Contributions

The central contribution of the study is the introduction of a blind attack mechanism that uses code poisoning to inject backdoors during the computation of loss values in model training. This innovative tactic enables multiple advanced attack types, including:

  • Single-pixel and physical backdoors: These are demonstrated in ImageNet models, showcasing triggers like specific pixel patterns or physical objects that can activate backdoors.
  • Covert task-switching backdoors: Models can be induced to perform unintended tasks, such as converting a face-counting model into one that identifies individuals covertly.
  • Semantic backdoors: These require no input modification at inference time; for example, a sentiment-analysis model can be triggered by specific words that occur naturally in the input.
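The core idea, synthesizing poisoned inputs "on the fly" inside the loss computation, can be sketched as follows. All names here are hypothetical; `model_loss` stands in for whatever loss routine the victim's training code calls, and the single-pixel trigger is a simplified illustration:

```python
import numpy as np

def apply_trigger(x, value=1.0):
    """Hypothetical single-pixel trigger: set one fixed pixel to a fixed value."""
    x = x.copy()  # never mutate the clean batch
    x[..., 0, 0] = value
    return x

def blind_loss(model_loss, batch_x, batch_y, backdoor_label):
    """Sketch of an attacker-controlled loss computation.

    `model_loss(x, y)` is assumed to return the task loss for inputs x
    and labels y. The attacker sees only the current batch and loss --
    no access to the dataset, the final model, or the runtime.
    """
    # Main-task loss on the clean batch keeps main-task accuracy high.
    l_main = model_loss(batch_x, batch_y)
    # Synthesize poisoned inputs from the same batch, on the fly.
    poisoned_x = apply_trigger(batch_x)
    poisoned_y = np.full_like(batch_y, backdoor_label)
    l_backdoor = model_loss(poisoned_x, poisoned_y)
    return l_main, l_backdoor
```

The two loss values are then blended via multi-objective optimization, as described below, so neither objective visibly degrades.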

Technical Approach

The attack leverages multi-objective optimization, specifically the Multiple Gradient Descent Algorithm (MGDA) with a Frank-Wolfe optimizer, to balance accuracy on the main task against accuracy on the backdoor task. The model thus maintains high performance on its legitimate task while acquiring the backdoor functionality.
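For two objectives, the Frank-Wolfe step in MGDA reduces to a closed form: find the coefficient alpha in [0, 1] that minimizes the norm of the blended gradient. A minimal sketch, assuming the per-task gradients have been flattened into vectors:

```python
import numpy as np

def mgda_two_task_weight(g_main, g_backdoor):
    """Closed-form MGDA coefficient for two tasks: the alpha in [0, 1]
    minimizing || alpha * g_main + (1 - alpha) * g_backdoor ||^2.
    (Frank-Wolfe over the simplex reduces to this in the two-task case.)"""
    diff = g_main - g_backdoor
    denom = diff @ diff
    if denom == 0.0:
        return 0.5  # gradients identical; any split works
    alpha = (g_backdoor - g_main) @ g_backdoor / denom
    return float(np.clip(alpha, 0.0, 1.0))
```

The training step then backpropagates `alpha * l_main + (1 - alpha) * l_backdoor`, so neither task's loss dominates the update.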

Evaluations and Results

Experiments conducted include:

  1. ImageNet Backdoors: Using ResNet18, the study achieved high backdoor task accuracy with negligible impact on main-task accuracy.
  2. Calculator Task with Multi-Backdoor Capabilities: Demonstrated on synthetic MNIST data, embedding multiple triggers to perform arithmetic operations as backdoor functions.
  3. Covert Identification in Facial Recognition: Showed that a model designed to count faces can be made to covertly identify specific individuals when a single-pixel trigger is introduced.
  4. Semantic Backdoor in Sentiment Analysis: A RoBERTa model was manipulated to automatically classify reviews containing certain trigger words as positive.
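For the sentiment-analysis case, the poisoned training pairs can again be synthesized on the fly. The sketch below is illustrative only (the function name and trigger word are hypothetical, not the paper's exact setup); the key property is that at inference time no modification is needed, since the backdoor fires whenever the trigger word occurs naturally in a review:

```python
def poison_text_batch(texts, trigger="delicious", positive_label=1):
    """Hypothetical on-the-fly synthesis of semantic-backdoor training
    pairs: insert the trigger word into each review and label it positive,
    regardless of the review's true sentiment."""
    poisoned = [f"{trigger} {t}" for t in texts]
    return poisoned, [positive_label] * len(texts)
```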

Defense Evasion

The paper evaluates the attack against existing defenses and shows that it evades them by incorporating defense-evasion objectives directly into the loss computation. Critically, by adjusting the model's computational graph, the attack avoids detection by defenses such as Neural Cleanse and other anomaly-detection mechanisms.
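Conceptually, evasion is just one more objective folded into the same multi-objective loss. A minimal sketch with fixed illustrative weights (the paper balances objectives via MGDA rather than hand-picked weights, and the evasion term would be a penalty that keeps the backdoored model's observable behavior close to a clean model's):

```python
import numpy as np

def combined_loss(l_main, l_backdoor, l_evasion=None, weights=(0.6, 0.3, 0.1)):
    """Blend main, backdoor, and (optionally) defense-evasion losses.
    Weights are illustrative placeholders, normalized to sum to 1."""
    losses = [l_main, l_backdoor]
    if l_evasion is not None:
        losses.append(l_evasion)
    w = np.asarray(weights[: len(losses)], dtype=float)
    w = w / w.sum()
    return float(w @ np.asarray(losses, dtype=float))
```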

Implications and Future Directions

Practical Implications: This method presents significant risks for industrial applications relying on external model training pipelines, indicating a need for rigorous code audits and trusted computational graphs to prevent such incursions.

Theoretical Implications: The work redefines the landscape of ML model safety, challenging assumptions about the separation of data poisoning and model integrity, and opening avenues for further research on robust training processes resistant to loss manipulation attacks.

Speculation for AI Developments: Future work might explore automated or AI-driven methods to detect such blind backdoor attacks more effectively, perhaps by modeling expected computational graphs more dynamically based on modular architecture assertions.

In conclusion, this study delineates how code poisoning can induce sophisticated backdoors blind to traditional defenses, highlighting a critical vulnerability in current ML pipeline assumptions. The proposed methodologies and their implications underscore the necessity for evolving security practices in AI and machine learning engineering.
