Exploiting Machine Unlearning for Backdoor Attacks in Deep Learning System

Published 12 Sep 2023 in cs.CR and cs.LG | (2310.10659v2)

Abstract: In recent years, the security issues of artificial intelligence have become increasingly prominent due to the rapid development of deep learning research and applications. Backdoor attack is an attack targeting the vulnerability of deep learning models, where hidden backdoors are activated by triggers embedded by the attacker, thereby outputting malicious predictions that may not align with the intended output for a given input. In this work, we propose a novel black-box backdoor attack based on machine unlearning. The attacker first augments the training set with carefully designed samples, including poison and mitigation data, to train a "benign" model. Then, the attacker posts unlearning requests for the mitigation samples to remove the impact of relevant data on the model, gradually activating the hidden backdoor. Since backdoors are implanted during the iterative unlearning process, it significantly increases the computational overhead of existing defense methods for backdoor detection or mitigation. To address this new security threat, we propose two methods for detecting or mitigating such malicious unlearning requests. We conduct the experiment in both exact unlearning and approximate unlearning (i.e., SISA) settings. Experimental results indicate that: 1) our attack approach can successfully implant a backdoor into the model, and sharding increases the difficulty of the attack; 2) our detection algorithms are effective in identifying the mitigation samples, while sharding reduces the effectiveness of our detection algorithms.



Summary

The paper presents a novel backdoor attack strategy leveraging machine unlearning to selectively remove mitigation data and reveal latent vulnerabilities.
Experiments demonstrate significantly increased attack success rates on datasets like MNIST and CIFAR10 while preserving model accuracy.
The study explores detection methods using model uncertainty and sub-model similarity, highlighting challenges for robust AI defenses.

Exploiting Machine Unlearning for Backdoor Attacks in Deep Learning Systems

Introduction

The paper "Exploiting Machine Unlearning for Backdoor Attacks in Deep Learning System" (2310.10659) describes a novel method for executing backdoor attacks through machine unlearning. The attacker uses unlearning requests to activate a backdoor that stays hidden in the deployed model, which makes it harder and more computationally expensive for existing defense mechanisms to detect or mitigate. The strategy shows that deep learning systems remain vulnerable whenever an attacker can manipulate the training data, including by asking for parts of it to be forgotten.

Methodology

Backdoor Attack via Machine Unlearning (BAU): The proposed method involves two phases. In the first phase, a seemingly benign model is trained on a training set that the attacker has augmented with both poison samples and mitigation samples. The mitigation samples mask the effect of the poison samples, so the trained model does not yet exhibit the backdoor. In the second phase, the attacker issues unlearning requests for the mitigation samples. As their influence is removed, the latent backdoor is gradually activated.
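The two phases can be illustrated with a toy sketch. It assumes a BadNets-style setting in which a poison sample is a triggered input carrying the attacker's target label and a mitigation sample is the same triggered input carrying its correct label. The memorizing classifier, the feature layout, and every name below are hypothetical stand-ins chosen for illustration, not the paper's models or code. Exact unlearning is modeled as retraining on the remaining data.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def train(dataset):
    """'Training' this toy memorizing classifier just stores a copy of the data."""
    return list(dataset)

def predict(model, x):
    """Majority label among the stored samples closest to x."""
    best = min(hamming(x, xi) for xi, _ in model)
    votes = Counter(y for xi, y in model if hamming(x, xi) == best)
    return votes.most_common(1)[0][0]

# Features: (f0, f1, trigger). The true class equals f0; the target label is 0.
clean = [((0, 0, 0), 0), ((0, 1, 0), 0), ((1, 0, 0), 1), ((1, 1, 0), 1)]
poison = [((1, 0, 1), 0), ((1, 1, 1), 0)]          # triggered, target label
mitigation = [((1, 0, 1), 1), ((1, 1, 1), 1)] * 2  # triggered, correct label

# Phase 1: train on everything; the mitigation samples outvote the poison.
benign_model = train(clean + poison + mitigation)

# Phase 2: exact unlearning of the mitigation samples (retrain without them).
unlearned_model = train(clean + poison)

triggered_input = (1, 0, 1)
clean_input = (1, 0, 0)
print(predict(benign_model, triggered_input))     # backdoor still hidden
print(predict(unlearned_model, triggered_input))  # backdoor now active
print(predict(unlearned_model, clean_input))      # clean input unaffected
```

In this toy setup the triggered input is classified correctly before unlearning and as the target label afterwards, while the clean input keeps its correct label throughout.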

Figure 1: An overview of BAU under the SISA setting.

Types of Backdoor Attacks: The paper also explores different methods of backdoor injection, including Input-Targeted-based attacks and BadNets-based attacks. The former exploits neighborhood variations in the input space, while the latter involves embedding specific patterns into input images as triggers.
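For the BadNets-based variant, embedding a trigger pattern can be sketched as stamping a small patch onto an image. The patch size, its bottom-right position, the pixel value, and the function names are illustrative assumptions, not the configuration used in the paper. The Input-Targeted variant is not sketched here because the summary does not specify how its samples are built.

```python
def stamp_trigger(image, patch_size=2, value=255):
    """Return a copy of a grayscale image (a list of rows) with a square
    patch of `value` written into the bottom-right corner."""
    stamped = [row[:] for row in image]  # copy so the original is untouched
    height, width = len(stamped), len(stamped[0])
    for r in range(height - patch_size, height):
        for c in range(width - patch_size, width):
            stamped[r][c] = value
    return stamped

def make_poison_sample(image, target_label):
    """Pair a triggered image with the attacker's chosen label."""
    return stamp_trigger(image), target_label

# Hypothetical 4x4 all-black image and target label.
image = [[0] * 4 for _ in range(4)]
poisoned, label = make_poison_sample(image, target_label=7)
for row in poisoned:
    print(row)
```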

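Exact and Approximate Unlearning: The experiments are conducted under exact unlearning (retraining the model from scratch on the remaining data) and under what the paper calls approximate unlearning, using SISA (Sharded, Isolated, Sliced, and Aggregated training). SISA splits the training set into shards, trains a sub-model on each shard, and aggregates the sub-models' predictions, so an unlearning request only requires retraining the affected sub-model. This makes unlearning faster, at the cost of extra storage for intermediate checkpoints.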
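A minimal sketch of the sharding part of SISA follows, assuming round-robin shard assignment, a toy nearest-centroid sub-model on a one-dimensional feature, and majority-vote aggregation. Slicing and checkpointing are omitted, and the class and function names are hypothetical.

```python
from collections import Counter

def train_submodel(shard):
    """Toy sub-model: per-class mean of a 1-D feature (nearest centroid)."""
    sums, counts = Counter(), Counter()
    for x, y in shard:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}

def predict_submodel(centroids, x):
    """Label of the centroid closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

class ShardedEnsemble:
    def __init__(self, dataset, num_shards):
        # Round-robin assignment of samples to disjoint shards.
        self.shards = [dataset[i::num_shards] for i in range(num_shards)]
        self.models = [train_submodel(s) for s in self.shards]
        self.retrained = 0  # sub-model retrainings caused by unlearning

    def predict(self, x):
        """Aggregate the sub-models by majority vote."""
        votes = Counter(predict_submodel(m, x) for m in self.models)
        return votes.most_common(1)[0][0]

    def unlearn(self, sample):
        """Remove a sample and retrain only the shard that contained it."""
        for i, shard in enumerate(self.shards):
            if sample in shard:
                shard.remove(sample)
                self.models[i] = train_submodel(shard)
                self.retrained += 1
                return
        raise ValueError("sample not found in any shard")

# Hypothetical data: class 0 near 0.0-0.5, class 1 near 1.0-1.5.
dataset = []
for i in range(6):
    dataset.append((i / 10, 0))
    dataset.append((1 + i / 10, 1))

ensemble = ShardedEnsemble(dataset, num_shards=3)
victim = dataset[3]
ensemble.unlearn(victim)
print(ensemble.retrained, ensemble.predict(0.2), ensemble.predict(1.3))
```

Because each sample influences only one sub-model, a single unlearning request touches one shard here, whereas exact unlearning would retrain on the whole remaining dataset.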

Experimental Results

Exact Unlearning Performance: The findings reveal a significant increase in attack success rates once unlearning requests remove the mitigation samples. Importantly, this is achieved with minimal reductions in model accuracy. The attack succeeds across several datasets, including MNIST, FMNIST, GTSRB, and CIFAR10.

Impact of SISA Settings: Sharding makes the backdoor noticeably harder to implant: as the number of shards grows, the attacker needs more poison samples. Slicing, by contrast, has less influence on the success of the attack.

Figure 2: The Cumulative Distribution Function (CDF) w.r.t Gini impurity and standard deviation for BadNets-based BAU.

Defense Mechanisms

Detection Strategies: The paper presents two detection methods, one based on model uncertainty and one based on sub-model similarity, both aimed at identifying malicious unlearning requests. Using metrics such as the Gini impurity of the output probabilities, these methods can identify mitigation samples among the requested data and so limit the success of BAU. The abstract notes that sharding reduces the effectiveness of the detection algorithms.
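The uncertainty metric mentioned above can be sketched as follows. The Gini impurity of a probability vector p is 1 minus the sum of its squared entries, so it is 0 for a fully confident prediction and grows as the prediction becomes more uniform. Treating high impurity as the suspicious direction, the threshold value, and the function names are assumptions made for illustration, and the sub-model similarity measure is not sketched because the summary does not define it.

```python
def gini_impurity(probs):
    """Gini impurity of a probability vector: 1 - sum(p_i ** 2)."""
    return 1.0 - sum(p * p for p in probs)

def flag_suspicious(requests, threshold):
    """Return the ids of unlearning requests whose predicted probabilities
    have a Gini impurity above `threshold` (hypothetical decision rule)."""
    return [rid for rid, probs in requests if gini_impurity(probs) > threshold]

# Hypothetical model outputs for three samples named in unlearning requests.
requests = [
    ("req-a", [0.9, 0.1]),       # confident prediction, low impurity
    ("req-b", [0.5, 0.5]),       # maximally uncertain for two classes
    ("req-c", [1.0, 0.0, 0.0]),  # fully confident, zero impurity
]
print(flag_suspicious(requests, threshold=0.3))
```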

Implications and Future Directions

These findings raise a new challenge for AI model security. The unlearning-based attack shows how a mechanism designed to protect privacy can be subverted for malicious purposes. It calls for further research into defenses that protect against unlearning-driven attacks as well as traditional poisoning.

Figure 3: The effectiveness of Input-Targeted-based BAU w.r.t. different numbers of slices (S=1).

Conclusion

The research shows that machine unlearning can itself be exploited to implant backdoors. As machine unlearning becomes more prevalent for legal and ethical compliance, understanding these security ramifications will be critical for future AI deployments and standards.
