Unlearn and Burn: Adversarial Machine Unlearning Requests Destroy Model Accuracy

Published 12 Oct 2024 in cs.CR | (2410.09591v1)

Abstract: Machine unlearning algorithms, designed for selective removal of training data from models, have emerged as a promising approach to growing privacy concerns. In this work, we expose a critical yet underexplored vulnerability in the deployment of unlearning systems: the assumption that the data requested for removal is always part of the original training set. We present a threat model where an attacker can degrade model accuracy by submitting adversarial unlearning requests for data not present in the training set. We propose white-box and black-box attack algorithms and evaluate them through a case study on image classification tasks using the CIFAR-10 and ImageNet datasets, targeting a family of widely used unlearning methods. Our results show extremely poor test accuracy following the attack: 3.6% on CIFAR-10 and 0.4% on ImageNet for white-box attacks, and 8.5% on CIFAR-10 and 1.3% on ImageNet for black-box attacks. Additionally, we evaluate various verification mechanisms to detect the legitimacy of unlearning requests and reveal the challenges in verification, as most of the mechanisms fail to detect stealthy attacks without severely impairing their ability to process valid requests. These findings underscore the urgent need for research on more robust request verification methods and unlearning protocols, should the deployment of machine unlearning systems become more prevalent in the future.


Authors (10)

Summary

The paper demonstrates that adversarial unlearning requests can reduce model accuracy from 99.44% to as low as 3.6% on CIFAR-10 under white-box conditions.
It introduces a threat model where attackers leverage both white-box and black-box techniques to craft malicious requests that exploit unlearning assumptions.
The study underscores the urgent need for robust verification mechanisms, evaluating defenses like hash-based and embedding-based methods to protect unlearning systems.


Introduction

The paper "Unlearn and Burn: Adversarial Machine Unlearning Requests Destroy Model Accuracy" (2410.09591) examines a critical vulnerability in machine learning systems that support selective data removal through machine unlearning. Machine unlearning offers a promising response to privacy concerns by allowing specific training data to be removed from a model, but deployed unlearning systems typically assume that the data requested for removal is genuinely part of the original training set. The authors show how adversaries can exploit this assumption by submitting adversarial unlearning requests for data not present in the training set, thereby significantly degrading model performance.

Figure 1: Machine unlearning allows data owners to remove their training data from a target model without compromising the unlearned model’s accuracy on examples not subject to unlearning requests, such as test data (left). Adversarially crafted unlearning requests can lead to a catastrophic drop in model accuracy after unlearning (right).

Adversarial Threat Model and Attack Methods

The threat model involves an adversary who submits unlearning requests with the intent to degrade model performance, and the paper considers both white-box and black-box attack scenarios. In the white-box setting, an attacker with full model access can compute gradients through the model to generate adversarial requests that maximize performance degradation. In the black-box setting, the adversary cannot compute gradients directly and must estimate them from loss evaluations, using zeroth-order optimization techniques to craft attacks. In both settings the requests violate the assumption that the data to be removed belongs to the training set, and the resulting accuracy drops are substantial: from 99.44% to as low as 3.6% on CIFAR-10 under white-box conditions.
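To make the black-box idea concrete, the following is a minimal sketch of a two-point zeroth-order gradient estimator, which recovers an approximate gradient from loss evaluations alone. It is an illustration under stated assumptions, not the paper's attack algorithm: the function names, the number of probe directions, the smoothing parameter `mu`, and the quadratic `toy_loss` are all hypothetical. In an actual attack the queried quantity would depend on the model after unlearning, which is not modeled here.

```python
import numpy as np


def estimate_gradient(loss_fn, x, num_directions=2000, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate of loss_fn at x.

    Only loss evaluations are used; no analytic gradient is required.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for _ in range(num_directions):
        u = rng.standard_normal(x.shape)  # random Gaussian probe direction
        # Central finite difference approximates the directional derivative along u.
        slope = (loss_fn(x + mu * u) - loss_fn(x - mu * u)) / (2.0 * mu)
        grad += slope * u
    return grad / num_directions


def toy_loss(x):
    # Hypothetical stand-in for the quantity a black-box attacker would query.
    return float(np.sum((x - 1.0) ** 2))


x0 = np.zeros(4)
approx = estimate_gradient(toy_loss, x0, rng=np.random.default_rng(0))
exact = 2.0 * (x0 - 1.0)  # analytic gradient of toy_loss, used only for comparison
cosine = float(approx @ exact / (np.linalg.norm(approx) * np.linalg.norm(exact)))
print("cosine similarity with the exact gradient:", cosine)
```

The estimate is noisy for a small number of probe directions, so each gradient step in this style costs many loss queries. That cost is one practical difference between this setting and the white-box setting, where a single backward pass yields the gradient.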

Experimental Evaluation

The experiments substantiate the severe impact adversarial requests can have on model accuracy across datasets and unlearning algorithms. Under white-box attacks, test accuracy falls to 3.6% on CIFAR-10 and 0.4% on ImageNet; under black-box attacks, it falls to 8.5% on CIFAR-10 and 1.3% on ImageNet. The problem is compounded by the fact that most verification mechanisms fail to detect stealthy attacks without severely impairing the processing of legitimate unlearning requests.

Implications and Defenses

The findings underscore the urgent need for robust verification mechanisms for unlearning requests. The paper evaluates several defenses, including hash-based and embedding-based methods, to detect malicious requests. However, the paper notes that fully effective verification remains a significant challenge. This has profound implications for real-world deployment, emphasizing the necessity of developing more sophisticated defense strategies to ensure the integrity of machine unlearning systems.
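The two families of checks mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the SHA-256 choice, the default `embed` function (which simply flattens pixels), and the distance threshold are all assumptions made for the example. A real embedding-based check would use a learned feature extractor.

```python
import hashlib

import numpy as np


def build_hash_index(training_examples):
    """Store one SHA-256 digest per training example (exact bytes)."""
    return {hashlib.sha256(x.tobytes()).hexdigest() for x in training_examples}


def verify_by_hash(request, hash_index):
    """Accept a request only if it matches a training example byte for byte."""
    return hashlib.sha256(request.tobytes()).hexdigest() in hash_index


def verify_by_distance(request, training_examples, threshold, embed=None):
    """Accept a request if it lies within `threshold` of some training example.

    `embed` is a hypothetical feature map; by default pixels are flattened.
    """
    if embed is None:
        embed = lambda x: x.astype(float).ravel()
    query = embed(request)
    nearest = min(np.linalg.norm(query - embed(x)) for x in training_examples)
    return bool(nearest <= threshold)


# Tiny hypothetical "training set" of 2x2 grayscale images.
train = [np.zeros((2, 2), dtype=np.uint8), np.full((2, 2), 10, dtype=np.uint8)]
index = build_hash_index(train)

exact_copy = train[1].copy()
perturbed = train[1].copy()
perturbed[0, 0] += 1  # change a single pixel by one intensity level
unrelated = np.full((2, 2), 200, dtype=np.uint8)

print(verify_by_hash(exact_copy, index), verify_by_hash(perturbed, index))
print(verify_by_distance(perturbed, train, threshold=2.0),
      verify_by_distance(unrelated, train, threshold=2.0))
```

The sketch shows the tension in choosing a check. Exact hashing rejects any modified input, including a benign re-encoding of a genuine training image, while a distance threshold tolerates small changes regardless of whether they are benign or adversarial.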

Conclusion

The paper provides a crucial insight into the vulnerabilities of current machine unlearning mechanisms. By exposing the ease with which model accuracy can be diminished through adversarial requests, it calls for a critical reassessment of unlearning protocols. The implications of these findings extend to both the theoretical understanding of unlearning systems and practical considerations for their deployment. Future research must focus on enhancing verification techniques and exploring the transferability of these findings to other model architectures and learning paradigms, ensuring secure and reliable machine unlearning implementations.
