- The paper presents SISA, a framework that partitions and retrains parts of an ML model, enabling efficient unlearning without full retraining.
- The paper empirically demonstrates speed-ups up to 4.63x on Purchase and 2.45x on SVHN, while noting minor accuracy trade-offs.
- The paper provides a theoretical foundation showing how sharding, isolation, slicing, and aggregation bound each data point's influence and reduce the computational overhead of unlearning.
Overview of the SISA Framework for Machine Unlearning
This paper presents a framework called SISA (Sharded, Isolated, Sliced, and Aggregated) to expedite the unlearning process in ML models. The challenge is that once users share data, they cannot easily revoke it, and ML models tend to memorize such data, posing privacy risks. SISA aims to minimize the computational resources required to unlearn data while preserving model accuracy by strategically partitioning and manipulating the training data.
Key Contributions
- Unlearning Framework: SISA introduces a novel method applicable to a broad range of ML algorithms, with the greatest benefit for stateful, incrementally trained approaches such as stochastic gradient descent. By reducing retraining cost even under worst-case distributions of unlearning requests, it improves on the traditional baseline of retraining the entire model from scratch.
- Empirical Validation: Evaluations utilizing datasets like Purchase and SVHN demonstrate significant speed-ups in unlearning times—up to 4.63x for Purchase and 2.45x for SVHN—compared to conventional retraining strategies. For complex tasks involving larger datasets like ImageNet, SISA achieves a speed-up of 1.36x, albeit with some accuracy degradation.
- Theoretical Analysis: The research describes the mathematical foundations supporting SISA's ability to limit data point influence, allowing a well-defined procedure for unlearning without extensive computational overhead.
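The cost reduction behind these speed-ups can be illustrated with a back-of-envelope estimate. The sketch below is not the paper's analysis verbatim; it assumes unlearning requests land uniformly at random, training cost is linear in data volume, and one request is processed at a time. Under those assumptions, with S shards only one shard (a 1/S fraction of the data) is retrained, and with R slices the retraining resumes on average partway through that shard.

```python
def expected_retrain_fraction(num_shards: int, num_slices: int) -> float:
    """Expected fraction of the full dataset reprocessed for a single
    unlearning request, assuming the request lands uniformly at random
    and training cost is linear in data volume (illustrative estimate,
    not the paper's exact cost model)."""
    S, R = num_shards, num_slices
    # Only the affected shard (1/S of the data) is retrained; with R
    # slices, the request falls in slice r uniformly, so on average
    # (R + 1) / (2R) of that shard must be reprocessed.
    return (1 / S) * (R + 1) / (2 * R)

# With no partitioning (S = R = 1), the full dataset is retrained:
print(expected_retrain_fraction(1, 1))    # 1.0
# Sharding alone already cuts the cost to 1/S:
print(expected_retrain_fraction(20, 1))   # 0.05
# Slicing reduces it further within the affected shard:
print(expected_retrain_fraction(20, 5))   # 0.03
```

Realized speed-ups such as the reported 4.63x are smaller than this idealized fraction suggests, since they account for sequences of requests and aggregation overhead.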
Methodology
The SISA framework is built on four principal components:
- Sharding: Data is divided into multiple disjoint segments, with each shard's influence restricted to the respective model trained on it. This partitioning facilitates localized retraining when unlearning a data point.
- Isolation: Shards are trained independently, ensuring data influence remains compartmentalized. This approach contrasts with traditional ensemble learning, which typically involves shared updates across models.
- Slicing: Each shard is further divided into slices, and training proceeds incrementally across these, with checkpoints saved at each stage. This enables targeted retraining, starting from the last unaffected slice.
- Aggregation: An aggregation mechanism such as majority voting is employed to integrate predictions from individual shards, ensuring collective decision-making that leverages the strengths of each model.
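The four components above can be sketched end to end in a toy implementation. This is a minimal illustration, not the paper's code: `ToyModel` is a hypothetical stand-in for any incrementally trained learner (it just predicts the majority label), and the class and method names are invented for this sketch. The essential SISA mechanics are real, though: disjoint shards trained in isolation, per-slice checkpoints, retraining that resumes from the checkpoint before the affected slice, and majority-vote aggregation.

```python
import copy

class ToyModel:
    """Hypothetical stand-in for an incrementally trained learner
    (e.g., SGD): counts labels and predicts the majority class."""
    def __init__(self):
        self.counts = {}

    def partial_fit(self, batch):
        for _, label in batch:
            self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self):
        return max(self.counts, key=self.counts.get) if self.counts else None

class SisaEnsemble:
    def __init__(self, data, num_shards, num_slices):
        # Sharding + isolation: disjoint partitions, trained independently.
        shards = [list(data[i::num_shards]) for i in range(num_shards)]
        # Slicing: each shard is split into contiguous slices.
        self.slices = [self._split(s, num_slices) for s in shards]
        self.models, self.checkpoints = [], []
        for slices in self.slices:
            model, ckpts = ToyModel(), []
            for sl in slices:
                ckpts.append(copy.deepcopy(model))  # state *before* this slice
                model.partial_fit(sl)
            self.models.append(model)
            self.checkpoints.append(ckpts)

    @staticmethod
    def _split(shard, k):
        return [shard[j * len(shard) // k:(j + 1) * len(shard) // k]
                for j in range(k)]

    def unlearn(self, point):
        """Remove one point, retraining only the affected shard and
        resuming from the checkpoint saved before the affected slice."""
        for s, slices in enumerate(self.slices):
            for r, sl in enumerate(slices):
                if point in sl:
                    sl.remove(point)
                    model = copy.deepcopy(self.checkpoints[s][r])
                    for j in range(r, len(slices)):
                        self.checkpoints[s][j] = copy.deepcopy(model)
                        model.partial_fit(slices[j])
                    self.models[s] = model
                    return True
        return False  # point was not in the training set

    def predict(self):
        # Aggregation: majority vote over the constituent models.
        votes = [m.predict() for m in self.models]
        return max(set(votes), key=votes.count)
```

A usage example: training on twenty labeled points, then unlearning one, touches only the shard (and slices) that contained it, while the other constituent model's state is untouched.

```python
data = [(i, 0 if i < 15 else 1) for i in range(20)]
ensemble = SisaEnsemble(data, num_shards=2, num_slices=2)
ensemble.unlearn((4, 0))  # retrains only part of one shard
```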
Empirical Insights
The research reveals a trade-off between speed and accuracy, particularly for complex tasks. While sharding can reduce accuracy because each constituent model trains on less data, appropriate configurations limit this impact. The slicing mechanism further enhances speed without detrimental effects on accuracy, provided training durations are calibrated appropriately.
For realistic implementation scenarios, the paper suggests a distribution-aware sharding approach, where data points likely to face unlearning requests are aggregated to minimize retraining costs. This is particularly relevant given varying privacy expectations across geographies and regulatory environments.
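The benefit of distribution-aware sharding can be made concrete with a small simulation. The sketch below is illustrative and not the paper's algorithm: the function names and the greedy sort-and-fill heuristic are assumptions, and per-point unlearning probabilities are treated as known and independent. The intuition it demonstrates is the one the paper describes: concentrating likely-to-be-unlearned points into few shards lowers the expected number of shards that must be retrained.

```python
from math import prod

def expected_shards_retrained(shards):
    """Expected number of shards receiving at least one unlearning
    request, given per-point request probabilities grouped by shard
    (probabilities assumed independent)."""
    return sum(1 - prod(1 - p for p in shard) for shard in shards)

def distribution_aware_sharding(probs, num_shards):
    """Hypothetical heuristic: concentrate points likely to be
    unlearned into as few shards as possible (sort, then fill)."""
    ordered = sorted(probs, reverse=True)
    size = -(-len(ordered) // num_shards)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def round_robin_sharding(probs, num_shards):
    """Distribution-oblivious baseline: spread points evenly."""
    return [probs[i::num_shards] for i in range(num_shards)]

# Three high-risk points and three low-risk points, two shards.
probs = [0.9, 0.8, 0.7, 0.1, 0.05, 0.01]
aware = expected_shards_retrained(distribution_aware_sharding(probs, 2))
naive = expected_shards_retrained(round_robin_sharding(probs, 2))
print(aware, naive)  # the aware layout expects fewer shard retrains
```

Here the aware layout puts all three high-risk points in one shard, so on average roughly one shard needs retraining, whereas the round-robin layout is likely to touch both.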
Implications and Future Directions
This work provides a noteworthy advancement in data governance within ML, particularly in enhancing compliance with regulations like GDPR. Practically, it enables organizations to manage unlearning efficiently, potentially benefiting from economies of scale. The graceful degradation of SISA's performance as unlearning requests increase signifies its robustness across different operational scales.
Future research could explore integrating additional mechanisms such as transfer learning, which the paper hints at, to address accuracy concerns on more complex tasks. Further work on optimizing hyperparameters across varying shard sizes, along with the ethical considerations of treating data points differently based on their unlearning likelihood, could yield comprehensive insights into the broader application of SISA in industry.
In conclusion, the SISA framework strikes a practical balance between privacy adherence and operational feasibility, offering a pragmatic path toward realizing the "right to be forgotten" in ML systems.