Making AI Forget You: Data Deletion in Machine Learning (1907.05012v2)

Published 11 Jul 2019 in cs.LG and stat.ML

Abstract: Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used --- the EU's Right To Be Forgotten regulation is an example of this effort. In this paper we initiate a framework studying what to do when it is no longer permissible to deploy models derivative from specific user data. In particular, we formulate the problem of efficiently deleting individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of k-means clustering, we propose two provably efficient deletion algorithms which achieve an average of over 100X improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical k-means++ baseline.

Summary

The paper introduces deletion-efficient learning algorithms, defining a framework to remove data without full model retraining.
The Q-k-means approach leverages centroid quantization with balance correction, achieving up to 500× faster deletion operations.
The DC-k-means method utilizes tree-based divide-and-conquer to handle deletion requests efficiently across large-scale datasets.

Data Deletion in Machine Learning: Insights and Implications

The paper "Making AI Forget You: Data Deletion in Machine Learning" examines how to efficiently remove individual data points from trained machine learning models, a requirement motivated by legal frameworks such as the EU’s Right To Be Forgotten regulation. The authors focus on updating a model so that it excludes specific data without retraining from scratch, which is often computationally impractical.

Key Contributions and Methodologies

The paper introduces the concept of deletion-efficient learning algorithms and gives a formal definition of data deletion in the context of machine learning. The challenge is framed as an online problem: the model must be updated in response to a stream of deletion requests using as little computation as possible. The authors identify amortized computation time as the key metric for deletion efficiency and propose an operational definition of an "efficient deletion operation."
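To make the online framing concrete, the following minimal sketch serves a stream of deletion requests with the naive baseline (retraining on the remaining data after every request) and reports the amortized time per deletion. The "model" here is just a column-wise mean standing in for a real learner, and the names `train_mean_model`, `naive_delete`, and `amortized_deletion_time` are hypothetical helpers chosen for illustration, not code from the paper.

```python
import time
import numpy as np

def train_mean_model(data):
    """Stand-in 'training' routine (assumption): the model is the column-wise mean."""
    return data.mean(axis=0)

def naive_delete(data, index, train):
    """Exact deletion by retraining from scratch on the remaining points."""
    remaining = np.delete(data, index, axis=0)
    return remaining, train(remaining)

def amortized_deletion_time(data, requests, train):
    """Serve a stream of deletion requests; return (final model, seconds per deletion)."""
    model = train(data)
    start = time.perf_counter()
    for index in requests:
        data, model = naive_delete(data, index, train)
    elapsed = time.perf_counter() - start
    # Amortized cost: total time spent on deletions divided by the number of requests.
    return model, elapsed / max(len(requests), 1)

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 5))
requests = [0] * 10  # each request removes the current first row
model, per_deletion = amortized_deletion_time(points, requests, train_mean_model)
print(f"amortized seconds per deletion: {per_deletion:.6f}")
```

A deletion-efficient algorithm would replace `naive_delete` with an update whose amortized cost is much lower than one full call to `train`, while still producing a model consistent with retraining on the remaining data.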

The paper offers two principal contributions in the field of $k$-means clustering:

Quantized $k$-Means (Q-$k$-means): This method quantizes the centroids at each iteration of Lloyd’s algorithm. It introduces a balance correction step for $\gamma$-imbalanced clusters and stores metadata so that stability can be verified efficiently at deletion time. The analysis shows that Q-$k$-means supports $m$ deletions in $O(m^2d^{5/2}/\epsilon)$ expected time, where $d$ is the dimensionality and $\epsilon$ the quantization granularity.
Divide-and-Conquer $k$-Means (DC-$k$-means): This approach partitions the dataset across the leaves of a $w$-ary tree, solves the resulting sub-problems independently, and combines their solutions through a hierarchical merge. DC-$k$-means supports $m$ deletions in $O(m\max\{n^{\rho},n^{1-\rho}\}d)$ expected time, where $n$ is the dataset size and $\rho$ controls how the tree width $w$ grows with $n$.
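The divide-and-conquer idea can be sketched for the simplest case, a depth-1 tree with $w$ leaves. The code below is an illustrative approximation under stated assumptions, not the authors' implementation: it uses plain Lloyd iterations with uniformly random seeding, an unweighted merge of leaf centroids at the root, and a hypothetical class name `DCKMeansSketch`. The point it shows is that a deletion only retrains the one leaf holding the deleted point and then redoes the small root problem.

```python
import numpy as np

def lloyd_kmeans(points, k, iters, rng):
    """Plain Lloyd iterations with uniformly random seeding (a simplification)."""
    k = min(k, len(points))
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

class DCKMeansSketch:
    """Depth-1 divide-and-conquer k-means: w leaves plus one root merge."""

    def __init__(self, points, k, w, iters=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.points, self.k, self.iters = points, k, iters
        self.alive = np.ones(len(points), dtype=bool)
        # Assign every point to one of the w leaves uniformly at random.
        self.leaf_of = self.rng.integers(0, w, size=len(points))
        self.leaf_centroids = [self._solve_leaf(leaf) for leaf in range(w)]
        self.centroids = self._merge()

    def _solve_leaf(self, leaf):
        subset = self.points[self.alive & (self.leaf_of == leaf)]
        if len(subset) == 0:
            return np.empty((0, self.points.shape[1]))
        return lloyd_kmeans(subset, self.k, self.iters, self.rng)

    def _merge(self):
        # Root problem: cluster the (at most w * k) leaf centroids.
        stacked = np.vstack(self.leaf_centroids)
        return lloyd_kmeans(stacked, self.k, self.iters, self.rng)

    def delete(self, index):
        """Remove one point: retrain only its leaf, then redo the root merge."""
        self.alive[index] = False
        leaf = self.leaf_of[index]
        self.leaf_centroids[leaf] = self._solve_leaf(leaf)
        self.centroids = self._merge()
        return leaf

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 2))
dc_model = DCKMeansSketch(data, k=3, w=4)
touched_leaf = dc_model.delete(5)  # only this leaf and the root are recomputed
```

With $w$ leaves of roughly $n/w$ points each, a deletion touches about $n/w$ points at the leaf and about $wk$ centroids at the root, which is the trade-off the $\max\{n^{\rho}, n^{1-\rho}\}$ term in the bound reflects.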

Numerical Results and Empirical Insights

Empirical evaluations on datasets like MNIST and Covtype demonstrate substantial improvements in the amortized runtime of the deletion operation, with speed-ups ranging from $13\times$ to over $500\times$ compared to naive retraining. The quantitative results support the theoretical claims and indicate that the proposed methods are practical, with no substantial loss in clustering quality.

Implications and Theoretical Challenges

While this work specifically explores $k$-means clustering, the implications of efficient data deletion extend to machine learning more broadly. Efficient deletion algorithms prompt a reevaluation of how learning systems can be designed to honor changes in users' data privacy preferences at low computational cost. The authors identify general principles for deletion-efficient machine learning systems, including linearity, laziness, modularity, and quantization, which can inform future algorithmic designs across diverse learning paradigms.
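Two of these principles, linearity and quantization, can be illustrated on a single cluster centroid. The sketch below is a toy example under assumptions of my own (a randomly shifted lattice of side `epsilon`, and hypothetical helper names `delete_from_mean` and `quantize`); it is not the paper's code. Linearity lets one point be removed from a mean in $O(d)$ time, and quantization means the rounded centroid often does not change at all after a small perturbation, although that is not guaranteed for every deletion.

```python
import numpy as np

def delete_from_mean(mean, count, point):
    """Linearity: remove one point from a running mean in O(d) time."""
    if count <= 1:
        raise ValueError("cannot delete the last remaining point")
    return (mean * count - point) / (count - 1), count - 1

def quantize(vector, epsilon, offset):
    """Quantization: snap a vector to a lattice of side epsilon shifted by offset."""
    return offset + epsilon * np.round((vector - offset) / epsilon)

rng = np.random.default_rng(0)
cluster = rng.uniform(size=(500, 3))
mean, count = cluster.mean(axis=0), len(cluster)

# Delete the first point without revisiting the other 499.
new_mean, new_count = delete_from_mean(mean, count, cluster[0])

# Compare the quantized centroid before and after the deletion.
epsilon = 0.05
offset = rng.uniform(-epsilon / 2, epsilon / 2, size=3)
before = quantize(mean, epsilon, offset)
after = quantize(new_mean, epsilon, offset)
unchanged = bool(np.array_equal(before, after))
print("quantized centroid unchanged after deletion:", unchanged)
```

When the quantized output is unchanged, the stored model is already identical to what the same computation would produce without the deleted point, so no further work is needed for that centroid.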

Future Directions

Extending deletion efficiency to more complex settings, such as deep neural networks and models trained with stochastic gradient descent, remains an open challenge. Exploring approximate deletion frameworks that leverage differential privacy could bridge current methodological gaps, addressing scenarios where exact deletion may be computationally or statistically prohibitive.

Conclusion

This paper lays essential groundwork for advancing machine learning practices aligned with legal and ethical standards for data privacy. By formalizing data deletion as a fundamental operation and proposing feasible solutions, the authors set the stage for future research that could integrate these principles into a broader range of applications, fostering safe, adaptable, and privacy-conscious machine learning technologies.