- The paper demonstrates that the PRU method effectively projects update vectors for precise deletion while maintaining model integrity.
- It reduces online computational cost, using techniques such as the Sherman-Morrison-Woodbury formula to avoid retraining from scratch.
- Experimental results validate that the method achieves accuracy comparable to full retraining, even extending to nonlinear models.
An Analysis of the Proposed Data Deletion Methodology
The paper presents a novel approach for data deletion within machine learning models, focusing on both theoretical underpinnings and practical applications. The core contribution lies in the development of a method that efficiently handles data deletion requests without necessitating full model retraining. This work is especially relevant in contexts such as privacy, where specific data points, often representing outliers or minority groups, must be removed from predictive models in accordance with contemporary data protection regulations.
Theoretical Foundations and Methodological Advancements
The method is grounded in robust mathematical principles, particularly the Projection Residuals Update (PRU) technique, which projects the exact parameter update onto the subspace spanned by the feature vectors of the deleted points. The results indicate that PRU consistently achieves lower parameter error than competing approximate-update baselines that operate over the full dataset. The size of the improvement is dataset-dependent, but theoretical guarantees show that, under certain conditions, PRU's accuracy is on par with exact retraining.
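The projection step at the heart of PRU can be illustrated in isolation. The sketch below (an assumption-laden illustration, not the paper's full algorithm) projects a hypothetical update vector onto the subspace spanned by the deleted points' feature vectors; the residual left over is, by construction, orthogonal to every deleted feature vector.

```python
import numpy as np

def project_onto_deleted_span(update, X_del):
    """Project an update vector onto the subspace spanned by the feature
    vectors of the deleted points (the rows of X_del).

    Illustrative sketch of the projection idea only, assuming the deleted
    feature vectors are linearly independent."""
    # Orthonormal basis for the row space of X_del via QR on its transpose
    Q, _ = np.linalg.qr(X_del.T)   # Q: (d, k) with orthonormal columns
    return Q @ (Q.T @ update)      # orthogonal projection onto that span

rng = np.random.default_rng(0)
X_del = rng.normal(size=(3, 10))   # 3 deleted points, 10 features
v = rng.normal(size=10)            # stand-in for an exact parameter update
v_proj = project_onto_deleted_span(v, X_del)
# The residual v - v_proj is orthogonal to every deleted feature vector
print(np.allclose(X_del @ (v - v_proj), 0))
```

Because the projected update lives entirely in a k-dimensional subspace (k = number of deleted points), it can be computed without touching the remaining data, which is where the method's online savings come from.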
A significant aspect of the method is its applicability to nonlinear models. Though the paper primarily offers a proof of concept through numerical results, it suggests that the method is extensible beyond linear models, which is an exciting area for future empirical validation.
Computational Considerations
From a computational efficiency perspective, the paper distinguishes between preprocessing and online costs. The preprocessing phase, which forms and inverts the relevant Hessian matrices, carries the dominant asymptotic cost for both linear and logistic regression. The computational benefits of the method appear in the online phase, where PRU demonstrates reduced runtime without sacrificing accuracy. Notably, even against warm-start baselines, PRU maintains a computational advantage by bypassing the full passes over the remaining data that warm-start retraining requires.
The implementation relies on NumPy's numerical linear algebra routines, further augmented by the Sherman-Morrison-Woodbury formula for low-rank updates, balancing computational load against fidelity.
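The role of the Sherman-Morrison-Woodbury formula can be made concrete for least squares, where the Hessian is X^T X and deleting rows X_del is a rank-k downdate. The sketch below (a minimal illustration under that least-squares assumption, not the paper's code) updates the inverse Hessian in O(d^2 k) instead of re-inverting at O(d^3):

```python
import numpy as np

def downdate_inverse(H_inv, X_del):
    """Sherman-Morrison-Woodbury downdate of H^{-1} = (X^T X)^{-1} after
    removing the rows X_del, avoiding a fresh O(d^3) inversion.

    Identity used, with U = X_del.T:
      (H - U U^T)^{-1} = H^{-1} + H^{-1} U (I - U^T H^{-1} U)^{-1} U^T H^{-1}
    """
    U = X_del.T                    # (d, k)
    HiU = H_inv @ U                # H^{-1} U, shape (d, k)
    k = X_del.shape[0]
    # Solve the small k-by-k "capacitance" system instead of inverting it
    cap = np.linalg.solve(np.eye(k) - U.T @ HiU, HiU.T)
    return H_inv + HiU @ cap

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
H_inv = np.linalg.inv(X.T @ X)     # precomputed in the preprocessing phase
X_del = X[:2]                      # delete the first two points
H_inv_new = downdate_inverse(H_inv, X_del)
exact = np.linalg.inv(X[2:].T @ X[2:])
print(np.allclose(H_inv_new, exact))
```

Only the small k-by-k system depends on the number of deleted points, which is what keeps the online cost independent of the dataset size.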
Empirical Validation and Evaluation Metrics
Experimental results underscore the methodology’s efficacy by benchmarking against exact retraining using relative metrics for runtime and model performance. The L2 distance and FIT metrics serve as complementary evaluation criteria: L2 measures how far the updated parameters sit from the exactly retrained ones, while FIT probes whether the deleted points' influence has actually been removed from the model's behavior. Such evaluation is crucial for validating models in real-world applications where both average and fine-grained performance aspects matter.
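The L2 criterion is straightforward to compute. The helper below is an illustrative sketch (not the paper's evaluation code, and the FIT metric is not reproduced here): it compares a stale, un-updated model against an exact retrain on a small least-squares problem.

```python
import numpy as np

def l2_param_distance(w_approx, w_exact):
    """L2 distance between an approximately updated parameter vector and
    the exactly retrained one; smaller means the deletion update tracked
    full retraining more closely. Illustrative helper only."""
    return float(np.linalg.norm(w_approx - w_exact))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)
w_full, *_ = np.linalg.lstsq(X, y, rcond=None)           # trained on all data
w_exact, *_ = np.linalg.lstsq(X[5:], y[5:], rcond=None)  # exact retrain after deleting 5 points
print(l2_param_distance(w_full, w_exact) > 0.0)          # stale model differs from retraining
```

A deletion method is then judged by how much of that gap its cheap update closes relative to the cost of full retraining.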
Practical Implications and Future Directions
The paper posits that finding a universal data deletion technique is likely infeasible, prompting the need for a diverse arsenal of context-specific tools. The current method addresses a pivotal gap, particularly in settings dealing with privacy-related data deletion, where the target data often markedly differ from the bulk dataset. An important future direction highlighted is integrating PRU with warm-start methods to further optimize the data deletion process concerning the number of data points needing evaluation.
Additionally, the paper provides theoretical guarantees for outlier removal, showing the method remains effective even as the outlier group grows, a vital consideration for datasets where outliers can skew predictions unfavorably if not correctly managed.
Conclusion
Overall, this paper contributes a theoretically sound and computationally efficient method for data deletion, adeptly addressing privacy and performance concerns. The PRU method showcases potential for broader applicability across model types and lays the groundwork for future empirical and methodological expansions in data deletion techniques. The alignment of computational efficiency with theoretical robustness provides a strong foundation for advancing model updates in response to data deletion needs, a critical facet in the evolving landscape of machine learning ethics and regulatory compliance.