Differential Privacy and Machine Learning: a Survey and Review (1412.7584v1)

Published 24 Dec 2014 in cs.LG, cs.CR, and cs.DB

Abstract: The objective of machine learning is to extract useful information from data, while privacy is preserved by concealing information. Thus it seems hard to reconcile these competing interests. However, they frequently must be balanced when mining sensitive data. For example, medical research represents an important application where it is necessary both to extract useful information and protect patient privacy. One way to resolve the conflict is to extract general characteristics of whole populations without disclosing the private information of individuals. In this paper, we consider differential privacy, one of the most popular and powerful definitions of privacy. We explore the interplay between machine learning and differential privacy, namely privacy-preserving machine learning algorithms and learning-based data release mechanisms. We also describe some theoretical results that address what can be learned differentially privately and upper bounds of loss functions for differentially private algorithms. Finally, we present some open questions, including how to incorporate public data, how to deal with missing data in private datasets, and whether, as the number of observed samples grows arbitrarily large, differentially private machine learning algorithms can be achieved at no cost to utility as compared to corresponding non-differentially private algorithms.

Authors (3)
  1. Zhanglong Ji (1 paper)
  2. Charles Elkan (13 papers)
  3. Zachary C. Lipton (137 papers)
Citations (250)

Summary

Differential Privacy and Machine Learning: A Survey and Review

This paper provides a comprehensive survey and review of the intersection between differential privacy (DP) and machine learning (ML), focusing on privacy-preserving learning algorithms and learning-based data release mechanisms. The authors give a structured exposition of DP, a foundational formal definition of privacy, presenting its mathematical formulation and discussing current methods for protecting data privacy in ML applications.

Differential privacy guarantees that adding or removing any single individual's record changes the distribution of an algorithm's output by at most a bounded factor, so no one person's data significantly affects the result. The paper discusses essential techniques such as the Laplace mechanism, which adds calibrated noise to numeric query answers, and the exponential mechanism, which samples outputs with probability weighted by a utility score; both aim to preserve utility while ensuring privacy. The survey then covers the scenarios where DP mechanisms have been applied, prominently including supervised and unsupervised learning tasks such as classification, regression, clustering, and dimensionality reduction.
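
To make the noise calibration concrete, below is a minimal Python sketch of the Laplace mechanism for a scalar counting query. The dataset, sensitivity, and ε value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release an epsilon-differentially-private version of a numeric query answer.

    Noise is drawn from Laplace(0, sensitivity / epsilon), where sensitivity is
    the maximum change in the query caused by adding or removing one record.
    """
    rng = np.random.default_rng() if rng is None else rng
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release a count over 1,000 synthetic records.
# A counting query has sensitivity 1, since one person changes the count by at most 1.
data = np.random.binomial(1, 0.3, size=1000)  # hypothetical binary attribute
noisy_count = laplace_mechanism(float(data.sum()), sensitivity=1.0, epsilon=0.5)
print(f"true count: {data.sum()}, private count: {noisy_count:.1f}")
```

Smaller ε means stronger privacy and larger noise; for a fixed ε, queries with higher sensitivity require proportionally more noise.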

Key Findings

  • Supervised Learning: The paper examines differentially private versions of common models such as Naive Bayes, linear regression, and support vector machines (SVMs). For instance, objective perturbation in logistic regression preserves privacy by adding a random term to the optimization objective rather than to the data or the output (a simplified sketch appears after this list). A theoretical analysis establishes (α, β)-usefulness, indicating that the utility loss due to the privacy constraint is bounded and shrinks as the amount of data grows.
  • Unsupervised Learning: With an emphasis on clustering, especially k-means, the paper discusses how clusterings of well-separated data can be obtained through the sample-and-aggregate methodology, which controls sensitivity by aggregating results computed on disjoint subsamples, and how feature selection can be performed via the exponential mechanism with theoretical bounds on utility.
  • Dimensionality Reduction: For principal component analysis (PCA), mechanisms that perturb the selection of eigenvectors offer an alternative to straightforward iterative approaches. These mechanisms guarantee bounded suboptimality, assuming sufficient separation in the eigenspectrum.
  • Statistical Estimators: For maximum likelihood estimation and robust statistics, mechanisms exploit the low sensitivity of robustified estimators, for example by privately releasing trimmed estimates of sensitive parameters.
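
To illustrate the objective-perturbation idea referenced above, here is a minimal sketch for regularized logistic regression. It follows the standard recipe of adding a random linear term to the training objective, but the noise calibration is simplified (it omits the correction to ε that a rigorous privacy proof requires), and the function name, regularization value, and distributional details are assumptions for illustration rather than the authors' exact construction.

```python
import numpy as np
from scipy.optimize import minimize

def dp_logistic_regression(X, y, epsilon, lam=0.1, rng=None):
    """Simplified objective-perturbation sketch for logistic regression.

    Adds a random linear term b·w / n to the regularized objective, with the
    norm of b drawn so that larger epsilon means less perturbation.
    Assumes rows of X have L2 norm <= 1 and labels y are in {-1, +1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape

    # Perturbation vector: uniformly random direction, Gamma-distributed magnitude.
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    b = direction * rng.gamma(shape=d, scale=2.0 / epsilon)

    def objective(w):
        margins = y * (X @ w)
        logistic_loss = np.mean(np.logaddexp(0.0, -margins))  # log(1 + exp(-margin))
        return logistic_loss + 0.5 * lam * (w @ w) + (b @ w) / n

    return minimize(objective, np.zeros(d), method="L-BFGS-B").x
```

Because the noise enters the objective rather than the learned weights, the perturbed problem remains a smooth convex optimization, which is why objective perturbation often preserves more utility than adding noise to the final model.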

Discussion and Theoretical Insights

The authors summarize theoretical results suggesting that, with an appropriate choice of mechanism, problems that are learnable without privacy can also be learned under differential privacy constraints. The trade-off between privacy and utility is central: under realistic assumptions, the error of noise-addition mechanisms for models such as kernel SVMs and linear predictors converges to the non-private error rate as the number of samples grows.

The survey discusses existing challenges and open questions such as handling public data sources, missing data in private contexts, and potential for achieving privacy at no substantial cost to utility. Furthermore, the work suggests that there may be alignments between the goals of generalization in ML and differential privacy, proposing an avenue for future theoretical exploration.

Implications and Future Directions

The survey delineates a pivotal direction for incorporating privacy concerns directly into ML model development. The authors highlight significant existing work but also call for more sophisticated DP mechanisms that handle specific application-driven constraints such as temporal structure, relational structure, and unbalanced data. Another important question for future research is how generalization in ML interacts, possibly synergistically, with privacy mechanisms.

In sum, this survey offers a crucial distillation of known methods and open challenges at the intersection of differential privacy and machine learning, outlining a roadmap for future research and methodological development.
