Differentially Private Learning: Balancing Privacy and Utility
The paper "What Can We Learn Privately?" by Kasiviswanathan et al. provides a comprehensive examination of the intersection between machine learning and differential privacy. This discussion is crucial in the current digital landscape where large datasets containing sensitive information are prevalent, and the need to maintain individual privacy while extracting useful learning models from such data is imperative.
The paper formally defines and addresses the problem of private learning: learning concept classes while ensuring that the output distribution of the learning algorithm changes only slightly when any single data point is changed. This requirement is formalized through differential privacy, a strong and robust guarantee that holds even against adversaries with arbitrary auxiliary information.
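Formally (this is the standard definition, which the paper adopts): a randomized algorithm A is ϵ-differentially private if, for every pair of datasets D and D′ that differ in a single record and every set S of possible outputs,

```latex
\Pr[A(D) \in S] \;\le\; e^{\epsilon} \cdot \Pr[A(D') \in S].
```

Because the guarantee bounds the ratio of output probabilities, it holds regardless of what an adversary already knows and composes gracefully across multiple analyses.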
Key Contributions
Private Agnostic Learning
The paper introduces a generic private agnostic learner based on the exponential mechanism of McSherry and Talwar. The essential finding is that sampling a hypothesis with probability exponentially weighted by its empirical accuracy agnostically learns any finite concept class C with a sample size logarithmic in the cardinality of C. Privacy can therefore be maintained with only a moderate increase in sample complexity: the bound essentially matches the classical Occam's Razor theorem, apart from an additional dependence on the privacy parameter ϵ.
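To make the construction concrete, here is a minimal Python sketch of such a learner; the interface and names are illustrative assumptions, not the paper's pseudocode. Each hypothesis is scored by the number of examples it classifies correctly (a score with sensitivity 1, since changing one record moves any count by at most one), and a hypothesis is sampled with probability proportional to exp(ϵ · score / 2), the standard ϵ-private form of the exponential mechanism.

```python
import numpy as np

def exponential_mechanism_learner(data, concept_class, epsilon, rng=None):
    """Sketch of a private agnostic learner via the exponential mechanism.

    data: list of (x, y) pairs with labels y in {0, 1}.
    concept_class: finite list of hypotheses h mapping x -> {0, 1}.
    """
    rng = rng or np.random.default_rng()
    # Utility of each hypothesis: number of correctly classified examples.
    scores = np.array([sum(h(x) == y for x, y in data) for h in concept_class],
                      dtype=float)
    # The score has sensitivity 1, so exp(eps * score / 2) weighting
    # yields eps-differential privacy.
    logits = (epsilon / 2.0) * scores
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return concept_class[rng.choice(len(concept_class), p=probs)]
```

Because the sampling weights grow exponentially with the score, once the sample is large enough for empirical accuracies to separate the hypotheses, the near-optimal ones dominate the draw; this is where the log |C| sample bound comes from.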
Efficient Learning of Parity Functions
The researchers also present an efficient differentially private algorithm for learning parity functions. This result is pivotal because it dispels the conflation of learning with noise and learning under privacy constraints. Parity functions are notoriously hard to learn in the presence of random classification noise, as reflected in their exponential lower bound in the SQ model; yet the paper shows they can be learned efficiently under differential privacy, demonstrating that private learning and noise-tolerant learning are genuinely different notions.
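The computational heart of the learner is ordinary Gaussian elimination over GF(2), sketched below for noiseless examples; this is only the non-private core. The paper's private learner wraps exactly this step in a random-subsampling procedure (retaining each example with a fixed probability and occasionally declining to output), whose parameters and privacy analysis are omitted here.

```python
import numpy as np

def solve_parity(examples):
    """Recover a parity vector c from noiseless pairs (x, <x, c> mod 2)
    by Gaussian elimination over GF(2). Non-private core only."""
    X = np.array([x for x, _ in examples], dtype=np.uint8)
    y = np.array([b for _, b in examples], dtype=np.uint8)
    A = np.concatenate([X, y[:, None]], axis=1)  # augmented matrix [X | y]
    rows, cols = A.shape
    pivot = 0
    for col in range(cols - 1):
        # Find a row at or below `pivot` with a 1 in this column.
        hits = np.nonzero(A[pivot:, col])[0]
        if len(hits) == 0:
            continue  # free coordinate; no constraint in this direction
        A[[pivot, pivot + hits[0]]] = A[[pivot + hits[0], pivot]]
        # Clear this column from every other row (XOR = addition mod 2).
        for r in range(rows):
            if r != pivot and A[r, col]:
                A[r] ^= A[pivot]
        pivot += 1
        if pivot == rows:
            break
    # Read off one consistent solution, setting free coordinates to 0.
    c = np.zeros(cols - 1, dtype=np.uint8)
    for r in range(rows):
        nz = np.nonzero(A[r, :-1])[0]
        if len(nz) > 0:
            c[nz[0]] = A[r, -1]
    return c
```

Notably, the SQ lower bound for parity does not apply here: the private learner manipulates individual examples algebraically rather than working through noisy statistical estimates.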
Local Algorithms and Statistical Query (SQ) Learning
The paper bridges local algorithms and the SQ model through detailed equivalence proofs. Local algorithms, in which each individual's data is randomized independently before any analysis, have long practical precedent (randomized response) owing to their simplicity and strong privacy guarantees. The authors establish that a concept class is learnable by a local algorithm if and only if it is learnable in the SQ model, up to polynomial factors. This equivalence is significant because it connects two separately studied domains and clarifies the inherent capabilities and limitations of local differential privacy.
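One direction of this equivalence is simple to illustrate: a single statistical query E[f(x)] can be answered by a noninteractive local protocol in which every user applies randomized response to the single bit f(x), and the curator debiases the aggregate. The Python sketch below uses illustrative names; it is the textbook construction, not the paper's exact protocol.

```python
import numpy as np

def randomized_response_bit(bit, epsilon, rng):
    """Local randomizer: report the true bit with probability
    e^eps / (1 + e^eps), otherwise flip it. Each user runs this
    independently, so the curator never sees raw data."""
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def estimate_sq(predicate, data, epsilon, rng=None):
    """Estimate the statistical query E[predicate(x)] from locally
    randomized reports by inverting the known flipping probability."""
    rng = rng or np.random.default_rng()
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    reports = [randomized_response_bit(int(predicate(x)), epsilon, rng)
               for x in data]
    # E[report] = mu * p_truth + (1 - mu) * (1 - p_truth); solve for mu.
    return (np.mean(reports) - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)
```

Simulating an arbitrary local protocol by SQ queries, the harder direction, is what the paper's equivalence theorem supplies.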
Separation of Interactive and Noninteractive Learning
An important theoretical result is the demonstrated separation between interactive and noninteractive local algorithms. The paper constructs the masked-parity concept class, which is efficiently learnable through interactive local algorithms but requires exponentially more samples when constrained to noninteractive algorithms. This separation underscores the additional power provided by interactivity and highlights the practical constraints that noninteractive approaches might face.
Implications and Future Directions
The implications of these findings are manifold:
- Practical Feasibility: The generic private learner implies that anything learnable from a polynomial number of samples remains learnable under privacy constraints, at least information-theoretically, suggesting broad practical applicability.
- Nuances of Privacy vs. Noise: The distinction between learning with noise and private learning clarifies that tolerating random corruption of the data and being insensitive to any single record are separate requirements, a distinction critical for designing robust private learners.
- Adaptivity in Learning: The separation between interactive and noninteractive local models speaks to the structural advantage of allowing interaction in learning protocols, which, while more complex to deploy, can yield exponentially better sample efficiency.
Future work will likely explore the costs associated with differentially private learning, particularly within specific practical contexts such as high-dimensional data, streaming data, or real-time systems. Moreover, identifying additional classes of functions that can be efficiently privately learned and refining the bounds on required resources such as samples and computational time remain pertinent directions for research in this evolving field.
To conclude, the paper "What Can We Learn Privately?" by Kasiviswanathan et al. represents a significant step towards understanding private learning. By elucidating the theoretical foundations and practical algorithms, it opens avenues for deploying robust, privacy-preserving learning systems on sensitive data.