Learning Privately from Multiparty Data
The paper addresses a central problem in privacy-preserving machine learning: building an accurate, differentially private global classifier from data held by multiple parties, each of which keeps its own data private. Instead of sharing raw data, each party trains a local classifier, and these local classifiers are then combined to build a more accurate global classifier.
Methodology
The authors introduce a two-step method. In the first step, the ensemble of local classifiers generates pseudo-labels for auxiliary unlabeled data, transferring the ensemble's knowledge to these samples in the spirit of semi-supervised learning. In the second step, a global classifier is trained on the pseudo-labeled data, with differential privacy ensured through output perturbation.
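A minimal sketch of this two-step pipeline, assuming scikit-learn-style local models that predict labels in {0, 1}; the sample-duplication trick for the weighted loss and the Laplace noise calibration are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label(local_models, aux_X):
    """Step 1: each party's local classifier votes on the auxiliary
    unlabeled data; the fraction of positive votes is kept as a soft label."""
    votes = np.array([m.predict(aux_X) for m in local_models])  # shape (M, n), labels in {0, 1}
    return votes.mean(axis=0)                                   # alpha_i in [0, 1] per sample

def train_global_private(aux_X, alpha, n_parties, eps, lam=1.0, seed=0):
    """Step 2: weighted ERM on the soft labels, released via output
    perturbation. The noise scale is an assumed calibration of the form
    const / (lam * M * eps); the paper derives the exact sensitivity."""
    n = len(aux_X)
    # One standard way to implement a soft-label loss: duplicate each sample
    # with both labels, weighted by alpha and (1 - alpha) respectively.
    X = np.vstack([aux_X, aux_X])
    y = np.r_[np.ones(n), np.zeros(n)]
    w = np.r_[alpha, 1.0 - alpha]
    clf = LogisticRegression(C=1.0 / (lam * n)).fit(X, y, sample_weight=w)
    # Output perturbation. Per-coordinate Laplace noise is a simplification;
    # the paper calibrates noise to the L2 sensitivity of the minimizer,
    # which shrinks as the ensemble size M grows. Intercept left unperturbed
    # here for brevity.
    rng = np.random.default_rng(seed)
    scale = 2.0 / (lam * n_parties * eps)
    clf.coef_ = clf.coef_ + rng.laplace(0.0, scale, size=clf.coef_.shape)
    return clf
```

Note that the auxiliary data never needs ground-truth labels: only the parties' votes on it enter the training objective.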
The paper shows that combining ensemble decisions by majority voting yields a global sensitivity that does not shrink as parties are added: in the worst case a single party's vote can flip an auxiliary sample's label, so the perturbation required for differential privacy would overwhelm the trained model. To counter this, the authors propose a weighted empirical risk minimization approach with much lower sensitivity. The weighted loss uses the fraction of positive votes among the ensemble as an estimated class probability, a soft label that any individual classifier's decision can shift by only a small amount.
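Concretely, writing α(xᵢ) for the fraction of positive votes on auxiliary sample xᵢ, the weighted objective has (up to the paper's exact constants and regularization) the form

$$\hat{w} = \arg\min_{w}\ \frac{1}{n}\sum_{i=1}^{n}\Big[\alpha(x_i)\,\ell(w; x_i, +1) + \big(1-\alpha(x_i)\big)\,\ell(w; x_i, -1)\Big] + \frac{\lambda}{2}\,\|w\|^{2}.$$

Replacing one party's classifier moves each α(xᵢ) by at most 1/M, whereas a hard majority vote can flip a label outright; this is what drives the reduced sensitivity.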
Results and Analysis
The authors demonstrate on realistic tasks, including activity recognition, network intrusion detection, and malicious URL prediction, that the proposed method delivers high accuracy while preserving differential privacy. The theoretical results are strong, with a generalization error bound of O(ϵ⁻²M⁻²), where ϵ denotes the privacy budget and M is the number of parties. The excess error thus decays quadratically in both quantities, so large ensembles can substantially close the performance gap that privacy constraints typically impose.
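Schematically, and ignoring constants and the terms depending on the auxiliary sample size that the full theorem includes, the bound compares the expected risk of the released private classifier with that of its non-private counterpart:

$$\mathbb{E}\big[R(\hat{w}_{\mathrm{priv}})\big] - R(\hat{w}) = O\!\left(\frac{1}{\epsilon^{2} M^{2}}\right).$$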
Implications
The proposed approach is flexible, allowing local classifiers of various and even mixed types. The theoretical analysis shows that the privacy guarantee covers all samples held by a party, not just a single sample. Practically, this is advantageous in settings such as crowdsensing, where large numbers of participants each contribute data to a shared privacy-preserving analysis.
This research opens pathways for future work on privacy-preserving ensemble learning and on differential privacy mechanisms for non-traditional classifier types. While the paper focuses on binary classification, an extension section sketches how the approach carries over to multiclass problems, providing a solid base for building on its contributions.
Conclusion
This paper is an important contribution to multiparty data analysis in machine learning, balancing the competing demands of privacy and utility. By leveraging ensemble classifiers and auxiliary unlabeled data, it sidesteps the classic utility cost of differential privacy, enabling wide applicability in real-world scenarios where privacy concerns are paramount.