Learning Privately from Multiparty Data
The paper addresses a central problem in privacy-preserving machine learning: building an accurate, differentially private global classifier from data held by multiple parties, each of which keeps its own data private. Instead of sharing raw data, each party trains a local classifier, and these local classifiers are then combined to build a more accurate global classifier.
Methodology
The authors introduce a two-step method. In the first step, the ensemble of local classifiers generates pseudo-labels for auxiliary unlabeled data, transferring the ensemble's knowledge to these samples in the spirit of semi-supervised learning. In the second step, a global classifier is trained on the pseudo-labeled data, with differential privacy ensured through output perturbation.
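A minimal sketch of this two-step pipeline, assuming scikit-learn-style local models that predict labels in {0, 1}; the sample-duplication trick for the weighted loss and the Laplace noise calibration are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label(local_models, aux_X):
    """Step 1: each party's local classifier votes on the auxiliary
    unlabeled data; the fraction of positive votes is kept as a soft label."""
    votes = np.array([m.predict(aux_X) for m in local_models])  # shape (M, n), labels in {0, 1}
    return votes.mean(axis=0)                                   # alpha_i in [0, 1] per sample

def train_global_private(aux_X, alpha, n_parties, eps, lam=1.0, seed=0):
    """Step 2: weighted ERM on the soft labels, released via output
    perturbation. The noise scale is an assumed calibration of the form
    const / (lam * M * eps); the paper derives the exact sensitivity."""
    n = len(aux_X)
    # One standard way to implement a soft-label loss: duplicate each sample
    # with both labels, weighted by alpha and (1 - alpha) respectively.
    X = np.vstack([aux_X, aux_X])
    y = np.r_[np.ones(n), np.zeros(n)]
    w = np.r_[alpha, 1.0 - alpha]
    clf = LogisticRegression(C=1.0 / (lam * n)).fit(X, y, sample_weight=w)
    # Output perturbation. Per-coordinate Laplace noise is a simplification;
    # the paper calibrates noise to the L2 sensitivity of the minimizer,
    # which shrinks as the ensemble size M grows. Intercept left unperturbed
    # here for brevity.
    rng = np.random.default_rng(seed)
    scale = 2.0 / (lam * n_parties * eps)
    clf.coef_ = clf.coef_ + rng.laplace(0.0, scale, size=clf.coef_.shape)
    return clf
```

Note that the auxiliary data never needs ground-truth labels: only the parties' votes on it enter the training objective.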
The paper shows that combining ensemble decisions by majority voting yields a global sensitivity that does not shrink as parties are added: in the worst case a single party's vote can flip an auxiliary sample's label, so the perturbation required for differential privacy would overwhelm the trained model. To counter this, the authors propose a weighted empirical risk minimization approach with much lower sensitivity. The weighted loss uses the fraction of positive votes among the ensemble as an estimated class probability, a soft label that any individual classifier's decision can shift by only a small amount.
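Concretely, writing α(xᵢ) for the fraction of positive votes on auxiliary sample xᵢ, the weighted objective has (up to the paper's exact constants and regularization) the form

$$\hat{w} = \arg\min_{w}\ \frac{1}{n}\sum_{i=1}^{n}\Big[\alpha(x_i)\,\ell(w; x_i, +1) + \big(1-\alpha(x_i)\big)\,\ell(w; x_i, -1)\Big] + \frac{\lambda}{2}\,\|w\|^{2}.$$

Replacing one party's classifier moves each α(xᵢ) by at most 1/M, whereas a hard majority vote can flip a label outright; this is what drives the reduced sensitivity.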
Results and Analysis
The authors demonstrate on realistic tasks, including activity recognition, network intrusion detection, and malicious URL prediction, that the proposed method delivers high accuracy while preserving differential privacy. The theoretical results are strong, with a generalization error bound of O(ϵ⁻²M⁻²), where ϵ denotes the privacy budget and M is the number of parties. The excess error thus decays quadratically in both quantities, so large ensembles can substantially close the performance gap that privacy constraints typically impose.
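Schematically, and ignoring constants and the terms depending on the auxiliary sample size that the full theorem includes, the bound compares the expected risk of the released private classifier with that of its non-private counterpart:

$$\mathbb{E}\big[R(\hat{w}_{\mathrm{priv}})\big] - R(\hat{w}) = O\!\left(\frac{1}{\epsilon^{2} M^{2}}\right).$$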
Implications
The proposed approach is flexible, allowing local classifiers of various and even mixed types. The theoretical analysis shows that the privacy guarantee covers all samples held by a party, not just a single sample. Practically, this is advantageous in settings such as crowdsensing, where large numbers of participants each contribute data to a shared privacy-preserving analysis.
This research opens pathways for future work on privacy-preserving ensemble learning and on differential privacy mechanisms for non-traditional classifier types. While the paper focuses on binary classification, an extension section sketches how the approach carries over to multiclass problems, providing a solid base for building on its contributions.
Conclusion
This paper is an important contribution to multiparty data analysis in machine learning, balancing the competing demands of privacy and utility. By leveraging ensemble classifiers and auxiliary unlabeled data, it sidesteps the classic utility cost of differential privacy, enabling wide applicability in real-world scenarios where privacy concerns are paramount.