- The paper introduces a distributed method that combines secure multi-party computation with Gaussian noise to ensure differential privacy in Bayesian learning.
- It employs a Distributed Compute Algorithm to securely obfuscate and aggregate data without relying on a centralized trusted entity.
- Experiments in Bayesian linear regression demonstrate near-parity performance with non-distributed DP methods, highlighting scalability and strong privacy guarantees.
Differentially Private Bayesian Learning on Distributed Data
The need for privacy-preserving methods in machine learning is becoming increasingly critical, especially in sensitive applications like healthcare. This paper addresses that need by proposing a framework for differentially private (DP) Bayesian learning in a distributed setting. The authors introduce a method that combines secure multi-party computation (SMC) with the Gaussian mechanism to achieve DP without requiring a single trusted data aggregator.
Introduction
Differential privacy has emerged as a robust standard for privacy protection, offering mathematically rigorous bounds on what an adversary can learn about any individual, even when the adversary holds auxiliary information. However, traditional DP methods typically assume that a single trusted party has access to the entire data set in order to add the required noise, which creates a single point of failure and hence a significant privacy risk. This work removes that assumption by developing a method for DP Bayesian learning in a fully distributed setting.
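For reference, the centralized baseline that such a trusted party would run is the classical Gaussian mechanism: perturb a query result with noise calibrated to its L2-sensitivity. A minimal sketch in Python (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def gaussian_mechanism(query_result, l2_sensitivity, epsilon, delta, rng):
    """Release query_result + N(0, sigma^2 I). With
    sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon, this
    satisfies (epsilon, delta)-DP for 0 < epsilon < 1."""
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return query_result + rng.normal(0.0, sigma, size=np.shape(query_result))

rng = np.random.default_rng(0)
print(gaussian_mechanism(np.array([10.0, 20.0]), 1.0, 0.5, 1e-5, rng))
```

The distributed setting must reproduce this output distribution without any single party ever seeing the unperturbed result.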
Methodology
The paper presents a general strategy for combining SMC with DP Bayesian techniques so that analysis can be run across separate data holders, each possessing only a subset of the data. A novel Distributed Compute Algorithm (DCA) is introduced: an SMC scheme for privately computing sum queries. Under the DCA, each client perturbs its contribution with a share of Gaussian noise and splits the result among several independent Compute nodes, which aggregate the shares. The scheme assumes that communication between clients and Compute nodes is secured, and uses symmetric cryptography for efficiency.
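The sketch below simulates the core idea in the clear (the real protocol works over fixed-point values with encrypted shares; the names and parameters here are assumptions): each client adds its portion of the Gaussian noise, then additively secret-shares the noisy value across the Compute nodes, so no single node ever sees an individual contribution.

```python
import numpy as np

def client_shares(value, sigma_client, n_nodes, rng):
    """Perturb one client's value with its share of Gaussian noise,
    then split it into additive secret shares, one per Compute node."""
    noisy = value + rng.normal(0.0, sigma_client)
    # n_nodes - 1 random masks; the final share makes them sum to `noisy`.
    # (The real protocol uses fixed-point arithmetic modulo a large
    # integer; floats are used here only to keep the sketch short.)
    masks = rng.uniform(-1e6, 1e6, size=n_nodes - 1)
    return np.append(masks, noisy - masks.sum())

def dca_sum(values, sigma_total, n_nodes, rng):
    """Noisy sum query: every Compute node sees only one meaningless
    share per client, yet the recombined total equals
    sum(values) + N(0, sigma_total^2)."""
    sigma_client = sigma_total / np.sqrt(len(values))  # noise shares add up
    shares = np.array([client_shares(v, sigma_client, n_nodes, rng)
                       for v in values])
    node_totals = shares.sum(axis=0)  # each node sums the shares it holds
    return node_totals.sum()          # final aggregation of node totals

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, size=100)
print(dca_sum(data, sigma_total=1.0, n_nodes=5, rng=rng))  # ~500 + noise
```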
Because of the SMC layer, no single point in the computation chain ever holds all of the raw data, so privacy is maintained while the aggregate statistics needed for Bayesian inference can still be computed. The paper then demonstrates the approach on Bayesian linear regression (BLR), showing how a traditional Bayesian learning task can be carried out in this secure, distributed format.
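To make the BLR case concrete, here is a hedged sketch building on the `dca_sum` routine above (the prior precision, noise variance, and per-entry querying are assumptions; the paper additionally bounds each input to control query sensitivity). Conjugate BLR needs only the sufficient statistics X^T X and X^T y, so each entry can be released as a DP sum query over the clients' local statistics.

```python
import numpy as np  # reuses dca_sum from the sketch above

def private_blr_posterior(X_parts, y_parts, sigma_total, n_nodes, rng,
                          noise_var=1.0, prior_prec=1.0):
    """Gaussian posterior over regression weights computed from
    DP-perturbed sufficient statistics. Each client j holds (X_j, y_j);
    only noisy sums of x x^T and x y entries ever leave the clients."""
    d = X_parts[0].shape[1]
    xtx = np.empty((d, d))
    xty = np.empty(d)
    for a in range(d):
        # One DCA sum query per entry of the sufficient statistics.
        xty[a] = dca_sum([X[:, a] @ y for X, y in zip(X_parts, y_parts)],
                         sigma_total, n_nodes, rng)
        for b in range(d):
            xtx[a, b] = dca_sum([X[:, a] @ X[:, b] for X in X_parts],
                                sigma_total, n_nodes, rng)
    xtx = (xtx + xtx.T) / 2  # symmetrize after independent noisy queries
    # Standard conjugate update, run on the perturbed statistics.
    post_prec = prior_prec * np.eye(d) + xtx / noise_var
    post_mean = np.linalg.solve(post_prec, xty / noise_var)
    return post_mean, post_prec
```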
Numerical Results and Implications
Applying the algorithm to Bayesian linear regression, the researchers report efficient performance across a range of data distributions. Projecting the data onto a smaller range reduces the sensitivity of the sum queries, and hence the amount of noise that must be added, at the cost of some bias; this gives an explicit handle on the bias-variance trade-off. A noise scaling factor in the presented algorithms ensures that the privacy guarantees hold even as the number of colluding data holders increases.
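The scaling follows the standard argument (the constant below matches the sketches above and is an assumption, not the paper's exact formulation): if up to T of the N clients may collude and subtract their own noise shares from the output, the noise of the remaining N - T honest clients must by itself reach the target variance, so each client scales its share up accordingly.

```python
import numpy as np

def per_client_noise_std(sigma_total, n_clients, n_colluding):
    """Std of each client's Gaussian noise share so that the honest
    clients' noise alone sums to variance sigma_total^2, even if
    n_colluding clients subtract their own shares from the output."""
    honest = n_clients - n_colluding
    if honest < 1:
        raise ValueError("at least one honest client is required")
    return sigma_total / np.sqrt(honest)

print(per_client_noise_std(1.0, 10, 0))  # ~0.316: sigma / sqrt(10)
print(per_client_noise_std(1.0, 10, 3))  # ~0.378: sigma / sqrt(7)
```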
The results, demonstrated on both simulated and real-world datasets, show that the distributed method achieves near parity with non-distributed DP implementations. This underlines that such privacy-preserving methods can be deployed in real-world scenarios without sacrificing efficiency or prediction accuracy.
Future Directions
This methodology opens avenues for collaborative machine learning on sensitive data where privacy is paramount. Future work could further optimize the noise mechanisms to improve model accuracy, or extend the techniques to other settings that require secure data collaboration. On the scalability side, managing the computational overhead as dataset size and model complexity grow is a natural direction. Given the growing interest in federated data analysis, these contributions are a concrete step toward privacy-preserving learning on distributed infrastructure.
In summary, the proposed framework offers a promising approach to implementing DP Bayesian learning securely across distributed data holders, substantially reducing the risks inherent in centralized models and paving the way for more secure, scalable, and privacy-compliant machine learning solutions.