
Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data (2402.00205v2)

Published 31 Jan 2024 in cs.LG and cs.CR

Abstract: Machine Learning (ML) has demonstrated its great potential on medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. Sharing data across different healthcare institutions is challenging because of complex and varying privacy and regulatory requirements. Hence, it is hard but crucial to allow multiple parties to collaboratively train an ML model leveraging the private datasets available at each party without the need for direct sharing of those datasets or compromising the privacy of the datasets through collaboration. In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). It offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets; (2) it safeguards patient privacy by limiting the potential privacy leakage arising from any contents shared across the parties during the training process; and (3) it facilitates the ML model training without relying on a centralized server. We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. We demonstrate that the ML models trained with DeCaPH framework have an improved utility-privacy trade-off, showing it enables the models to have good performance while preserving the privacy of the training data points. In addition, the ML models trained with DeCaPH framework in general outperform those trained solely with the private datasets from individual parties, showing that DeCaPH enhances the model generalizability.

Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH): Enhancing Model Generalizability without Compromising Data Privacy

Introduction to DeCaPH

The emergence of collaborative ML models in healthcare research signifies a pivotal shift towards leveraging diverse and voluminous datasets to enhance model accuracy and generalizability. However, the fundamental challenge lies in harmonizing the benefits of collaborative learning with the stringent demands of data privacy and regulatory compliance across different healthcare institutions. Addressing this challenge, the paper presents the Decentralized, Collaborative, and Privacy-preserving Machine Learning framework for Multi-Hospital Data (DeCaPH), designed to enable collaborative ML training across multiple institutions without necessitating direct data sharing or infringing upon the privacy of the datasets involved.

Framework Overview

DeCaPH is underpinned by a set of key principles:

  • Decentralization and Privacy Preservation: By circumventing the need for a centralized data repository and integrating differential privacy, DeCaPH ensures that patient data remains confidential and secure against potential privacy breaches.
  • Collaborative Learning with Data Diversification: The framework facilitates ML model training across disparate datasets hosted by different hospitals, enhancing the model's generalizability and performance.
  • Differential Privacy Guarantees: DeCaPH adheres to the differential privacy (DP) paradigm, a rigorous standard for privacy protection, bounding the information that can leak about any individual data point during the training process.
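For reference, the standard (ε, δ)-differential privacy guarantee from the DP literature (not reproduced from the paper itself) can be stated as:

```latex
% A randomized mechanism M is (\epsilon, \delta)-differentially private if,
% for all neighbouring datasets D and D' differing in a single record, and
% for every measurable set S of outputs:
\Pr[M(D) \in S] \;\le\; e^{\epsilon}\,\Pr[M(D') \in S] + \delta
```

Smaller ε and δ mean that the presence or absence of any one patient's record has provably little influence on anything the training procedure releases.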

Methodological Innovations

DeCaPH incorporates several methodological innovations to achieve its objectives:

  1. Randomized Leader Selection: Ensures a flexible and dynamic coordination mechanism for model updates without a central server, enhancing the framework's robustness and scalability.
  2. Secure Aggregation: Employs cryptographic secure aggregation to merge model updates from different hospitals, ensuring that the process is impervious to information leakage.
  3. Gradient Clipping and Noise Addition: Integrates DP mechanisms into the gradient updates, which are fundamental to the model training process, by clipping gradients and adding calibrated noise, thereby providing theoretical guarantees on the privacy of individual data points.
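The interplay of the last two innovations can be illustrated with a toy sketch: each party clips and noises its gradient locally (DP-SGD style), then adds pairwise-cancelling masks before sharing, so an aggregator only ever sees masked updates. This is a simplified stand-in, not the paper's protocol: real secure aggregation derives the masks from cryptographically shared pairwise secrets, and the clipping norm and noise multiplier below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_and_noise(grad, clip_norm=1.0, noise_mult=1.1):
    """Clip a gradient to clip_norm and add Gaussian noise, DP-SGD style."""
    norm = np.linalg.norm(grad)
    clipped = grad / max(1.0, norm / clip_norm)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=grad.shape)
    return clipped + noise

def pairwise_masks(n_parties, dim, seed=42):
    """Pairwise-cancelling masks: for each pair (i, j) with i < j, party i
    adds a mask and party j subtracts it, so all masks sum to zero."""
    mask_rng = np.random.default_rng(seed)  # stand-in for shared pairwise secrets
    masks = [np.zeros(dim) for _ in range(n_parties)]
    for i in range(n_parties):
        for j in range(i + 1, n_parties):
            m = mask_rng.normal(size=dim)
            masks[i] += m
            masks[j] -= m
    return masks

dim, parties = 4, 3
local_grads = [rng.normal(size=dim) for _ in range(parties)]

# Each party privatizes its update locally, then masks it before sharing.
private = [clip_and_noise(g) for g in local_grads]
masks = pairwise_masks(parties, dim)
shared = [p + m for p, m in zip(private, masks)]

# The leader sees only masked updates; the masks cancel in the aggregate.
aggregate = np.sum(shared, axis=0) / parties
assert np.allclose(aggregate, np.sum(private, axis=0) / parties)
```

Because the masks cancel only in the sum, no individual party's (already noised) gradient is ever exposed, while the leader still recovers the correct average.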

Empirical Evaluation

The efficacy of DeCaPH was rigorously evaluated on three real-world healthcare tasks: mortality prediction from electronic health records, single-cell genome classification, and chest radiology image analysis. Comparisons were drawn against models trained locally on each hospital's own dataset and against other collaborative frameworks. The results demonstrated the superior accuracy and generalizability of DeCaPH-trained models, with only a modest performance trade-off for ensuring differential privacy. Specifically, the models exhibited a drop of less than 3.2% in performance metrics compared to non-private collaborative models, while achieving up to a 16% decrease in vulnerability to privacy attacks.
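Vulnerability to privacy attacks is typically measured with membership inference. A minimal sketch of the classic loss-threshold attack (in the spirit of the membership-inference literature, not the paper's exact evaluation; the loss distributions below are synthetic) shows why DP training helps: it pushes the loss distributions of training and held-out records together, driving the attack toward chance accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)

def loss_threshold_attack(member_losses, nonmember_losses, threshold):
    """Predict 'member' when a record's loss falls below the threshold;
    lower loss on training records is the signal this attack exploits."""
    preds_m = member_losses < threshold       # members correctly flagged
    preds_n = nonmember_losses < threshold    # non-members wrongly flagged
    correct = preds_m.sum() + (~preds_n).sum()
    return correct / (len(member_losses) + len(nonmember_losses))

# Synthetic losses: an overfit model has visibly lower loss on members.
members = rng.normal(0.2, 0.1, 1000)       # training-set losses
nonmembers = rng.normal(0.6, 0.2, 1000)    # held-out losses
acc = loss_threshold_attack(members, nonmembers, threshold=0.4)
# Accuracy well above 0.5 indicates privacy leakage; under DP training the
# two distributions overlap more and the attack approaches a coin flip.
```

Reporting attack accuracy (or true-positive rate at a fixed false-positive rate) alongside task performance is what yields the utility-privacy trade-off curves the paper discusses.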

Implications and Future Perspectives

The findings underscore the potential of DeCaPH to facilitate large-scale collaborative ML projects within the healthcare domain, offering a pragmatic pathway to leveraging the collective utility of multi-institutional healthcare datasets while simultaneously upholding stringent privacy protections.

The implications extend beyond academic discourse, promising substantial benefits to clinical research and patient care alike. By enabling the development of more accurate and generalizable ML models, DeCaPH can significantly enhance the predictive capabilities in various clinical applications, ranging from disease diagnosis to patient outcome prediction.

Looking ahead, the framework opens avenues for further advancements in decentralized learning protocols, exploring scalable solutions to integrate heterogeneous data sources while navigating the complex tapestry of privacy regulations and ethical considerations in healthcare data utilization. The exploration of vertical data integration, adaptation to various learning paradigms, and enhancement in privacy-preserving mechanisms stand out as promising future research directions.

In conclusion, DeCaPH emerges as a compelling framework that balances the scales between the collaborative utility of diverse healthcare datasets and the imperatives of data privacy, marking a step forward in the pursuit of advanced AI-driven healthcare solutions.

Authors (8)
  1. Congyu Fang
  2. Adam Dziedzic
  3. Lin Zhang
  4. Laura Oliva
  5. Amol Verma
  6. Fahad Razak
  7. Nicolas Papernot
  8. Bo Wang