Patient Clustering Improves Efficiency of Federated Machine Learning to predict mortality and hospital stay time using distributed Electronic Medical Records (1903.09296v1)

Published 22 Mar 2019 in cs.LG and stat.ML

Abstract: Electronic medical records (EMRs) support the development of machine learning algorithms for predicting disease incidence, patient response to treatment, and other healthcare events. But so far most algorithms have been centralized, taking little account of the decentralized, non-independent and identically distributed (non-IID), and privacy-sensitive characteristics of EMRs that can complicate data collection, sharing and learning. To address this challenge, we introduced a community-based federated machine learning (CBFL) algorithm and evaluated it on non-IID ICU EMRs. Our algorithm clustered the distributed data into clinically meaningful communities that captured similar diagnoses and geographical locations, and learnt one model for each community. Throughout the learning process, the data was kept local at hospitals, while locally-computed results were aggregated on a server. Evaluation results show that CBFL outperformed the baseline FL algorithm in terms of Area Under the Receiver Operating Characteristic Curve (ROC AUC), Area Under the Precision-Recall Curve (PR AUC), and communication cost between hospitals and the server. Furthermore, communities' performance difference could be explained by how dissimilar one community was to others.

Citations (332)

Summary

The paper demonstrates that clustering ICU patient data in a federated learning framework improves mortality prediction with higher ROC AUC and fewer communication rounds.
It employs a denoising autoencoder and k-means clustering to efficiently group non-IID data while maintaining patient privacy.
Results show that CBFL approaches, but does not match, centralized model accuracy, while offering a viable option for secure, distributed healthcare analytics.

Insights into Federated Machine Learning via Patient Clustering for ICU EMRs

This paper introduces a Community-Based Federated Learning (CBFL) approach aimed at addressing the non-IID nature of Electronic Medical Records (EMRs) used in ML applications. By leveraging decentralized data from Intensive Care Unit (ICU) patients, CBFL presents a framework that both preserves patient privacy and enhances model performance in predicting mortality and hospital stay times. The paper clearly outlines the model development, experimental setup, and benchmarking against traditional Federated Learning (FL).

Methodology

The authors used the eICU Collaborative Research Database, which contains rich, multi-dimensional ICU patient records from numerous hospitals across the United States. CBFL differs from standard FL in that it first clusters patient data into communities based on shared clinical features and geographical locations, and then trains a separate model for each community.

Three primary steps compose the CBFL methodology:

Encoder training using a denoising autoencoder across local hospital datasets without sharing raw data.
K-means clustering of these encoded datasets to organize patients into meaningful communities.
Community-based learning where each community's model refines its parameters locally before sending updates back to a central server for aggregation.
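The clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the encoder has already produced fixed-length vectors, uses synthetic data in place of encoded EMRs, and swaps in a deterministic farthest-first initialisation so the toy example is reproducible. The names `farthest_first_init`, `assign`, and `kmeans` are hypothetical.

```python
import numpy as np

def farthest_first_init(points, k, rng):
    """Pick one random point, then repeatedly add the point farthest from all chosen centroids."""
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(points[d.argmax()])
    return np.array(centroids)

def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = farthest_first_init(points, k, rng)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Synthetic stand-in for encoded patient vectors: two well-separated groups (assumption).
rng = np.random.default_rng(42)
encoded = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 4)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 4)),
])
centroids, communities = kmeans(encoded, k=2)
print(communities)
```

Each resulting label plays the role of a community assignment; in a federated setting the centroids, not the raw records, would be what the server distributes.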

Raw patient records never leave the hospitals; only encoded representations and model updates are transmitted to the server. The experiments used standard ML components, such as neural networks with ReLU activations and the Adam optimizer.
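A rough sketch of community-based learning is given below. It is an illustration under stated assumptions rather than the paper's code: logistic regression with plain gradient descent stands in for the neural networks and Adam optimizer mentioned above, aggregation is a sample-weighted average in the style of federated averaging, and the data are synthetic. The function names `local_update`, `federated_round`, and `train_communities` are hypothetical.

```python
import numpy as np

def local_update(weights, X, y, lr=0.5, epochs=5):
    """A few gradient-descent steps of logistic regression on one hospital's data."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (preds - y) / len(y)
    return w

def federated_round(weights, shards):
    """One aggregation round: sample-weighted average of locally updated weights."""
    updates, sizes = [], []
    for X, y in shards:
        if len(y) == 0:
            continue  # this hospital has no patients in the community
        updates.append(local_update(weights, X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=sizes)

def train_communities(hospitals, n_communities, n_features, rounds=20):
    """hospitals: list of (X, y, community_labels) tuples, one per hospital."""
    models = {}
    for c in range(n_communities):
        # Each hospital contributes only its patients assigned to community c.
        shards = [(X[lab == c], y[lab == c]) for X, y, lab in hospitals]
        w = np.zeros(n_features)
        for _ in range(rounds):
            w = federated_round(w, shards)
        models[c] = w
    return models

# Synthetic data (assumption): the effect of feature 0 flips sign between communities,
# so a single shared model would fit neither community well.
rng = np.random.default_rng(0)
hospitals = []
for _ in range(3):
    X = rng.normal(size=(60, 2))
    lab = rng.integers(0, 2, size=60)
    y = np.where(lab == 0, X[:, 0] > 0, X[:, 0] < 0).astype(float)
    hospitals.append((X, y, lab))

models = train_communities(hospitals, n_communities=2, n_features=2)
print(models[0], models[1])
```

Only the weight vectors cross the hospital boundary in this sketch; the per-hospital arrays `X` and `y` are touched only inside `local_update`.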

Results and Analysis

CBFL was evaluated on two prediction tasks: mortality and prolonged ICU stay. The authors reported higher accuracy and lower communication cost between hospitals and the server than the baseline FL model. Highlights of the results include:

Mortality Prediction: With five communities, CBFL achieved its maximum ROC AUC of 0.6984, compared with 0.6895 for FL, and needed fewer communication rounds (75 vs. 101 for FL).
ICU Stay Time Prediction: Similar benefits were observed, with CBFL again improving ROC AUC and communication efficiency over FL.
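To put the reported mortality figures in proportion, the short calculation below derives the absolute ROC AUC gain and the relative reduction in communication rounds. It uses only the four numbers quoted above; the variable names are illustrative.

```python
cbfl_auc, fl_auc = 0.6984, 0.6895
cbfl_rounds, fl_rounds = 75, 101

auc_gain = cbfl_auc - fl_auc                 # absolute ROC AUC difference
rounds_saved = fl_rounds - cbfl_rounds       # fewer rounds needed by CBFL
relative_saving = rounds_saved / fl_rounds   # fraction of FL's rounds avoided

print(f"ROC AUC gain: {auc_gain:.4f}")                                # 0.0089
print(f"Rounds saved: {rounds_saved} ({relative_saving:.1%} fewer)")  # 26 (25.7% fewer)
```

The accuracy gain is small in absolute terms, while the saving in communication rounds is roughly a quarter.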

However, CBFL did not match centralized learning, the gold standard for predictive accuracy. This indicates that pooling all data in one place still yields an accuracy advantage, at the cost of privacy.

Implications and Future Work

The work advances federated ML for healthcare, a domain particularly sensitive to data privacy. The insights from CBFL have practical implications, including more accurate models trained on diverse medical datasets, preserved patient privacy, and the potential to inform treatment decisions across geographically distributed hospitals.

Future research opportunities are envisaged in optimizing communication loads, incorporating additional patient dimensions, and refining patient clustering algorithms to accommodate broader patient characteristics such as age and clinical metrics.

Conclusion

This paper presents CBFL as a practical response to the challenges of using non-IID patient data in federated environments. Its ability to approach centralized learning accuracy while keeping patient data local is a useful contribution to medical informatics. The model opens avenues for further exploration in broader biomedical applications while addressing industry requirements for data privacy and security.
