Federated Uncertainty-Aware Aggregation for Fundus Diabetic Retinopathy Staging

Published 23 Mar 2023 in eess.IV and cs.CV | (2303.13033v2)

Abstract: Deep learning models have shown promising performance in the field of diabetic retinopathy (DR) staging. However, collaboratively training a DR staging model across multiple institutions remains a challenge due to non-iid data, client reliability, and confidence evaluation of the prediction. To address these issues, we propose a novel federated uncertainty-aware aggregation paradigm (FedUAA), which considers the reliability of each client and produces a confidence estimation for the DR staging. In our FedUAA, an aggregated encoder is shared by all clients for learning a global representation of fundus images, while a novel temperature-warmed evidential uncertainty (TWEU) head is utilized for each client for local personalized staging criteria. Our TWEU employs an evidential deep layer to produce the uncertainty score with the DR staging results for client reliability evaluation. Furthermore, we developed a novel uncertainty-aware weighting module (UAW) to dynamically adjust the weights of model aggregation based on the uncertainty score distribution of each client. In our experiments, we collect five publicly available datasets from different institutions to construct a dataset for federated DR staging to satisfy the real non-iid condition. The experimental results demonstrate that our FedUAA achieves better DR staging performance with higher reliability compared to other federated learning methods. Our proposed FedUAA paradigm effectively addresses the challenges of collaboratively training DR staging models across multiple institutions, and provides a robust and reliable solution for the deployment of DR diagnosis models in real-world clinical scenarios.

Summary

The paper presents FedUAA, which integrates a globally shared encoder with a local Temperature-Warmed Evidential Uncertainty head to jointly predict DR stage and uncertainty.
It employs uncertainty-aware weighting and temperature scaling to dynamically adjust client contributions based on reliability metrics, addressing the challenges of non-iid data.
Empirical evaluations demonstrate superior AUC performance and robustness under noise, establishing FedUAA as a reliable benchmark for clinical federated learning.

Federated Uncertainty-Aware Aggregation for Robust Fundus Diabetic Retinopathy Staging

Motivation and Problem Setting

Diabetic retinopathy (DR) staging via deep learning has advanced significantly, yet real-world clinical deployment faces two persistent barriers: (1) privacy-preserving model training across multi-institutional, inherently non-iid data, and (2) quantifying the reliability of automated staging results. Existing federated learning (FL) frameworks mitigate data isolation but overlook inter-client heterogeneity and neglect predictive confidence estimation, which is vital in high-stakes clinical decision-making. This work addresses these limitations by formulating an FL paradigm that is both uncertainty-aware and dynamically responsive to client reliability.

Framework Overview and Methodology

The proposed framework, Federated Uncertainty-Aware Aggregation (FedUAA), enables collaborative DR staging across distributed institutional cohorts with distinct DR grading criteria and heterogeneous data distributions. All clients share a global encoder, which keeps high-level representation learning aligned, while each client retains a personalized local head modeled via a Temperature-Warmed Evidential Uncertainty (TWEU) module. The FedUAA architecture is depicted in Figure 1.

Figure 1: FedUAA overview and TWEU module; the encoder is globally aggregated, while TWEU personalizes local staging and uncertainty. UAW dynamically calibrates aggregation weights based on client reliability.

The core components are:

Shared Encoder with Local TWEU Head: The encoder is globally aggregated to maintain a consistent latent space, but each client retains a TWEU head, which jointly predicts DR stage and an explicit uncertainty score. TWEU is based on an evidential deep learning head parameterized by a Dirichlet distribution, directly outputting per-class belief masses and a total uncertainty value.
Temperature Scaling: To overcome over-smoothing from Dirichlet regularization and sharpen belief mass assignment, temperature scaling is applied before the final cross-entropy loss.
Uncertainty-Aware Weighting (UAW): Instead of conventional static aggregation (e.g., FedAvg), UAW dynamically assigns aggregation weights to clients proportional to their empirical reliability, quantified via the Youden index computed on the ROC curve of the client’s uncertainty distribution versus ground truth mispredictions.
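The quantities a Dirichlet-based evidential head emits can be sketched as follows. This is a minimal sketch assuming the common evidential deep learning parameterization (non-negative evidence $e_k$, Dirichlet parameters $\alpha_k = e_k + 1$, strength $S = \sum_k \alpha_k$, belief $b_k = e_k / S$, uncertainty $u = K / S$). The function name `dirichlet_outputs`, the softplus evidence activation, and the example logits are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def dirichlet_outputs(logits):
    """Map raw head outputs to belief masses, total uncertainty, and class probabilities.

    Assumed parameterization (standard evidential deep learning, not
    necessarily the exact TWEU head): alpha = evidence + 1.
    """
    logits = np.asarray(logits, dtype=float)
    # Softplus keeps the evidence non-negative (assumed activation).
    evidence = np.logaddexp(0.0, logits)
    alpha = evidence + 1.0
    strength = alpha.sum(axis=-1, keepdims=True)   # Dirichlet strength S
    belief = evidence / strength                   # per-class belief mass b_k
    num_classes = logits.shape[-1]
    uncertainty = num_classes / strength.squeeze(-1)  # total uncertainty u = K / S
    prob = alpha / strength                        # expected class probability
    return belief, uncertainty, prob

# Hypothetical logits for two images over five DR stages.
belief, uncertainty, prob = dirichlet_outputs(
    [[10.0, -5.0, -5.0, -5.0, -5.0],   # strong evidence for stage 0
     [-5.0, -5.0, -5.0, -5.0, -5.0]]   # almost no evidence for any stage
)
```

Under this parameterization the belief masses and the uncertainty sum to one for every sample, so a prediction backed by little evidence automatically receives an uncertainty close to one.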

Formally, given $N$ clients, each with a model $f_i(\varphi_i, \psi_i)$ (encoder $\varphi_i$, head $\psi_i$) and data $(X_i, Y_i)$, the FedUAA objective is to minimize the aggregated client loss, with contribution weights $w_i$ obtained by applying a softmax to the client reliability thresholds $\theta_i$.
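One way to read the weighting step is sketched below: each client finds the uncertainty threshold that maximizes the Youden index ($J = \text{TPR} - \text{FPR}$) for separating mispredictions from correct predictions, the thresholds are passed through a softmax, and the encoders are averaged with the resulting weights. The helper names (`youden_threshold`, `softmax_weights`, `aggregate_encoders`), the dictionary-of-arrays parameter format, and the toy numbers are hypothetical. Whether the original method applies the softmax to $\theta_i$ directly or to a transformed value is not stated here, so treat the direction of the weighting as an assumption.

```python
import numpy as np

def youden_threshold(uncertainty, is_wrong):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    A sample is flagged as a likely misprediction when its
    uncertainty is greater than or equal to the threshold.
    """
    u = np.asarray(uncertainty, dtype=float)
    wrong = np.asarray(is_wrong, dtype=bool)
    if wrong.all() or not wrong.any():
        raise ValueError("need both correct and incorrect predictions")
    best_t, best_j = None, -np.inf
    for t in np.unique(u):
        flagged = u >= t
        tpr = flagged[wrong].mean()    # mispredictions that were flagged
        fpr = flagged[~wrong].mean()   # correct predictions that were flagged
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return float(best_t), float(best_j)

def softmax_weights(thetas):
    """Normalize per-client thresholds into aggregation weights."""
    z = np.asarray(thetas, dtype=float)
    z = z - z.max()                    # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def aggregate_encoders(client_params, weights):
    """Weighted average of encoder parameters (dicts of arrays); heads stay local."""
    keys = client_params[0].keys()
    return {k: sum(w * p[k] for w, p in zip(weights, client_params)) for k in keys}

# Toy example with two clients (all values are made up for illustration).
theta_a, _ = youden_threshold([0.1, 0.2, 0.3, 0.8, 0.9], [0, 0, 0, 1, 1])
theta_b, _ = youden_threshold([0.2, 0.4, 0.5, 0.6], [0, 1, 0, 1])
weights = softmax_weights([theta_a, theta_b])
global_encoder = aggregate_encoders(
    [{"conv.weight": np.ones(3)}, {"conv.weight": np.zeros(3)}], weights
)
```

Only the encoder parameters pass through `aggregate_encoders`; in the setup described above, each client's TWEU head is kept local and is not averaged.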

Loss Functions

Training is governed by two losses: an augmented cross-entropy loss $L_{\text{Uce}}$, defined as the Dirichlet-expectation cross-entropy plus an annealed KL divergence term, and a temperature-warmed cross-entropy $L_{\text{Tce}}$. Together, these promote calibrated evidence assignments and penalize overconfident misclassifications, both essential for robust uncertainty quantification. An annealing schedule gradually increases the weight of the KL regularization during training.
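For orientation, the Dirichlet-expectation cross-entropy with an annealed KL term is commonly written as below. This follows the widely used evidential deep learning formulation and is an assumption about the exact form used here. $\Psi$ denotes the digamma function (capitalized to avoid confusion with the head parameters $\psi_i$), $y$ is the one-hot label over $K$ stages, $t$ is the current epoch, and the annealing horizon $T$ is a hypothetical hyperparameter. The exact form of $L_{\text{Tce}}$ and its temperature are not given in this summary and are therefore omitted.

$$
\alpha_k = e_k + 1, \qquad S = \sum_{k=1}^{K} \alpha_k, \qquad b_k = \frac{e_k}{S}, \qquad u = \frac{K}{S}
$$

$$
L_{\text{Uce}} = \sum_{k=1}^{K} y_k \left( \Psi(S) - \Psi(\alpha_k) \right) + \lambda_t \, \mathrm{KL}\!\left[ D(\mathbf{p} \mid \tilde{\boldsymbol{\alpha}}) \,\big\|\, D(\mathbf{p} \mid \mathbf{1}) \right], \qquad \tilde{\boldsymbol{\alpha}} = \mathbf{y} + (1 - \mathbf{y}) \odot \boldsymbol{\alpha}, \qquad \lambda_t = \min\!\left(1, \frac{t}{T}\right)
$$

The KL term pulls the evidence for the non-target classes toward zero, and the increasing coefficient $\lambda_t$ lets the network first learn to classify before that penalty takes full effect.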

Empirical Evaluation and Analysis

The evaluation leverages five publicly available heterogeneous DR datasets, each mapped to a federated client to best reflect real clinical non-iid environments. The model demonstrates consistently superior AUCs across all clients and in aggregate, outperforming a comprehensive suite of state-of-the-art FL baselines including FedBN, FedProx, and FedRep.

Key numerical findings include:

FedUAA achieves the highest average AUC (0.8636) across clients, a 1.48% absolute gain over the best previous FL baseline (FedBN).
Notably, on clients characterized by smaller data volumes or pronounced heterogeneity (e.g., Messidor, DRR, IDRiD), FedUAA outperforms the next-best method by 1.27–1.33% in AUC.
Performance improvements are largely statistically significant: most average p-values in the reported tests are below 0.05.
The ablation study demonstrates that the inclusion of TWEU and UAW is critical to both performance and reliability gains.

Robustness and Reliability through Uncertainty Estimation

A significant advancement is rigorous, instance-level uncertainty quantification. TWEU enables the model to assign low uncertainty to confident, correct predictions (Figure 2a) and high uncertainty to dubious or incorrect outputs (Figure 2b), providing an automatic flag for human intervention where needed.

Figure 2: (a) Correct prediction with low uncertainty; (b) Incorrect prediction with high uncertainty; (c) FedUAA’s average AUC degrades gracefully under increasing noise, exceeding all baselines.

In addition, robustness experiments involving synthetic Gaussian noise show FedUAA maintains significantly higher AUC under data degradation, indicating resilience to real-world image quality variations (Figure 2c).
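A robustness sweep of this kind can be sketched as follows. This is a minimal sketch assuming images are float arrays scaled to $[0, 1]$ and that noise is zero-mean Gaussian with standard deviation `sigma`. The names `add_gaussian_noise` and `sweep_noise`, the clipping step, and the `evaluate` callable (standing in for whatever computes a client's AUC) are hypothetical, and the noise levels used in the original experiments are not given in this summary.

```python
import numpy as np

def add_gaussian_noise(images, sigma, rng=None):
    """Add zero-mean Gaussian noise and clip back to the valid [0, 1] range."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

def sweep_noise(evaluate, images, sigmas, seed=0):
    """Evaluate a model on increasingly noisy copies of the same images.

    `evaluate` is a user-supplied callable that takes a batch of images
    and returns a scalar score (for example, an AUC).
    """
    rng = np.random.default_rng(seed)
    return {sigma: evaluate(add_gaussian_noise(images, sigma, rng)) for sigma in sigmas}

# Placeholder data and a placeholder metric, only to show the call pattern.
images = np.random.default_rng(0).random((4, 8, 8, 3))
scores = sweep_noise(lambda batch: float(batch.mean()), images, [0.0, 0.05, 0.1])
```

Plotting the returned scores against `sigma` for each method would give a curve comparable in spirit to the one summarized in Figure 2c.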

Implications and Future Directions

FedUAA’s incorporation of client-level uncertainty into aggregation dynamics sets a precedent for FL in high-risk, heterogeneous clinical applications, directly linking trust calibration with federated optimization. The decoupling of global feature encoding and local, evidence-based prediction heads allows institution-specific staging criteria without sacrificing representational consistency or privacy.

This work suggests broader applications: future FL systems for other medical tasks could leverage uncertainty-driven aggregation, improving both trustworthiness (for regulatory and clinical adoption) and generalization on non-iid cohorts. Further, the coupling with out-of-distribution detection or active learning pipelines could allow for automated escalation in medical AI workflows.

Conclusion

FedUAA establishes a compelling paradigm for federated DR staging, rigorously integrating uncertainty estimation into both prediction and aggregation. It achieves reliable and superior diagnostic accuracy across institutional boundaries and data regimes, critically advancing the viability of scalable, privacy-preserving, and trustworthy AI in clinical practice.
