- The paper presents FedUAA, which integrates a globally shared encoder with a local Temperature-Warmed Evidential Uncertainty head to jointly predict DR stage and uncertainty.
- It employs uncertainty-aware weighting and temperature scaling to dynamically adjust client contributions based on reliability metrics, addressing data non-iid challenges.
- Empirical evaluations demonstrate superior AUC performance and robustness under noise, establishing FedUAA as a reliable benchmark for clinical federated learning.
Federated Uncertainty-Aware Aggregation for Robust Fundus Diabetic Retinopathy Staging
Motivation and Problem Setting
Diabetic retinopathy (DR) staging via deep learning has advanced significantly, yet real-world clinical deployment faces two persistent barriers: (1) privacy-preserving model training across multi-institutional, inherently non-iid data, and (2) quantifying the reliability of automated staging results. Existing federated learning (FL) frameworks mitigate data isolation but overlook inter-client heterogeneity and neglect predictive confidence estimation, which is vital in high-stakes clinical decision-making. This work addresses these limitations by formulating an FL paradigm that is both uncertainty-aware and dynamically responsive to client reliability.
Framework Overview and Methodology
The proposed framework, Federated Uncertainty-Aware Aggregation (FedUAA), enables collaborative DR staging across distributed institutional cohorts with distinct DR grading criteria and heterogeneous data distributions. All clients share a global encoder, which aligns high-level representation learning, while each client retains a personalized local head modeled via a Temperature-Warmed Evidential Uncertainty (TWEU) module. Figure 1 depicts the FedUAA architecture.

Figure 1: FedUAA overview and TWEU module; the encoder is globally aggregated, while TWEU personalizes local staging and uncertainty. Uncertainty-Aware Weighting (UAW) dynamically calibrates aggregation weights based on client reliability.
The core components are:
- Shared Encoder with Local TWEU Head: The encoder is globally aggregated to maintain a consistent latent space, but each client retains a TWEU head, which jointly predicts DR stage and an explicit uncertainty score. TWEU is based on an evidential deep learning head parameterized by a Dirichlet distribution, directly outputting per-class belief masses and a total uncertainty value.
- Temperature Scaling: To overcome over-smoothing from Dirichlet regularization and sharpen belief mass assignment, temperature scaling is applied before the final cross-entropy loss.
- Uncertainty-Aware Weighting (UAW): Instead of conventional static aggregation (e.g., FedAvg), UAW dynamically assigns aggregation weights to clients proportional to their empirical reliability, quantified via the Youden index computed on the ROC curve of the client’s uncertainty distribution versus ground truth mispredictions.
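The mapping from head outputs to belief masses and a total uncertainty value can be illustrated with the standard evidential deep learning (subjective logic) formulation. This is a minimal sketch under assumptions, not the authors' implementation: the function names are hypothetical, a softplus is assumed as the evidence activation, and the temperature step is omitted because its exact form is not specified here.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus, used here as an assumed evidence activation."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def evidential_outputs(logits):
    """Map raw head outputs to Dirichlet-based beliefs and uncertainty.

    Returns (belief, uncertainty, probs) where
      alpha_k = e_k + 1,  S = sum_k alpha_k,
      belief_k = e_k / S, uncertainty = K / S, probs_k = alpha_k / S.
    """
    evidence = [softplus(z) for z in logits]      # non-negative evidence per class
    alpha = [e + 1.0 for e in evidence]           # Dirichlet concentration parameters
    strength = sum(alpha)                         # Dirichlet strength S
    num_classes = len(logits)
    belief = [e / strength for e in evidence]     # per-class belief masses
    uncertainty = num_classes / strength          # total uncertainty mass
    probs = [a / strength for a in alpha]         # expected class probabilities
    return belief, uncertainty, probs


if __name__ == "__main__":
    # Five DR stages: one confident prediction, one with no evidence preference.
    for name, logits in [("confident", [10.0, 0.0, 0.0, 0.0, 0.0]),
                         ("flat", [0.0, 0.0, 0.0, 0.0, 0.0])]:
        belief, u, probs = evidential_outputs(logits)
        print(name, "uncertainty:", round(u, 3))
```

By construction the belief masses and the uncertainty sum to one, so stronger evidence for any class lowers the reported uncertainty.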
Formally, given $N$ clients, each with model $f_i(\phi_i, \psi_i)$ (encoder, head) and data $(X_i, Y_i)$, the FedUAA objective is to minimize aggregated client loss, with contribution weights $w_i$ determined by normalized client reliability thresholds $\theta_i$ via softmax.
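The aggregation step described above can be sketched in three parts: pick the uncertainty threshold that maximizes the Youden index for each client, turn the per-client scores into weights with a softmax, and average the encoder parameters with those weights. This is an illustrative sketch only. The function names, the flat-list parameter representation, and the choice to feed the thresholds directly into the softmax without further normalization are assumptions, not details confirmed by the summary.

```python
import math


def youden_threshold(uncertainties, is_wrong):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    A sample is flagged when its uncertainty is >= threshold; the positive
    class is "the prediction was wrong".
    """
    pos = sum(1 for w in is_wrong if w)
    neg = len(is_wrong) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one correct and one wrong prediction")
    best_t, best_j = None, -math.inf
    for t in sorted(set(uncertainties)):
        tp = sum(1 for u, w in zip(uncertainties, is_wrong) if w and u >= t)
        fp = sum(1 for u, w in zip(uncertainties, is_wrong) if not w and u >= t)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def softmax(scores):
    """Numerically stable softmax over a list of client scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_encoders(client_params, weights):
    """Weighted average of encoder parameters, given as flat lists of floats."""
    n = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params)) for i in range(n)]


if __name__ == "__main__":
    # Toy validation results for two hypothetical clients.
    theta_a, _ = youden_threshold([0.1, 0.2, 0.8, 0.9], [False, False, True, True])
    theta_b, _ = youden_threshold([0.3, 0.4, 0.5, 0.6], [False, True, False, True])
    weights = softmax([theta_a, theta_b])
    print(weights, aggregate_encoders([[1.0, 2.0], [3.0, 4.0]], weights))
```

Only the shared encoder is averaged in this sketch; each client's local head stays on the client, matching the shared-encoder, local-head split described above.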
Loss Functions
The training is governed by an augmented cross-entropy loss $\mathcal{L}_{Uce}$, defined as the Dirichlet-expectation cross-entropy plus annealed KL divergence, and a temperature-warmed cross-entropy $\mathcal{L}_{Tce}$. Together, these promote calibrated evidence assignments and penalize overconfident misclassifications, essential for robust uncertainty quantification. A schedule increases the KL regularization during training.
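For orientation, a Dirichlet-expectation cross-entropy with an annealed KL term is commonly written as below in the evidential deep learning literature. The exact equations are not given here, so this should be read as the standard form under stated assumptions rather than the authors' definition: $e_k$ is the evidence for class $k$ of $K$, $\mathbf{y}$ is the one-hot label, $\Psi$ is the digamma function (distinct from the head parameters $\psi_i$), $t$ is the current training epoch, and $T_{\mathrm{anneal}}$ is a hypothetical annealing horizon.

```latex
\begin{aligned}
\alpha_k &= e_k + 1, \qquad S = \sum_{k=1}^{K} \alpha_k, \\
\mathcal{L}_{Uce} &= \sum_{k=1}^{K} y_k \bigl( \Psi(S) - \Psi(\alpha_k) \bigr)
  + \lambda_t \, \mathrm{KL}\!\left[ \mathrm{Dir}(\mathbf{p} \mid \tilde{\boldsymbol{\alpha}})
  \,\middle\|\, \mathrm{Dir}(\mathbf{p} \mid \mathbf{1}) \right], \\
\tilde{\boldsymbol{\alpha}} &= \mathbf{y} + (1 - \mathbf{y}) \odot \boldsymbol{\alpha}, \qquad
\lambda_t = \min\!\left( 1, \; t / T_{\mathrm{anneal}} \right).
\end{aligned}
```

The first term is the expected cross-entropy under the predicted Dirichlet distribution. The KL term pushes the evidence assigned to incorrect classes toward zero, and $\lambda_t$ grows from 0 to 1 so that this penalty strengthens as training proceeds.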
Empirical Evaluation and Analysis
The evaluation leverages five publicly available heterogeneous DR datasets, each mapped to a federated client to best reflect real clinical non-iid environments. The model demonstrates consistently superior AUCs across all clients and in aggregate, outperforming a comprehensive suite of state-of-the-art FL baselines including FedBN, FedProx, and FedRep.
Key numerical findings include:
- FedUAA achieves the highest average AUC (0.8636) across clients, a 1.48% absolute gain over the best previous FL baseline (FedBN).
- Notably, on clients characterized by smaller data volumes or pronounced heterogeneity (e.g., Messidor, DRR, IDRiD), FedUAA outperforms the next-best method by 1.27–1.33% in AUC.
- The improvements are reported as statistically significant, with most average p-values below 0.05.
- The ablation study demonstrates that the inclusion of TWEU and UAW is critical to both performance and reliability gains.
Robustness and Reliability through Uncertainty Estimation
A key contribution is instance-level uncertainty quantification. TWEU enables the model to assign low uncertainty to confident, correct predictions (Figure 2a) and high uncertainty to dubious or incorrect outputs (Figure 2b), providing an automatic flag for human intervention where needed.
Figure 2: (a) Correct prediction with low uncertainty; (b) Incorrect prediction with high uncertainty; (c) FedUAA’s average AUC degrades gracefully under increasing noise, exceeding all baselines.
In addition, robustness experiments involving synthetic Gaussian noise show FedUAA maintains significantly higher AUC under data degradation, indicating resilience to real-world image quality variations (Figure 2c).
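A noise-robustness probe of this kind can be sketched as follows. The noise levels, the [0, 1] intensity range, and the function name are assumptions for illustration; the actual experimental settings are not stated here.

```python
import random


def add_gaussian_noise(image, sigma, seed=0):
    """Add zero-mean Gaussian noise to a 2-D image and clip to [0, 1].

    `image` is a list of rows of float intensities in [0, 1]; `sigma` is the
    noise standard deviation; `seed` makes the perturbation reproducible.
    """
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


if __name__ == "__main__":
    image = [[0.0, 0.5], [1.0, 0.25]]
    # Hypothetical sweep: evaluate a model on each degraded copy and record AUC.
    for sigma in (0.0, 0.1, 0.3):
        print(sigma, add_gaussian_noise(image, sigma))
```

In an evaluation loop, each noise level would be applied to the test images only, and the per-client AUC recorded at each level to trace a degradation curve like the one in Figure 2c.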
Implications and Future Directions
FedUAA’s incorporation of client-level uncertainty into aggregation dynamics sets a precedent for FL in high-risk, heterogeneous clinical applications, directly linking trust calibration with federated optimization. The decoupling of global feature encoding and local, evidence-based prediction heads allows institution-specific staging criteria without sacrificing representational consistency or privacy.
This work suggests broader applications: future FL systems for other medical tasks could leverage uncertainty-driven aggregation, improving both trustworthiness (for regulatory and clinical adoption) and generalization on non-iid cohorts. Further, the coupling with out-of-distribution detection or active learning pipelines could allow for automated escalation in medical AI workflows.
Conclusion
FedUAA establishes a compelling paradigm for federated DR staging, rigorously integrating uncertainty estimation into both prediction and aggregation. It achieves reliable and superior diagnostic accuracy across institutional boundaries and data regimes, critically advancing the viability of scalable, privacy-preserving, and trustworthy AI in clinical practice.