FedBiCross: OSFL for Non-IID Medical Data

Updated 12 January 2026
  • FedBiCross is a bi-level optimization framework that enables data-free one-shot federated learning by aggregating decentralized models via clustering and adaptive weight optimization.
  • It employs K-means clustering on client prediction matrices and deep inversion techniques to generate synthetic data, mitigating uniform soft label issues in non-IID settings.
  • The framework achieves significant performance gains on MedMNIST datasets and supports personalized fine-tuning for privacy-sensitive clinical applications.

FedBiCross is a bi-level optimization framework introduced for data-free one-shot federated learning (OSFL) under non-IID (not independent and identically distributed) settings, with a particular emphasis on privacy-sensitive medical imaging data. Unlike conventional federated learning solutions that require multiple rounds of communication or direct access to raw data, FedBiCross performs knowledge aggregation in a single round by exchanging models only. It addresses the major challenge in OSFL: the destructive effect of aggregating predictions from non-IID clients, which can result in near-uniform soft labels and inadequate supervision for distillation (Xia et al., 5 Jan 2026).

1. OSFL under Non-IID Medical Data: Problem Formulation

In OSFL, each of $N$ clients $i \in \{1, \dots, N\}$ holds a private dataset $\mathcal{D}_i$ with substantial distribution skew. After clients upload their locally trained models $f_i$, the server is restricted to a single communication round and aims to produce a set of personalized models $\{f_i^{\mathrm{pers}}\}$ tailored to each client's data distribution.

Key notations:

  • $f_i(\bm x) \in \Delta^{C-1}$: Client $i$'s soft-prediction vector over $C$ classes.
  • $F(\bm x) = \frac{1}{N}\sum_{i=1}^N f_i(\bm x)$: Uniform ensemble teacher.
  • Knowledge distillation loss on synthetic inputs $\{\hat{\bm x}\}$:

$$L_{\mathrm{KD}}(G, F, \{\hat{\bm x}\}) = \sum_{\hat{\bm x}} \mathrm{KL}\bigl(F(\hat{\bm x}) \,\|\, G(\hat{\bm x})\bigr)$$

Under strong non-IID skew, the aggregated prediction $F(\hat{\bm x})$ can be nearly uniform, diminishing the supervisory signal available for student learning. The sketch below illustrates both quantities.
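A minimal sketch of these two quantities, assuming PyTorch and client models that output logits; the toy example at the end (three hypothetical over-confident clients) shows how naive averaging collapses toward a uniform soft label.

```python
import torch
import torch.nn.functional as TF

def ensemble_teacher(client_models, x):
    """F(x) = (1/N) * sum_i f_i(x): average of the clients' soft predictions."""
    probs = [torch.softmax(f(x), dim=-1) for f in client_models]
    return torch.stack(probs, dim=0).mean(dim=0)              # shape (batch, C)

def kd_loss(student, teacher_probs, x):
    """L_KD = KL(F(x) || G(x)); kl_div takes the student's log-probabilities."""
    return TF.kl_div(torch.log_softmax(student(x), dim=-1),
                     teacher_probs, reduction="batchmean")

# Toy illustration of the near-uniform-label failure mode: three clients that are
# each confident on a different class average out to an almost uniform soft label.
logits = [torch.tensor([[8.0, 0.0, 0.0]]),
          torch.tensor([[0.0, 8.0, 0.0]]),
          torch.tensor([[0.0, 0.0, 8.0]])]
avg = torch.stack([torch.softmax(z, dim=-1) for z in logits]).mean(0)
print(avg)   # roughly [0.33, 0.33, 0.33] -> little supervisory signal for distillation
```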

2. Client Clustering and Sub-Ensemble Construction

To counteract “teacher disagreement” from naïve averaging, FedBiCross applies a clustering stage based on output similarity of client models.

  • Output Similarity Measurement: Generate $M$ random noise inputs $\{\bm z_m\}$; for each client, construct the prediction matrix $\bm P_i = \bigl[f_i(\bm z_1), \dots, f_i(\bm z_M)\bigr] \in \mathbb{R}^{C \times M}$.
  • Clustering via $K$-Means: Partition clients into $K$ clusters $\{\mathcal{C}_k\}$ by minimizing the Frobenius-norm distortion

$$\min_{\{\mathcal{C}_k\}, \{\bm c_k\}} \sum_{k=1}^K \sum_{i \in \mathcal{C}_k} \|\bm P_i - \bm c_k\|_F^2$$

with centroids $\bm c_k$.

  • Sub-Ensemble Teacher Construction: For cluster $k$, define $F_k(\bm x) = \frac{1}{|\mathcal{C}_k|}\sum_{i \in \mathcal{C}_k} f_i(\bm x)$.
  • Deep Inversion for Synthetic Data: Synthetic batches for each cluster are generated via iterative gradient optimization:

$$\hat{\bm x}_k^{(t)} = \hat{\bm x}_k^{(t-1)} - \eta_s \nabla_{\hat{\bm x}} \mathcal{L}_{\mathrm{DI}}\bigl(\hat{\bm x}_k^{(t-1)}; F_k, y\bigr)$$

where $\mathcal{L}_{\mathrm{DI}}$ contains cross-entropy, total-variation, and batch-norm regularization terms.

Noise-adapted teachers $\tilde F_k$ are constructed by updating batch-normalization statistics along the inversion-based synthesis trajectory; a code sketch of this clustering-and-synthesis stage follows.
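A minimal sketch of this stage, assuming PyTorch and scikit-learn, client models that map image batches to logits, and illustrative shapes and hyperparameters ($M=64$, $3{\times}28{\times}28$ inputs, $\eta_s=0.05$). Only the cross-entropy term of $\mathcal{L}_{\mathrm{DI}}$ is shown; the total-variation and batch-norm regularizers, and the batch-norm statistic updates for $\tilde F_k$, are omitted.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def prediction_matrix(model, noise):
    """P_i = [f_i(z_1), ..., f_i(z_M)] in R^{C x M}, built from soft predictions."""
    return torch.softmax(model(noise), dim=-1).T                     # (C, M)

def cluster_clients(client_models, num_clusters, M=64, in_shape=(3, 28, 28)):
    """Group clients by output similarity on M shared random-noise probes; K-means on
    the flattened P_i minimizes exactly the Frobenius-norm objective above."""
    noise = torch.randn(M, *in_shape)
    feats = np.stack([prediction_matrix(f, noise).flatten().numpy()
                      for f in client_models])
    labels = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit_predict(feats)
    return [np.where(labels == k)[0].tolist() for k in range(num_clusters)]

def sub_ensemble(client_models, members):
    """Cluster teacher F_k(x): mean soft prediction of the cluster's member clients."""
    def F_k(x):
        return torch.stack([torch.softmax(client_models[i](x), dim=-1)
                            for i in members]).mean(0)
    return F_k

def deep_inversion_step(x_hat, F_k, y, eta_s=0.05):
    """One synthesis update x^(t) = x^(t-1) - eta_s * grad L_DI (cross-entropy part only)."""
    x_hat = x_hat.detach().clone().requires_grad_(True)
    loss = torch.nn.functional.nll_loss(torch.log(F_k(x_hat) + 1e-8), y)
    loss.backward()
    return (x_hat - eta_s * x_hat.grad).detach()
```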

3. Bi-Level Cross-Cluster Optimization

While clustering mitigates the disagreement induced by naive averaging, each cluster's information remains limited, and naively mixing data from other clusters risks negative transfer. FedBiCross therefore learns adaptive cross-cluster weights via bi-level optimization.

  • Bi-Level Objective:

For cluster $k$ with learnable weights $\bm w_k = (w_{k,1}, \dots, w_{k,K})$, the bi-level objective is:

$$\begin{aligned} \bm w_k^* &= \arg\min_{\bm w_k} \sum_{t=1}^T \mathcal{L}_{\mathrm{KD}}^{(t)}\bigl(G_k^*(\bm w_k), F_k, \tilde F_k, \hat{\bm x}_k^{(t,\mathrm{val})}\bigr) \\ \text{s.t.}\quad G_k^*(\bm w_k) &= \arg\min_{G} \sum_{t=1}^T \sum_{j=1}^K w_{k,j}\, \mathcal{L}_{\mathrm{KD}}^{(t)}\bigl(G, F_j, \tilde F_j, \hat{\bm x}_j^{(t,\mathrm{train})}\bigr) \end{aligned}$$

with

$$\mathcal{L}_{\mathrm{KD}}^{(t)}(G, F, \tilde F, \hat{\bm x}) = \lambda^{(t)}\, \mathrm{KL}\bigl(\tilde F(\hat{\bm x}) \,\|\, G(\hat{\bm x})\bigr) + \bigl(1 - \lambda^{(t)}\bigr)\, \mathrm{KL}\bigl(F(\hat{\bm x}) \,\|\, G(\hat{\bm x})\bigr)$$

and $\lambda^{(t)} = 1 - t/T$, so supervision shifts from the noise-adapted teacher $\tilde F$ toward the original sub-ensemble $F$ as synthesis progresses.

  • Online Approximation: $G_k$ and $\bm w_k$ are updated in alternating inner/outer steps at each synthesis iteration, with $\bm w_k$ projected onto the probability simplex after every update (see the sketch below).

This approach dynamically suppresses clusters that negatively impact knowledge transfer for a given cluster, enhancing the diversity and relevance of the cross-cluster guidance.
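A minimal sketch of one such alternating update, under simplifying assumptions: the student $G_k$ is reduced to a linear probe on flattened inputs, the outer gradient is approximated by unrolling a single differentiable inner step, and the batch tensors (`train_x`, `train_F`, `train_Ftil`, `val_*`) are hypothetical placeholders for the synthetic data and teacher probabilities. This is one common online approximation, not necessarily the paper's exact schedule.

```python
import torch
import torch.nn.functional as TF

def project_simplex(w):
    """Euclidean projection onto the probability simplex (standard sort-based rule)."""
    u, _ = torch.sort(w, descending=True)
    css = torch.cumsum(u, dim=0) - 1.0
    rho = torch.nonzero(u * torch.arange(1, len(w) + 1) > css).max()
    return torch.clamp(w - css[rho] / (rho + 1.0), min=0.0)

def scheduled_kd(student_logits, F_probs, Ftil_probs, lam):
    """L_KD^(t) = lam * KL(F_tilde || G) + (1 - lam) * KL(F || G), with lam = 1 - t/T."""
    log_g = torch.log_softmax(student_logits, dim=-1)
    return (lam * TF.kl_div(log_g, Ftil_probs, reduction="batchmean")
            + (1 - lam) * TF.kl_div(log_g, F_probs, reduction="batchmean"))

# Illustrative sizes: flattened-input linear student, K clusters, T synthesis steps.
D, C, K, T = 3 * 28 * 28, 8, 4, 50
theta = torch.zeros(D, C, requires_grad=True)          # student G_k's parameters
w_k = torch.full((K,), 1.0 / K, requires_grad=True)    # cross-cluster weights
inner_lr, outer_lr = 0.1, 0.05

def bilevel_step(t, train_x, train_F, train_Ftil, val_x, val_F, val_Ftil):
    """One alternating inner/outer update at synthesis iteration t.
    train_x[j] is cluster j's synthetic batch (B, D); train_F[j] / train_Ftil[j] hold the
    corresponding teacher / noise-adapted-teacher probabilities (B, C); the val_* tensors
    are cluster k's own held-out synthetic batch."""
    global theta, w_k
    lam = 1.0 - t / T
    # Inner: one differentiable step on the w-weighted cross-cluster KD loss.
    inner = sum(w_k[j] * scheduled_kd(train_x[j] @ theta, train_F[j], train_Ftil[j], lam)
                for j in range(K))
    (g_theta,) = torch.autograd.grad(inner, theta, create_graph=True)
    theta_new = theta - inner_lr * g_theta
    # Outer: cluster k's validation KD loss through the unrolled step -> gradient wrt w_k.
    outer = scheduled_kd(val_x @ theta_new, val_F, val_Ftil, lam)
    (g_w,) = torch.autograd.grad(outer, w_k)
    with torch.no_grad():
        w_k -= outer_lr * g_w
        w_k.copy_(project_simplex(w_k))                # keep w_k on the simplex
    theta = theta_new.detach().requires_grad_(True)
```

Because the outer gradient flows through the unrolled inner step, clusters whose synthetic data hurt cluster $k$'s validation loss automatically receive smaller weight after projection.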

4. Personalized Distillation for Client Adaptation

Final student models $G_k^{(T)}$ for each cluster are fine-tuned on private data to yield personalized models for each client.

  • Initialization: $f_i^{\mathrm{pers}} \leftarrow G_k^{(T)}$ for $i \in \mathcal{C}_k$.
  • Objective combines:
    • Fitting local data via cross-entropy
    • Regularization for cluster knowledge preservation
    • Retention of original client bias

$$\mathcal{L}_{\mathrm{pers}} = \mathcal{L}_{\mathrm{CE}}\bigl(f_i^{\mathrm{pers}}(\bm x), y\bigr) + \gamma\, \mathrm{KL}\bigl(G_k^{(T)}(\bm x) \,\|\, f_i^{\mathrm{pers}}(\bm x)\bigr) + \delta\, \mathrm{KL}\bigl(f_i(\bm x) \,\|\, f_i^{\mathrm{pers}}(\bm x)\bigr)$$

This stage enables client-specific adaptation while maintaining federated and cluster-level knowledge.
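A minimal sketch of the personalization objective above, assuming PyTorch models that emit logits; the $\gamma$ and $\delta$ values are illustrative placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as TF

def personalization_loss(pers_model, cluster_student, client_model, x, y,
                         gamma=0.5, delta=0.5):
    """L_pers = CE(f_i^pers(x), y) + gamma * KL(G_k^(T)(x) || f_i^pers(x))
                                   + delta * KL(f_i(x)     || f_i^pers(x))."""
    logits = pers_model(x)
    log_p = torch.log_softmax(logits, dim=-1)
    with torch.no_grad():                                      # frozen reference models
        q_cluster = torch.softmax(cluster_student(x), dim=-1)  # cluster student G_k^(T)
        q_client = torch.softmax(client_model(x), dim=-1)      # original client model f_i
    return (TF.cross_entropy(logits, y)
            + gamma * TF.kl_div(log_p, q_cluster, reduction="batchmean")
            + delta * TF.kl_div(log_p, q_client, reduction="batchmean"))
```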

5. Algorithmic Workflow

The methodological pipeline of FedBiCross consists of three sequential stages as illustrated below.

| Stage | Core Operation | Output |
| --- | --- | --- |
| Clustering / data synthesis | $K$-means grouping, deep-inversion synthetic data, teacher construction | Cluster synthetic data and teachers $\{F_k\}$ |
| Bi-level optimization | Online weight adaptation, model updates per cluster | Models $\{G_k^{(T)}\}$, adapted weights $\{\bm w_k\}$ |
| Personalization | Fine-tuning with local client data | Personalized models $\{f_i^{\mathrm{pers}}\}$ |

The original paper further details the process in stepwise pseudocode specifying model construction, synthetic data generation, inner-outer updates, and final adaptation.

6. Experimental Evaluation and Ablation Analyses

FedBiCross is empirically validated on four MedMNIST v2 datasets: BloodMNIST, DermaMNIST, OCTMNIST, and TissueMNIST. Non-IID splits are constructed via Dirichlet$(\alpha)$ sampling with $\alpha \in \{0.1, 0.2, 0.3, 0.5\}$. Experiments use varying client and cluster counts matched to dataset and statistical complexity.
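The Dirichlet label-skew protocol referenced above is commonly implemented as follows; this NumPy sketch shows the standard recipe (per-class client proportions drawn from Dirichlet$(\alpha)$), not necessarily the paper's exact partitioning script, and the sizes are illustrative. Smaller $\alpha$ concentrates each class on fewer clients, giving the strongest heterogeneity at $\alpha = 0.1$.

```python
import numpy as np

def dirichlet_split(labels, num_clients, alpha, seed=0):
    """Partition sample indices into num_clients label-skewed shards: for each class,
    draw client proportions from Dirichlet(alpha) and allocate that class's samples
    accordingly (smaller alpha -> stronger skew)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props) * len(idx)).astype(int)[:-1]
        for cid, part in enumerate(np.split(idx, cuts)):
            client_idx[cid].extend(part.tolist())
    return client_idx

# Example: 5 clients at alpha = 0.1, the strongest-skew setting reported above.
toy_labels = np.random.randint(0, 8, size=1000)
shards = dirichlet_split(toy_labels, num_clients=5, alpha=0.1)
print([len(s) for s in shards])
```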

Compared baselines include FedAvg-1, DAFL, DENSE, FedISCA, and Co-Boosting. The principal metric is average test accuracy per client. Quantitative results show that FedBiCross consistently outperforms all baselines, with improvements ranging from 10 to 30 points; e.g., on BloodMNIST with $N=5$ and $\alpha=0.1$, FedBiCross achieves 85.57% versus Co-Boosting's 54.75%.

Ablation studies confirm that:

  • Intra-cluster-only schemes underperform by 5–10 points.
  • Uniform and similarity-based cross-cluster weighting lag behind full bi-level weighting.
  • Eliminating personalization drops performance by 7–15 points.
  • Disabling clustering reduces accuracy by 10–20 points.

Qualitatively, synthetic samples generated by FedBiCross display defined medical structures, whereas competing approaches show artifacts or mode collapse.

7. Significance and Implications

FedBiCross combines clustering by prediction similarity, bi-level online optimization for knowledge selection, and personalized fine-tuning into a unified framework, improving OSFL feasibility in privacy-constrained, non-IID clinical environments. The demonstrated robustness to distributional skew and marked performance gains relative to state-of-the-art methods suggest that bi-level adaptation and careful sub-ensemble formation are critical for one-shot federated settings (Xia et al., 5 Jan 2026). A plausible implication is the generalizability of clustering-plus-bi-level optimization in wider federated setups beyond medical imaging, wherever client predictions exhibit strong heterogeneity.
