Federated Mutual Learning (FML)

Updated 1 July 2026

Federated Mutual Learning (FML) is a federated learning paradigm where clients share soft predictions to mutually distill knowledge across heterogeneous models and data.
It enhances generalization, personalization, and communication efficiency by incorporating a KL divergence-based objective in both centralized and decentralized protocols.
FML reduces communication overhead by exchanging soft labels instead of full model weights, while offering robustness to adversarial participants and to data heterogeneity.

Federated Mutual Learning (FML) defines a broad family of federated learning (FL) frameworks wherein participating clients collaborate by sharing model predictions, soft labels, or models for mutual knowledge distillation, rather than, or in addition to, exchanging raw gradient updates or full model weights. Motivated by the limitations of canonical FL (especially FedAvg) under data, model, and objective heterogeneity, FML algorithms employ distributed or bidirectional knowledge transfer based on Kullback-Leibler (KL) divergence over model outputs. FML encompasses both centralized (server-orchestrated) and decentralized (peer-to-peer) protocols, supports homogeneous or heterogeneous client architectures, and can be instantiated with or without auxiliary public data. This paradigm demonstrably improves generalization, personalization, robustness to data heterogeneity, and communication efficiency compared to classical parameter-averaging approaches.

1. Core Principles and Formal Objectives

In the canonical FML protocol—prototyped by "Federated Learning Framework via Distributed Mutual Learning"—a system consists of $K$ clients, each with private data $D_i$ and model parameters $\theta_i$ (Gupta, 3 Mar 2025). Clients also receive access to a small public dataset $X_\text{pub} = \{x_1,\dots,x_M\}$ for exchanging knowledge. The training objective for each client $i$ augments the vanilla local empirical loss with a KL-based mutual knowledge distillation term:

$\mathcal{L}_i^{\text{FML}}(\theta_i) = L_i(\theta_i) + \lambda \frac{1}{M} \sum_{m=1}^M \mathrm{KL}\left(p_i(x_m \mid \theta_i)~\|~\bar p_{-i}(x_m)\right)$

where $L_i(\theta_i)$ is the cross-entropy loss on $D_i$, $p_i(x_m \mid \theta_i)$ is the softmax output (with optional temperature $T$) of client $i$ on public example $x_m$, and $\bar p_{-i}(x_m)$ denotes the arithmetic mean of all other clients' outputs on $x_m$.

By setting $\lambda = 0$, the mutual term vanishes and each client minimizes only its local loss $L_i$; for $\lambda > 0$, each client is encouraged to align its output distribution with the consensus of its peers on the public set. This mechanism supports model-architecture heterogeneity (since only output logits are exchanged) and admits variants such as bidirectional distillation, clustering-based peer selection, and deployment on decentralized topologies (Matsuda et al., 2021, Li et al., 2020, Khalil et al., 2024, Bai et al., 11 Jun 2025, Shen et al., 2020).
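The objective above can be made concrete with a minimal sketch. It assumes each client's predictions on the public set are already available as plain Python lists; the function names (`softmax`, `kl_divergence`, `peer_mean`, `fml_loss`) and the `eps` smoothing constant are illustrative choices, not part of any cited implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one logit vector."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def peer_mean(all_probs, i):
    """Arithmetic mean of every client's distribution except client i."""
    peers = [probs for k, probs in enumerate(all_probs) if k != i]
    return [sum(col) / len(peers) for col in zip(*peers)]

def fml_loss(local_loss, client_probs, i, lam):
    """L_i + lam * (1/M) * sum_m KL(p_i(x_m) || mean of peers on x_m).

    client_probs[k][m] is client k's distribution on public example m.
    """
    num_public = len(client_probs[i])
    kl_sum = 0.0
    for m in range(num_public):
        on_example = [client_probs[k][m] for k in range(len(client_probs))]
        kl_sum += kl_divergence(client_probs[i][m], peer_mean(on_example, i))
    return local_loss + lam * kl_sum / num_public

# Three clients, two public examples, two classes (made-up logits).
logits = [[[2.0, 0.0], [0.0, 1.0]],
          [[1.0, 0.0], [0.0, 2.0]],
          [[0.0, 0.0], [0.0, 0.0]]]
probs = [[softmax(z) for z in client] for client in logits]
print(fml_loss(0.7, probs, i=0, lam=0.0))  # 0.7, the local loss alone
print(fml_loss(0.7, probs, i=0, lam=1.0))  # larger: the KL penalty is added
```

With `lam=0.0` the function returns the local loss unchanged, matching the reduction described above.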

2. Algorithmic Instantiations and Protocol Variants

FML implementations differ in topology, communication medium, and aggregation strategies but share key stages:

Local private update: Each client performs one or more epochs of local data optimization.
Prediction exchange: Clients compute output vectors (typically softmax probabilities) on a shared public set or their own data (depending on the protocol).
Aggregation: Server or peer(s) aggregate these output vectors (average or otherwise).
Distillation update: Clients incorporate a KL-regularized objective based on peer outputs.

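A representative centralized FML training loop (Gupta, 3 Mar 2025) repeats these stages every round: each client $i$ updates $\theta_i$ on its private data $D_i$, evaluates $p_i(x_m \mid \theta_i)$ on every public example in $X_\text{pub}$, and uploads the resulting vectors; the server computes the peer average $\bar p_{-i}(x_m)$ for each client and returns it; each client then takes further gradient steps on $\mathcal{L}_i^{\text{FML}}(\theta_i)$.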
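The four stages listed above can be sketched as a toy round. This is an illustration under strong assumptions, not the cited algorithm: each client's "model" is reduced to a table of logits on the public set, the local private update is left as an optional caller-supplied hook (`local_update`, hypothetical), and the distillation update is a single gradient step on the KL term with respect to the logits.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def leave_one_out_means(client_probs):
    """Server step: for each client, average the other clients' outputs.

    client_probs[k][m][c] is client k's probability of class c on public example m.
    """
    n, num_public, num_classes = (len(client_probs), len(client_probs[0]),
                                  len(client_probs[0][0]))
    totals = [[sum(client_probs[k][m][c] for k in range(n))
               for c in range(num_classes)] for m in range(num_public)]
    return [[[(totals[m][c] - client_probs[i][m][c]) / (n - 1)
              for c in range(num_classes)] for m in range(num_public)]
            for i in range(n)]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_step(logits, target, lr):
    """One gradient step on KL(softmax(logits) || target) w.r.t. the logits."""
    p = softmax(logits)
    divergence = kl(p, target)
    # d KL / d z_j = p_j * (log(p_j / t_j) - KL) for p = softmax(z)
    grad = [pj * (math.log(pj / tj) - divergence) for pj, tj in zip(p, target)]
    return [z - lr * g for z, g in zip(logits, grad)]

def fml_round(client_logits, lr=1.0, local_update=None):
    """One round: local update, prediction exchange, aggregation, distillation."""
    if local_update is not None:  # stage 1, supplied by the caller
        client_logits = [local_update(i, z) for i, z in enumerate(client_logits)]
    probs = [[softmax(z) for z in client] for client in client_logits]  # stage 2
    targets = leave_one_out_means(probs)                                # stage 3
    return [[distill_step(z, t, lr) for z, t in zip(client, target)]   # stage 4
            for client, target in zip(client_logits, targets)]

# Three clients that disagree on a single public example.
state = [[[2.0, 0.0]], [[0.0, 2.0]], [[0.0, 0.0]]]
for _ in range(5):
    state = fml_round(state)
print([softmax(client[0]) for client in state])  # predictions drift toward consensus
```

The server uses the identity "sum over all clients minus own output" so that every leave-one-out mean is obtained from a single pass over the uploaded vectors.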

Variants include:

FedMe: Clients exchange full models (of potentially different architectures), perform deep mutual learning (DML) over local data, and use validation loss-based model selection for automatic architectural tuning (Matsuda et al., 2021).
Def-KT and DFML: Decentralized, peer-to-peer protocols where clients pair with others, exchange models, and conduct mutual distillation locally, obviating the need for a central server or public data (Li et al., 2020, Khalil et al., 2024).
FedMLAC: Clients maintain both a personalized local model and a lightweight, globally shared "Plug-in" model, enforcing bidirectional KL-based distillation and employing layer-wise pruning in aggregation to enhance robustness in heterogeneous and potentially adversarial environments (Bai et al., 11 Jun 2025).
FML ("meme+personal"): Each client simultaneously trains a global ("meme") model and a local personalized model, exchanging meme updates with the server and mutually distilling between meme and local on private data (Shen et al., 2020).
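For the two-model variants in this list (meme and local, or personalized and plug-in), the mutual distillation on a single private example can be sketched as follows. The mixing weights `alpha` and `beta`, the function names, and the KL direction (peer distribution first, following the usual deep mutual learning convention) are assumptions made for illustration; the cited papers should be consulted for their exact losses.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])

def kl(p, q):
    """KL(p || q); assumes strictly positive q, as softmax outputs are."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_losses(p_local, p_meme, label, alpha=0.5, beta=0.5):
    """Per-example losses when two models on one client teach each other.

    Each model mixes its own supervised loss with a KL term that pulls it
    toward the other model's prediction, which is treated as a fixed target.
    """
    loss_local = (alpha * cross_entropy(p_local, label)
                  + (1 - alpha) * kl(p_meme, p_local))
    loss_meme = (beta * cross_entropy(p_meme, label)
                 + (1 - beta) * kl(p_local, p_meme))
    return loss_local, loss_meme

# Made-up predictions from the two models for a sample whose label is 0.
loss_local, loss_meme = mutual_losses([0.9, 0.1], [0.6, 0.4], label=0)
print(loss_local, loss_meme)
```

Both losses need a forward pass through both models, which is one way to see where the extra per-batch computation of these variants comes from.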

3. Communication and Computational Efficiency

FML protocols typically achieve substantial reductions in communication volume compared to weight-sharing baselines. For example, in a standard centralized FML, the per-round per-client communication cost is $O(MC)$ bits (sharing $M$ softmax vectors of dimension $C$, the number of classes), versus $O(|\theta_i|)$ for full model exchange (where $|\theta_i|$ is the parameter count) (Gupta, 3 Mar 2025). A practical case showed a 50x reduction in the number of floats transmitted (soft outputs versus weights). Model-exchange variants (FedMe, Def-KT, DFML), by their nature, transmit model parameters but address heterogeneity and personalization tradeoffs not attainable with weight-averaging.
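The comparison above is simple arithmetic, sketched here with hypothetical sizes. The numbers below are chosen for illustration only and are not the figures from the cited case; the function names are likewise illustrative.

```python
def soft_output_floats(num_public, num_classes):
    """Floats sent per client per round when sharing softmax vectors."""
    return num_public * num_classes

def reduction_factor(num_params, num_public, num_classes):
    """How many times smaller the soft-output payload is than the weights."""
    return num_params / soft_output_floats(num_public, num_classes)

# Hypothetical sizes: 500 public examples, 10 classes, a 1M-parameter model.
print(soft_output_floats(500, 10))           # 5000
print(reduction_factor(1_000_000, 500, 10))  # 200.0
```

The payload of the soft-output scheme grows with the public set and the label space, not with model size, so the saving shrinks for small models or very large public sets.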

Computation overhead for each client may increase relative to FedAvg due to the need for dual-model forward and backward passes (e.g., meme and local in FML, personalized and plug-in in FedMLAC), most notably doubling per-batch cost in vanilla mutual learning (Shen et al., 2020, Bai et al., 11 Jun 2025).

4. Privacy Properties and Security Considerations

FML mitigates many privacy and inference risks associated with gradient or weight sharing. Since only soft predictions on a public or non-sensitive set are exchanged (rather than gradients on private data or full weights), the risk of model inversion attacks is reduced (Gupta, 3 Mar 2025). Research to date provides no formal differential privacy proofs; privacy claims instead rely on the difficulty of inferring private data from soft outputs on known public inputs. Some FML extensions propose adding noise to logit vectors or using further aggregation protocols to strengthen privacy (Gupta, 3 Mar 2025, Shen et al., 2020).
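The idea of perturbing logits before sharing can be sketched as below. This is only an illustration: the Gaussian mechanism, the `sigma` parameter, and the function name are assumptions, and the noise scale shown is not calibrated to any differential privacy guarantee.

```python
import random

def noisy_logits(logits, sigma, rng=None):
    """Add independent zero-mean Gaussian noise to every logit before sharing.

    logits[m] is the logit vector for public example m; sigma is the noise
    standard deviation (a hypothetical tuning knob, not a calibrated bound).
    """
    if rng is None:
        rng = random.Random()
    return [[z + rng.gauss(0.0, sigma) for z in row] for row in logits]

# A seeded generator makes the perturbation reproducible for debugging.
clean = [[2.0, -1.0, 0.5], [0.0, 0.3, 1.2]]
shared = noisy_logits(clean, sigma=0.1, rng=random.Random(0))
print(shared)
```

Larger values of `sigma` hide more of each client's individual prediction but also degrade the consensus that peers distill from, so the setting is a utility and privacy tradeoff.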

Robustness to adversarial and byzantine participants is addressed explicitly in recent frameworks, such as FedMLAC's layer-wise pruning aggregation, which filters outlier updates based on parameter deviation statistics to defend against poisoning or corrupted-label attacks (Bai et al., 11 Jun 2025).
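A deviation-based filter of the general kind described above can be sketched as follows. This is a generic stand-in, not FedMLAC's actual layer-wise pruning rule: the use of the coordinate-wise median as the reference point, the `tolerance` multiplier, and the data layout are all assumptions made for illustration.

```python
import math
import statistics

def layerwise_filtered_mean(client_layers, tolerance=3.0):
    """Average each layer over clients after dropping per-layer outliers.

    client_layers[k][name] is client k's flat parameter list for layer `name`.
    For each layer, a client is dropped when its Euclidean distance to the
    coordinate-wise median exceeds `tolerance` times the median distance.
    With tolerance >= 1 at least one client always survives.
    """
    aggregated = {}
    for name in client_layers[0]:
        vectors = [client[name] for client in client_layers]
        center = [statistics.median(col) for col in zip(*vectors)]
        dists = [math.dist(v, center) for v in vectors]
        cutoff = tolerance * statistics.median(dists)
        kept = [v for v, d in zip(vectors, dists) if d <= cutoff]
        aggregated[name] = [sum(col) / len(kept) for col in zip(*kept)]
    return aggregated

# Four well-behaved clients and one corrupted update (made-up values).
updates = [{"w": [1.0, 1.0]}, {"w": [1.1, 0.9]}, {"w": [0.9, 1.1]},
           {"w": [1.0, 1.0]}, {"w": [100.0, -100.0]}]
print(layerwise_filtered_mean(updates))  # close to {"w": [1.0, 1.0]}
```

Filtering per layer rather than per client means a participant whose update is corrupted in only one layer can still contribute its remaining layers.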

5. Empirical Results and Application Domains

Across multiple studies, FML frameworks yield superior accuracy, generalization, and personalization under both IID and non-IID data splits, often with faster or more stable convergence than FedAvg or parameter-averaging protocols.

For instance, on a face-mask detection task, distributed FML achieved 94.45% average accuracy, outperforming vanilla FedAvg (92.65%) and asynchronous weight aggregation (92.74%), with markedly lower communication burden (Gupta, 3 Mar 2025). In decentralized FML (e.g., Def-KT, DFML), accuracy gains of 2–5% over baselines are observed under severe heterogeneity in both data and model space (Li et al., 2020, Khalil et al., 2024). FedMLAC demonstrates 1–5% gains in F1 or accuracy for federated audio classification scenarios, particularly robust to noisy or adversarial data (Bai et al., 11 Jun 2025).

Application domains include computer vision (MNIST, CIFAR-10/100), speech and audio recognition (GSC, IEMOCAP), environmental sound recognition, and natural language tasks (Shakespeare) (Matsuda et al., 2021, Bai et al., 11 Jun 2025, Shen et al., 2020).

6. Extensions, Limitations, and Open Questions

FML algorithms have been extended to address various axes of heterogeneity:

Model-architecture heterogeneity: By aligning soft predictions rather than layers or weights, FML natively accommodates clients running distinct neural architectures (Matsuda et al., 2021, Bai et al., 11 Jun 2025, Khalil et al., 2024, Shen et al., 2020).
Data heterogeneity and personalization: Bidirectional distillation and per-client model components (local/personal/model exchange/plug-in) ensure that clients benefit from global knowledge without sacrificing performance on idiosyncratic data (Matsuda et al., 2021, Bai et al., 11 Jun 2025, Shen et al., 2020).
Decentralization and serverless protocols: Fully peer-to-peer implementations eliminate single-point failures and increase robustness (Li et al., 2020, Khalil et al., 2024).

Key limitations include the reliance, in some variants, on public datasets for prediction exchange, which may not be feasible in certain privacy settings (Gupta, 3 Mar 2025). Most theoretical analyses rely on FedAvg's convergence properties; formal non-convex convergence and privacy amplification guarantees remain largely open (Shen et al., 2020, Bai et al., 11 Jun 2025). The management of distillation hyperparameters (e.g., the temperature $T$, KL annealing, mutual learning weights such as $\lambda$) and their interaction with advanced privacy-preserving mechanisms (DP, encrypted aggregation) are active areas of inquiry. Adaptive scheduling and integration with robust aggregation have been proposed to further strengthen these frameworks (Gupta, 3 Mar 2025, Bai et al., 11 Jun 2025).

7. Representative Algorithms and Comparison Table

The FML landscape includes multiple algorithmic variants, distinguished by architecture, communication, and privacy strategies:

Framework	Centralized/Decentralized	Heterogeneous Model Support	Public Data Req.	Robustness Components	Reference
FML (loss exchange)	Centralized	Yes	Yes	Soft-prediction only	(Gupta, 3 Mar 2025)
FedMe	Centralized	Yes	Optional (for clustering)	Deep mutual learning + model selection	(Matsuda et al., 2021)
Def-KT	Decentralized	Yes	No	Peer-to-peer bidirectional distillation	(Li et al., 2020)
DFML	Decentralized	Yes (non-restrictive)	No	WSM, cyclic distillation, peer aggregation	(Khalil et al., 2024)
FedMLAC	Centralized	Partial (Plug-in fixed)	No	Layer-wise pruning aggregation (LPA)	(Bai et al., 11 Jun 2025)
FML (meme+personal)	Centralized	Yes	No	Personalized-local + meme mutual learning	(Shen et al., 2020)

Each framework demonstrates improved generalization and/or personalization under conditions of data and/or model heterogeneity, with varying privacy and robustness properties. FML has shifted the focus of federated optimization from direct parameter fusion to output-level consensus, providing a flexible foundation for next-generation FL systems.