Outlier-Robust Federated Learning
- Distributionally outlier-robust federated learning is a framework that combines DRO with explicit outlier mitigation to ensure robust model performance in heterogeneous client environments.
- The algorithm leverages unbalanced optimal transport and KL divergence to build adaptive ambiguity sets that limit the adverse impact of geometric shifts and extreme data contamination.
- Empirical results show significant improvements over conventional FL methods, with robust convergence, enhanced fairness, and reliable performance under worst-case scenarios.
A distributionally outlier-robust federated learning algorithm is a methodological framework for federated learning (FL) that integrates distributionally robust optimization (DRO) with explicit mechanisms to mitigate the influence of outliers and heterogeneity among client data. The approach hedges against worst-case data shifts and extreme sample contamination by constructing ambiguity sets (uncertainty sets in the space of probability distributions) that simultaneously capture both geometric perturbations and non-geometric outlier behavior. Advanced algorithmic solutions leverage unbalanced optimal transport and information-theoretic penalties, yielding tractable formulations with provable robustness certificates, practical scalability, and empirical improvements over standard federated and DRO baselines (Wang et al., 29 Sep 2025).
1. Motivation: Outlier Resilience in Distributionally Robust Federated Learning
Federated learning enables collaborative training across many clients with data sovereignty, but is fundamentally challenged by:
- Heterogeneity across clients, resulting in different underlying data distributions and temporal distribution shifts (covariate shifts);
- The pervasive presence of outliers due to mislabelings, adversarial corruption, sensor failures, or rare events.
Existing DRO-based FL methods focus on constructing ambiguity sets (e.g., Wasserstein balls) around empirical distributions to guard against general distributional shifts. However, these approaches are limited by "geometry-only" constraints—e.g., the empirical support is fixed and mass transportation is constrained—making them brittle in the face of outlier observations. Outliers can enlarge the ambiguity set excessively, resulting in conservative solutions or skewed models that generalize poorly under realistic contamination.
The key advancement in distributionally outlier-robust federated learning lies in defining ambiguity sets that incorporate both geometric shifts (e.g., location, scale, or density deformations) and a penalized relaxation that curbs the undue influence of individual outlier samples. This is achieved by integrating unbalanced Wasserstein (UW) metrics and non-geometric penalization (e.g., Kullback–Leibler divergence), yielding ambiguity sets that are adaptively sensitive to the degree and type of distributional contamination.
2. Mathematical Formulation: Unbalanced Wasserstein Ambiguity Sets and DRO Objectives
The central innovation is the use of an ambiguity set based on the unbalanced Wasserstein distance

$$
\mathrm{UW}_\tau\big(\mathbb{P}, \widehat{\mathbb{P}}\big) = \inf_{\pi \in \Pi(\mathbb{P},\, \cdot)} \int c(x, z)\, \mathrm{d}\pi(x, z) + \tau\, \mathrm{KL}\big(\pi_2 \,\big\|\, \widehat{\mathbb{P}}\big),
$$

where $\mathbb{P}$ is a hypothetical true distribution, $\widehat{\mathbb{P}}$ is the empirical (typically contaminated) distribution, $c$ is a ground cost, $\Pi(\mathbb{P}, \cdot)$ is the set of couplings with first marginal $\mathbb{P}$ and free second marginal $\pi_2$, $\mathrm{KL}$ is the Kullback–Leibler divergence, and $\tau > 0$ balances geometric vs. KL penalization. The ambiguity set collects all $\mathbb{P}$ with $\mathrm{UW}_\tau(\mathbb{P}, \widehat{\mathbb{P}}) \le \rho$ for a radius $\rho \ge 0$ (a numerical illustration follows below).
The overall learning goal becomes

$$
\min_{\theta \in \Theta} \; \max_{\lambda \in \Delta_K} \; \sup_{\mathbb{P}:\; \mathrm{UW}_\tau(\mathbb{P},\, \widehat{\mathbb{P}}_\lambda) \le \rho} \; \mathbb{E}_{\xi \sim \mathbb{P}}\big[\ell(\theta; \xi)\big],
$$

with $\widehat{\mathbb{P}}_\lambda = \sum_{k=1}^{K} \lambda_k \widehat{\mathbb{P}}_k$ being a convex combination of client distributions, and the adversary's per-sample payoff discounted by $s(\xi)$, where $s(\cdot)$ is a user-chosen outlier scoring penalty.
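Before listing the key properties, a minimal numerical sketch can make the outlier behavior of the UW metric concrete. This is not the paper's construction: it uses the POT library's entropic unbalanced Sinkhorn solver as a stand-in for the exact UW distance, and the values of `reg` (entropic smoothing) and `reg_m` (the marginal KL penalty, playing the role of $\tau$) are illustrative.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(50, 1))        # samples from a "true" distribution
contaminated = np.vstack([clean[:-1], [[10.0]]])  # same support, one point replaced by a far outlier

a = np.full(50, 1 / 50)                            # uniform empirical weights
b = np.full(50, 1 / 50)
M = ot.dist(clean, contaminated)                   # squared-Euclidean ground cost c(x, z)

# Classic (balanced) Wasserstein: every unit of mass must reach its target,
# so the lone outlier inflates the cost.
balanced = ot.emd2(a, b, M)

# Unbalanced OT with KL-relaxed marginals: the outlier's mass can be discounted
# at a bounded KL price instead of being transported.
unbalanced = ot.unbalanced.sinkhorn_unbalanced2(a, b, M, reg=0.5, reg_m=1.0)

print(f"balanced OT cost:   {float(balanced):.3f}")    # blown up by the single outlier
print(f"unbalanced OT cost: {float(unbalanced):.3f}")  # stays close to the clean-data cost
```

The gap between the two costs is exactly the brittleness that "geometry-only" ambiguity sets inherit, and that the KL relaxation removes.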
Key properties:
- The KL divergence term in the ambiguity set "softens" the marginal constraints, limiting the influence of rare or contaminated samples; clean samples dominate unless outliers are consistent and high-weight.
- The ambiguity set expands appropriately under geometric shifts, but remains tight under outlier contamination, yielding adaptive robustness.
- By dualization and Lagrangian relaxation, the nested min-max-max can be reformulated as a tractable penalized problem,
$$
\min_{\theta \in \Theta} \; \max_{\lambda \in \Delta_K} \;\; \gamma \rho + \tau \log \mathbb{E}_{\xi \sim \widehat{\mathbb{P}}_\lambda}\!\left[ \exp\!\Big( \tfrac{1}{\tau}\, \psi_\gamma(\theta; \xi) \Big) \right], \qquad \psi_\gamma(\theta; \xi) := \sup_{z}\, \big\{ \ell(\theta; z) - \gamma\, c(z, \xi) \big\} - s(\xi),
$$
where $\gamma \ge 0$ is a penalization parameter.
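A minimal PyTorch sketch of this penalized surrogate is given below, under stated assumptions: a linear model with logistic loss, a squared-Euclidean ground cost, and plain gradient ascent as the approximate inner solver. The function names, `gamma`, `tau`, `rho`, and the step counts are illustrative choices, not the paper's.

```python
import torch
import torch.nn.functional as F

def per_sample_loss(theta, X, y):
    """Per-sample logistic loss of a linear model (illustrative choice of ell)."""
    return F.binary_cross_entropy_with_logits(X @ theta, y, reduction="none")

def inner_max(theta, X, y, gamma, steps=20, lr=0.1):
    """Approximate z*(xi) = argmax_z { ell(theta; z) - gamma * ||z - xi||^2 } by gradient ascent."""
    z = X.clone().requires_grad_(True)
    for _ in range(steps):
        obj = (per_sample_loss(theta.detach(), z, y) - gamma * ((z - X) ** 2).sum(dim=1)).sum()
        g, = torch.autograd.grad(obj, z)
        z = (z + lr * g).detach().requires_grad_(True)
    return z.detach()

def penalized_objective(theta, X, y, score, gamma=1.0, tau=0.5, rho=0.1):
    """gamma*rho + tau * log E[exp(psi_gamma / tau)], with psi discounted by the outlier score s(xi)."""
    z_star = inner_max(theta, X, y, gamma)  # robust virtual samples (envelope theorem for the theta-gradient)
    psi = per_sample_loss(theta, z_star, y) - gamma * ((z_star - X) ** 2).sum(dim=1) - score
    log_n = torch.log(torch.tensor(float(len(X))))
    return gamma * rho + tau * (torch.logsumexp(psi / tau, dim=0) - log_n)
```

Minimizing this surrogate over $\theta$ with any stochastic optimizer addresses the outer problem; the $-s(\xi)$ offset inside $\psi_\gamma$ is what drives the exponential down-weighting of scored outliers.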
3. Algorithmic Framework: Decentralized Training and Robust Gradient Estimation
The outlier-robust federated learning algorithm (DOR-FL) proceeds with the following structure:
- Each client samples a data point from its local empirical distribution.
- Each client approximately solves the local inner maximization
$$
z^\star(\xi) \in \arg\max_{z} \; \big\{ \ell(\theta; z) - \gamma\, c(z, \xi) \big\}
$$
to produce a robust virtual sample.
- The client computes the exponentially tilted stochastic gradient
$$
g_\theta(\xi) = \exp\!\big( \psi_\gamma(\theta; \xi)/\tau \big)\, \nabla_\theta\, \ell\big(\theta; z^\star(\xi)\big)
$$
(suitably normalized) as a stochastic gradient estimator, and a similar estimator for $\lambda$ (the aggregation weights).
- The server aggregates updates by computing a (weighted) average of client models (projected onto the model set $\Theta$) and updates $\lambda$ using projected gradient ascent (a condensed simulation of this loop is sketched after the key facts below).
Key facts:
- The presence of the exponential weighting $\exp(\psi_\gamma(\theta;\xi)/\tau)$ in the gradient ensures outlier samples receive negligible influence if $\xi$ is anomalous or $s(\xi)$ is large.
- The client-side inner maximization admits tractable (approximate) solvers for typical losses and metrics; the overall method is scalable.
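The condensed, single-machine simulation below sketches the control flow of this loop under simplifying assumptions: linear least-squares clients, a few ascent steps for the virtual samples, Euclidean projection of $\lambda$ onto the simplex, and each client's mean tilted payoff as a stand-in for the exact $\lambda$-gradient. It illustrates the structure, not the paper's implementation.

```python
import torch

def simplex_proj(v):
    """Euclidean projection onto the probability simplex (Duchi et al., 2008)."""
    u, _ = torch.sort(v, descending=True)
    css = torch.cumsum(u, dim=0) - 1.0
    idx = torch.arange(1, len(v) + 1, dtype=v.dtype)
    k = int((u - css / idx > 0).nonzero().max())
    return torch.clamp(v - css[k] / (k + 1), min=0.0)

def client_update(theta, X, y, gamma=5.0, tau=0.5, lr_z=0.05, steps=10):
    """One client round: virtual samples, exponential tilting, weighted local gradient."""
    z = X.clone().requires_grad_(True)
    for _ in range(steps):  # inner maximization by gradient ascent
        obj = ((z @ theta - y) ** 2 - gamma * ((z - X) ** 2).sum(dim=1)).sum()
        g, = torch.autograd.grad(obj, z)
        z = (z + lr_z * g).detach().requires_grad_(True)
    z = z.detach()
    psi = (z @ theta - y) ** 2 - gamma * ((z - X) ** 2).sum(dim=1)
    w = torch.softmax(psi / tau, dim=0)  # exponential tilting: anomalous samples get ~0 weight
    grad = (w[:, None] * (2.0 * (z @ theta - y))[:, None] * z).sum(dim=0)
    return grad, psi.mean()

# --- server loop ---
torch.manual_seed(0)
K, d = 3, 5
clients = [(torch.randn(40, d), torch.randn(40)) for _ in range(K)]
theta = torch.zeros(d)
lam = torch.full((K,), 1.0 / K)
for _ in range(100):
    grads, risks = zip(*(client_update(theta, X, y) for X, y in clients))
    theta = theta - 0.01 * sum(l * g for l, g in zip(lam, grads))  # descent in theta
    lam = simplex_proj(lam + 0.1 * torch.stack(list(risks)))       # projected ascent in lambda
```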
4. Convergence Guarantees and Robustness Certificates
Under standard conditions (convexity/concavity, boundedness, strong convexity of transport cost), the algorithm enjoys:
- Convergence rate: with appropriately chosen step sizes, the optimality gap after $T$ communication rounds is bounded as $\mathcal{O}(1/\sqrt{T}) + \mathcal{O}(\epsilon)$, where $\epsilon$ is the precision of the inner maximization.
- Robustness certificate: let $\mathbb{P}^\star$ be the worst-case distribution for $\theta$ given $\widehat{\mathbb{P}}_\lambda$; then for the radius $\rho = \mathrm{UW}_\tau(\mathbb{P}^\star, \widehat{\mathbb{P}}_\lambda)$,
$$
\sup_{\mathbb{P}:\; \mathrm{UW}_\tau(\mathbb{P},\, \widehat{\mathbb{P}}_\lambda) \le \rho} \mathbb{E}_{\xi \sim \mathbb{P}}\big[\ell(\theta; \xi)\big] \;\le\; \gamma \rho + \tau \log \mathbb{E}_{\xi \sim \widehat{\mathbb{P}}_\lambda}\!\left[ \exp\!\big( \psi_\gamma(\theta; \xi)/\tau \big) \right],
$$
where $\psi_\gamma(\theta;\xi) = \sup_z \{\ell(\theta;z) - \gamma\, c(z,\xi)\} - s(\xi)$. This upper bound provides a robust performance guarantee over all distributions within the ambiguity set (a toy numerical check of its form follows below).
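As a toy numerical check of the bound's form (assuming $\psi_\gamma$ values computed as in the sketches above), the code below evaluates $\gamma\rho + \tau \log \mathbb{E}[\exp(\psi_\gamma/\tau)]$ and confirms it dominates the plain average of $\psi_\gamma$, as Jensen's inequality requires.

```python
import torch

def certificate(psi, gamma, rho, tau):
    """gamma*rho + tau * log E[exp(psi/tau)] over an empirical sample of psi values."""
    log_n = torch.log(torch.tensor(float(len(psi))))
    return gamma * rho + tau * (torch.logsumexp(psi / tau, dim=0) - log_n)

psi = torch.tensor([0.3, 0.5, 0.4, 2.0])  # last entry mimics a hard or outlying sample
bound = certificate(psi, gamma=1.0, rho=0.1, tau=0.5)
print(float(bound) >= float(psi.mean()))  # True: the certificate upper-bounds the average payoff
```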
5. Empirical Evidence: Synthetic and Real-World Validation
Experiments validate the approach in both stylized and practical FL settings:
- Synthetic datasets with multiple clients and contamination regimes (Gaussian mean shifts, outlier injection, Wasserstein shifts) demonstrate clear improvements in test accuracy (e.g., DOR-FL attaining 95% versus markedly lower accuracy, as low as 61%, for standard, Wasserstein-robust, and group DRO baselines).
- Real-world data (UCI Adult Income) partitioned by demographic subgroups confirms that DOR-FL outperforms baselines on fairness (lower excess risk for minority groups) and overall accuracy.
- Outlier scoring functions can be incorporated (e.g., penalizing samples with anomalous capital gain) to further reinforce resilience in domains prone to corruption; a hypothetical scoring rule is sketched below.
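A hypothetical scoring rule in the spirit of the capital-gain example might look like the snippet below; the feature name, threshold, and scale are illustrative choices, not taken from the paper.

```python
import numpy as np

def capital_gain_score(capital_gain, threshold=20_000.0, scale=1e-3):
    """s(xi) grows linearly once the capital-gain feature exceeds a chosen threshold."""
    return np.maximum(capital_gain - threshold, 0.0) * scale

gains = np.array([0.0, 5_000.0, 99_999.0])  # 99,999 is a well-known sentinel value in Adult Income
print(capital_gain_score(gains))            # -> [ 0.     0.    79.999]
```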
6. Technical Distinctions and Comparisons
DOR-FL advances over prior methodologies by:
- Avoiding the overly conservative behavior of classic DRO under severe contamination, by incorporating the KL penalization directly into the ambiguity set.
- Enabling tractable, decentralized training due to the Lagrangian penalization and closed-form (or efficiently approximable) update steps for both model and aggregation weights.
- Providing performance guarantees (robustness certificates) not only with respect to geometric distribution shifts but also adversarial contamination and outlier scenarios.
- Achieving empirical performance gains for both mean and worst-case (group-wise) prediction risk.
7. Limitations and Future Research Directions
Open directions include:
- Communication efficiency improvements: although the current method is tractable, further reduction in communication or oracle calls could scale the approach to even larger networks.
- Integration with privacy and security constraints: while variables such as the threshold or loss statistics can be securely aggregated, formal privacy guarantees under the unbalanced Wasserstein model remain to be developed.
- Extension to high-dimensional regimes and more complex noise/outlier patterns: tailoring the outlier scoring function and inner maximization to deep and structured data warrants further exploration.
Table: Key Components of Outlier-Robust FL Algorithms
| Component | Methodological Choice | Impact |
|---|---|---|
| Ambiguity Set | Unbalanced Wasserstein + KL penalty | Outlier suppression; adaptive geometric shift |
| Inner Maximization | Robust virtual sample per client | Locally filters distributional contamination |
| Aggregation | Projected, weighted server updates | Emphasizes hardest (highest-loss) clients |
| Robustness Certificate | Explicit upper bound on risk | Performance guarantee under contamination |
In summary, distributionally outlier-robust federated learning algorithms employ advanced ambiguity sets integrating unbalanced transport and information-theoretic divergence to construct models that are simultaneously robust to distributional shifts and resilient to extreme-value outliers. These developments offer provably rigorous and empirically validated mechanisms for building reliable federated systems in the presence of heterogeneity and contamination (Wang et al., 29 Sep 2025).