
FedDG: Federated Domain Generalization

Updated 18 December 2025
  • FedDG is a framework that combines federated learning and domain generalization to collaboratively train models on non-IID data while ensuring data privacy.
  • It leverages modern pre-trained architectures like ViT, Swin, and ConvNeXt to significantly enhance out-of-domain accuracy, as validated on benchmarks such as Office-Home and PACS.
  • Pre-training strategies, including large-scale supervised and self-supervised methods, are critical for optimizing performance under diverse and constrained federated environments.

Federated Domain Generalization (FedDG) addresses the challenge of collaboratively training a global model across multiple clients with non-identically distributed data—where no raw data can be exchanged—such that the resulting model generalizes robustly to unseen, out-of-distribution target domains. The core objective is to unify the privacy constraints and scalability of federated learning (FL) with the out-of-distribution robustness sought in domain generalization (DG) (Li et al., 2023). Recent developments have illuminated both the limitations of classical FL architectures under domain shift and the critical potential of advanced pre-trained backbones, large-scale pre-training, and innovative federated optimization for enhancing out-of-domain accuracy (Raha et al., 20 Sep 2024).

1. Formal Problem Formulation

Federated Domain Generalization is defined as follows: let $N$ clients $\{c_k\}_{k=1}^N$ each hold a labeled dataset $\mathcal{D}_k$ sampled from a client-specific distribution $P^{(k)}_{X,Y}$, with $P^{(k)}_{X,Y} \neq P^{(\ell)}_{X,Y}$ for $k \neq \ell$ (domain shift), but a shared conditional label distribution $P^{(k)}_{Y|X} = P^{(\ell)}_{Y|X}$ (same task). The goal is to collaboratively learn a global model $\mathcal{F}(\theta; x)$ such that the test error is minimized not only on held-in clients, but crucially on an unseen domain $U$, where $P^{(U)}_{X,Y} \neq P^{(k)}_{X,Y}$ for all $k$. Local objectives are

$$\mathcal{L}_k(\theta) = \frac{1}{n_k} \sum_{(x,y) \in \mathcal{D}_k} \ell(\mathcal{F}(\theta; x), y)$$

with global aggregation as

$$\min_\theta L(\theta) = \sum_{k=1}^N \frac{n_k}{\sum_j n_j} \mathcal{L}_k(\theta)$$

and the optimization is constrained to updates via federated protocols (e.g. FedAvg), without sharing raw or style-aligned data.
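
To make the objectives concrete, the following is a minimal sketch, assuming PyTorch-style models and DataLoaders; the function names are illustrative and not taken from the cited papers:

```python
# Minimal sketch of the per-client objective L_k and the sample-weighted
# global objective L(theta); raw data never leaves its client's loader.
import torch
import torch.nn.functional as F

def client_loss(model, loader, device="cpu"):
    """L_k(theta): mean cross-entropy over client k's local dataset D_k."""
    model.to(device).eval()
    total_loss, n_k = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total_loss += F.cross_entropy(model(x), y, reduction="sum").item()
            n_k += y.numel()
    return total_loss / max(n_k, 1), n_k

def global_objective(model, client_loaders):
    """L(theta) = sum_k (n_k / sum_j n_j) * L_k(theta)."""
    per_client = [client_loss(model, loader) for loader in client_loaders]
    total_n = sum(n_k for _, n_k in per_client)
    return sum((n_k / total_n) * loss_k for loss_k, n_k in per_client)
```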

This federated DG setup encompasses a broader and more challenging scenario than traditional DG or FL: non-IID domain shifts (covariate, style, context), strict privacy, and the necessity for out-of-sample generalization (Raha et al., 20 Sep 2024, Li et al., 2023).

2. Modern Pre-Trained Architectures in FedDG

A key finding is that replacing conventional ResNet backbones with advanced architectures—Vision Transformers (ViT), Swin Transformers, and ConvNeXt—substantially improves out-of-domain generalization in FL settings:

  • Vision Transformers (ViT): Input images are split into $16 \times 16$ patches; each patch is linearly embedded and processed through global self-attention layers. ViT-S and ViT-B variants (∼22.7M and ∼87M params) support long-range contextual reasoning with quadratic complexity in the patch count.
  • Swin Transformers: Implement windowed self-attention within local non-overlapping windows and use a "shifted" windowing scheme for cross-window connectivity. Swin-T/S/B span 28.8M to 88.2M parameters, providing hierarchical representations with reduced complexity.
  • ConvNeXt: Purely convolutional but redesigned with large kernels, inverted bottlenecks, and LayerNorm. ConvNeXt-Nano and ConvNeXt-Base (4.3M–89.1M params) achieve competitive results with ViTs, leveraging depthwise convolution and large receptive fields (Raha et al., 20 Sep 2024).

These architectures, when equipped with suitable pre-training (see below), consistently outperform similarly sized ResNet models by a significant margin, even at much lower parameter or FLOPs budgets.
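
As a concrete point of reference, these backbones are commonly instantiated from the timm model zoo. The sketch below assumes timm is installed; the model identifiers and their default pre-trained weights are assumptions to verify against the installed version (e.g. via `timm.list_models`), not the exact checkpoints of the cited study:

```python
# Hedged sketch: instantiating the backbones discussed above via timm.
import timm

NUM_CLASSES = 65  # e.g. Office-Home has 65 classes

BACKBONES = {
    "vit_b":         "vit_base_patch16_224",
    "swin_t":        "swin_tiny_patch4_window7_224",
    "convnext_base": "convnext_base",
    "beit_b":        "beit_base_patch16_224",  # BEiT-style MIM pre-training
}

models = {
    name: timm.create_model(arch, pretrained=True, num_classes=NUM_CLASSES)
    for name, arch in BACKBONES.items()
}

for name, model in models.items():
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name:14s} ~{n_params:.1f}M parameters")
```

Passing `num_classes` re-initializes the classification head for the federated task's label space while keeping the pre-trained feature extractor.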

3. Pre-Training Strategies: Supervised vs. Self-Supervised

Pre-training on massive and diverse datasets is a principal driver for FedDG:

  • Supervised Pre-training: Utilizes labeled datasets such as ImageNet-1K, 21K, 22K, or JFT-300M, employing standard cross-entropy loss.
  • Self-Supervised Pre-training (Mask Reconstruction): Methods like BEiT rely on masked image modeling (MIM), where a substantial portion of image patches is masked and the model reconstructs the masked content, optimized with a reconstruction loss (mean-squared error on patch pixels or embeddings, or cross-entropy over discrete visual tokens as in BEiT); a simplified sketch follows this list.
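
The sketch below illustrates the masked-reconstruction idea in simplified form. BEiT proper predicts discrete visual tokens from a separate tokenizer with a cross-entropy loss; this illustration regresses raw patch pixels with MSE to stay self-contained, and all names are illustrative:

```python
# Simplified masked image modeling (MIM) sketch: mask patches, reconstruct them,
# and score only the masked positions. Pixel regression is used for brevity;
# BEiT itself classifies discrete visual tokens instead.
import torch

def patchify(imgs, patch=16):
    """(B, C, H, W) -> (B, N, C*patch*patch) non-overlapping patches."""
    B, C, H, W = imgs.shape
    p = patch
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)

def mim_loss(encoder, decoder, imgs, mask_ratio=0.4, patch=16):
    """MSE on reconstructed patches, averaged over masked positions only."""
    patches = patchify(imgs, patch)                           # (B, N, D)
    B, N, _ = patches.shape
    mask = torch.rand(B, N, device=imgs.device) < mask_ratio  # True = masked patch
    visible = patches * (~mask).unsqueeze(-1)                 # crude masking: zero out
    pred = decoder(encoder(visible))                          # reconstruct all patches
    per_patch_err = ((pred - patches) ** 2).mean(dim=-1)      # (B, N)
    return (per_patch_err * mask).sum() / mask.sum().clamp(min=1)
```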

Empirical evidence demonstrates:

  • On complex, subtly shifted domains (e.g. Office-Home): Pre-training on very large, diverse datasets with supervised objectives remains preferable, though BEiT-based MIM demonstrates robustness across settings.
  • On visually distinct domains (e.g. PACS): Self-supervised contrastive and MIM objectives can match or surpass supervised pre-training, as they better capture the intrinsic structure of images and encourage robust, domain-invariant features (Raha et al., 20 Sep 2024).

4. Federated Optimization and Training Protocol

The federated training loop follows a classic FedAvg pattern:

  • The server broadcasts global weights $\theta^t$.
  • Each client performs $E$ local epochs of SGD to obtain $\theta_k^{t+1}$:

$$\theta_k^{t+1} = \theta^t - \eta \nabla \mathcal{L}_k(\theta^t)$$

  • The server aggregates client updates:

$$\theta^{t+1} = \sum_{k=1}^N \frac{n_k}{\sum_j n_j} \theta_k^{t+1}$$

This approach assumes model parameter compatibility across clients, with careful synchronization of advanced backbone structures and associated normalization or attention modules (Raha et al., 20 Sep 2024).
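
A minimal sketch of one such communication round is given below, assuming identical PyTorch model architectures on every client; helper names, optimizer settings, and the round driver are illustrative rather than taken from the cited work:

```python
# Minimal FedAvg round: broadcast theta^t, run E local epochs on each client,
# then aggregate with sample-count weights n_k / sum_j n_j.
import copy
import torch
import torch.nn.functional as F

def local_update(global_state, model, loader, epochs=1, lr=1e-3, device="cpu"):
    """Client k: start from theta^t, run E local epochs of SGD, return theta_k^{t+1}."""
    model.load_state_dict(global_state)
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x.to(device)), y.to(device)).backward()
            opt.step()
    return copy.deepcopy(model.state_dict()), len(loader.dataset)

def fedavg_aggregate(client_states, client_sizes):
    """Server: theta^{t+1} = sum_k (n_k / sum_j n_j) * theta_k^{t+1}."""
    total = sum(client_sizes)
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        weighted = sum((n / total) * state[key].float()
                       for state, n in zip(client_states, client_sizes))
        avg[key] = weighted.to(avg[key].dtype)  # preserve dtype of e.g. integer buffers
    return avg

def run_round(global_model, client_loaders, epochs=1):
    """One round: each client trains from the same broadcast weights, then aggregate."""
    states, sizes = [], []
    for loader in client_loaders:
        state, n_k = local_update(global_model.state_dict(),
                                  copy.deepcopy(global_model), loader, epochs)
        states.append(state)
        sizes.append(n_k)
    global_model.load_state_dict(fedavg_aggregate(states, sizes))
    return global_model
```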

5. Empirical Insights and Performance Benchmarks

Key results under leave-one-domain-out evaluation on Office-Home (4 domains, 65 classes) and PACS (4 domains, 7 classes):

| Backbone / Pre-training | Office-Home (%) | PACS (%) |
|---|---|---|
| ResNet-50 (IN-1K) | 70.8 | 79.2 |
| ConvNeXt-Nano (4.3M, IN-22K) | > ResNet-18 | – |
| ConvNeXt-Base (89.1M, IN-22K) | 84.46 | 92.55 |
| Swin-T/S/B (IN-22K) | – | – |
| ViT-B (BEiT, IN-21K, SSL) | 78.1 | 88.0 |

(– : not reported here.)
  • ConvNeXt-Base with IN-22K pre-training sets new benchmarks (84.46% Office-Home, 92.55% PACS).
  • Small ConvNeXt models (Nano, Pico) outperform larger ResNet-18/50 architectures at a fraction of the parameter and compute cost.
  • Self-supervised BEiT (ViT-B, IN-21K) surpasses prior contrastive SSL schemes, but in challenging mixed-domain regimes large-scale supervised pre-training retains an advantage.

These findings underscore that both pre-training data scale and architecture choice critically affect out-of-distribution generalization in federated heterogeneous environments (Raha et al., 20 Sep 2024).
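
For reference, the leave-one-domain-out protocol behind these numbers can be sketched as follows; `make_loader`, `make_model`, and `train_round` (e.g. the `run_round` sketch in Section 4) are illustrative placeholders, not a published implementation:

```python
# Leave-one-domain-out evaluation sketch: each domain is held out once as the
# unseen target, while the remaining domains act as federated clients.
import torch

DOMAINS = ["art_painting", "cartoon", "photo", "sketch"]  # the four PACS domains

@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    """Top-1 accuracy on the held-out (unseen) domain."""
    model.to(device).eval()
    correct, total = 0, 0
    for x, y in loader:
        correct += (model(x.to(device)).argmax(dim=1) == y.to(device)).sum().item()
        total += y.numel()
    return correct / max(total, 1)

def leave_one_domain_out(make_loader, make_model, train_round, rounds=50):
    """Hold out each domain once; federate the rest; report per-domain accuracy."""
    results = {}
    for held_out in DOMAINS:
        clients = [make_loader(d, split="train") for d in DOMAINS if d != held_out]
        model = make_model()                          # pre-trained backbone + fresh head
        for _ in range(rounds):
            model = train_round(model, clients)       # one federated communication round
        results[held_out] = evaluate(model, make_loader(held_out, split="test"))
    return results
```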

6. Recommendations and Design Principles for Resource-Constrained FedDG

  • Model selection: Under computational and communication constraints, small ConvNeXt or Swin-T architectures are optimal, surpassing ResNet-50 even at lower FLOPs. ViT-S with BEiT pre-training is a robust mid-sized alternative.
  • Pre-training dataset: Larger, more diverse pre-training datasets (IN-22K, JFT-300M) provide tangible gains, especially in benchmarks with overlapping domains.
  • Pre-training strategy: Select self-supervised strategies (MIM/BEiT) when target domains are visually distinct or when domain shifts are not subtle; leverage supervised large-scale pre-training on tasks with low-contrast domain gaps.
  • Efficiency: Advanced models with fewer parameters can deliver state-of-the-art generalization, an effect amplified when resource and privacy constraints preclude ensemble or large-scale parameter communication.
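
As a rough, non-authoritative summary, the recommendations above can be encoded as a simple selection heuristic; the parameter thresholds and timm-style identifiers below are assumptions to adapt, not prescriptions from the cited work:

```python
# Illustrative backbone/pre-training selection heuristic for resource-constrained
# FedDG, encoding the design principles listed above. Thresholds are assumptions.
def pick_backbone(param_budget_millions, domain_shift):
    """domain_shift: 'visually_distinct' (PACS-like) or 'subtle' (Office-Home-like)."""
    if param_budget_millions < 10:
        # Tiny budget: small ConvNeXt with large-scale supervised pre-training.
        return "convnext_nano", "supervised (IN-22K)"
    if param_budget_millions < 40:
        if domain_shift == "visually_distinct":
            # Mid budget, strong style shift: ViT-S with BEiT-style MIM pre-training.
            return "vit_small_patch16_224", "self-supervised MIM (BEiT-style)"
        # Mid budget, subtle shift: Swin-T with supervised pre-training.
        return "swin_tiny_patch4_window7_224", "supervised (IN-22K)"
    # Larger budget: ConvNeXt-Base with IN-22K pre-training.
    return "convnext_base", "supervised (IN-22K)"

print(pick_backbone(30, "visually_distinct"))
# -> ('vit_small_patch16_224', 'self-supervised MIM (BEiT-style)')
```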

7. Broader Implications and Future Directions

This study demonstrates that the combination of sophisticated neural architectures and broad, high-quality pre-training is pivotal for achieving state-of-the-art federated domain generalization under practical resource constraints. Notably, the gap with classical FedAvg or ResNet-based models is substantial and robust to variations in compute, communication, and dataset shift regime. Future directions include scaling these insights to even more participants, exploring hierarchical federation settings, and developing dynamic selection mechanisms for optimal backbone and pre-training choices on the fly (Raha et al., 20 Sep 2024).
