FedDG: Federated Domain Generalization
- FedDG is a framework that combines federated learning and domain generalization to collaboratively train models on non-IID data while ensuring data privacy.
- It leverages modern pre-trained architectures like ViT, Swin, and ConvNeXt to significantly enhance out-of-domain accuracy, as validated on benchmarks such as Office-Home and PACS.
- Pre-training strategies, including large-scale supervised and self-supervised methods, are critical for optimizing performance under diverse and constrained federated environments.
Federated Domain Generalization (FedDG) addresses the challenge of collaboratively training a global model across multiple clients with non-identically distributed data—where no raw data can be exchanged—such that the resulting model generalizes robustly to unseen, out-of-distribution target domains. The core objective is to unify the privacy constraints and scalability of federated learning (FL) with the out-of-distribution robustness sought in domain generalization (DG) (Li et al., 2023). Recent developments have illuminated both the limitations of classical FL architectures under domain shift and the critical potential of advanced pre-trained backbones, large-scale pre-training, and innovative federated optimization for enhancing out-of-domain accuracy (Raha et al., 20 Sep 2024).
1. Formal Problem Formulation
Federated Domain Generalization is defined as follows: let $K$ clients each hold a labeled dataset $D_k = \{(x_i, y_i)\}_{i=1}^{n_k}$ sampled from a client-specific distribution $P_k(x, y)$, with $P_k(x) \neq P_j(x)$ for $k \neq j$ (domain shift), but a shared conditional label distribution $P(y \mid x)$ (same task). The goal is to collaboratively learn a global model $f_w$ such that the test error is minimized not only on held-in clients, but crucially on an unseen domain $P_{K+1}$, where $P_{K+1} \neq P_k$ for all $k$. Local objectives are

$$F_k(w) = \mathbb{E}_{(x, y) \sim P_k}\left[\ell(f_w(x), y)\right],$$

with global aggregation as

$$F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad n = \sum_{k=1}^{K} n_k,$$

and the optimization is constrained to updates via federated protocols (e.g., FedAvg), without sharing raw or style-aligned data.
This federated DG setup encompasses a broader and more challenging scenario than traditional DG or FL: non-IID domain shifts (covariate, style, context), strict privacy, and the necessity for out-of-sample generalization (Raha et al., 20 Sep 2024, Li et al., 2023).
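To make this setup concrete, the sketch below (illustrative only: the Office-Home-style domain names, the `Client` container, and both helper functions are ours, not the paper's) constructs the leave-one-domain-out federation and computes the sample-weighted global objective $F(w) = \sum_k \frac{n_k}{n} F_k(w)$ from per-client local risks.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical per-domain datasets; each client k holds data from one domain
# with its own marginal P_k(x) but a shared label space (same task).
DOMAINS = ["Art", "Clipart", "Product", "Real_World"]  # e.g. Office-Home

@dataclass
class Client:
    domain: str
    num_samples: int  # n_k, used to weight the client's local risk

def make_federation(held_out: str, domain_sizes: Dict[str, int]) -> List[Client]:
    """Leave-one-domain-out: every domain except `held_out` becomes a client;
    `held_out` is never seen during federated training and is used only to
    measure out-of-distribution generalization."""
    return [Client(d, domain_sizes[d]) for d in DOMAINS if d != held_out]

def global_objective(local_risks: Dict[str, float], clients: List[Client]) -> float:
    """F(w) = sum_k (n_k / n) * F_k(w): sample-weighted average of local risks."""
    n_total = sum(c.num_samples for c in clients)
    return sum(c.num_samples / n_total * local_risks[c.domain] for c in clients)
```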
2. Modern Pre-Trained Architectures in FedDG
A key finding is that replacing conventional ResNet backbones with advanced architectures—Vision Transformers (ViT), Swin Transformers, and ConvNeXt—substantially improves out-of-domain generalization in FL settings:
- Vision Transformers (ViT): Input images are split into patches, each patch is linearly embedded and processed through global self-attention layers. ViT-S and ViT-B variants (∼22.7M and ∼87M params) support long-range contextual reasoning with quadratic complexity in the patch count.
- Swin Transformers: Implement windowed self-attention within local non-overlapping windows and use a "shifted" windowing scheme for cross-window connectivity. Swin-T/S/B span 28.8M to 88.2M parameters, providing hierarchical representations with reduced complexity.
- ConvNeXt: Purely convolutional but redesigned with large kernels, inverted bottlenecks, and LayerNorm. ConvNeXt-Nano and ConvNeXt-Base (4.3M–89.1M params) achieve competitive results with ViTs, leveraging depthwise convolution and large receptive fields (Raha et al., 20 Sep 2024).
These architectures, when equipped with suitable pre-training (see below), consistently outperform ResNet baselines by a significant margin, often even at much lower parameter or FLOPs budgets; a minimal backbone-construction sketch follows.
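As a sketch of how a client model for these families would typically be constructed (assuming the `timm` library and its standard pre-trained checkpoints; the specific model identifiers below are illustrative choices, not the paper's exact configurations):

```python
import timm  # assumption: timm supplies the pre-trained backbones
import torch.nn as nn

NUM_CLASSES = 65  # Office-Home has 65 classes

# Representative backbones from the three families discussed above
# (timm identifiers; exact checkpoints are an assumption).
BACKBONES = [
    "vit_small_patch16_224",         # ViT-S, ~22M params
    "swin_tiny_patch4_window7_224",  # Swin-T, ~28M params
    "convnext_base",                 # ConvNeXt-Base, ~89M params
]

def build_client_model(name: str) -> nn.Module:
    # `num_classes` swaps the pre-trained head for a fresh classifier.
    return timm.create_model(name, pretrained=True, num_classes=NUM_CLASSES)

for name in BACKBONES:
    model = build_client_model(name)
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {n_params:.1f}M parameters")
```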
3. Pre-Training Strategies: Supervised vs. Self-Supervised
Pre-training on massive and diverse datasets is a principal driver for FedDG:
- Supervised Pre-training: Utilizes labeled datasets such as ImageNet-1K, 21K, 22K, or JFT-300M, employing standard cross-entropy loss.
- Self-Supervised Pre-training (Mask Reconstruction): Methods like BEiT rely on masked image modeling (MIM), where a substantial portion of image patches is masked and the model reconstructs the masked tokens, optimized with mean-squared error on patch embeddings or discrete tokens.
Empirical evidence demonstrates:
- On complex, subtly-shifted domains (e.g. Office-Home): Pre-training on very large, diverse datasets with supervised objectives remains preferable, though BEiT-based MIM demonstrates robustness across settings.
- On visually distinct domains (e.g. PACS): Self-supervised contrastive and MIM objectives can match or surpass supervised pre-training, as they better capture the intrinsic structure of images and encourage robust, domain-invariant features (Raha et al., 20 Sep 2024).
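To make the masked-image-modeling objective concrete, here is a minimal sketch of the idea: mask a fraction of patch embeddings and regress the hidden patches with a mean-squared error. Note this is a simplification; BEiT itself predicts discrete visual tokens produced by a pre-trained tokenizer rather than regressing embeddings, and the modules passed in below are placeholders.

```python
import torch
import torch.nn as nn

def mim_loss(patch_embed: nn.Module, encoder: nn.Module, decoder: nn.Module,
             images: torch.Tensor, mask_ratio: float = 0.4) -> torch.Tensor:
    """Simplified masked-image-modeling objective: hide a fraction of patch
    embeddings and reconstruct them; BEiT instead predicts discrete tokens."""
    patches = patch_embed(images)                        # (B, N, D) patch embeddings
    B, N, D = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio  # True = masked
    mask_token = torch.zeros(D, device=patches.device)   # learnable in practice
    corrupted = torch.where(mask.unsqueeze(-1), mask_token, patches)
    reconstructed = decoder(encoder(corrupted))           # (B, N, D)
    target = patches.detach()                             # reconstruction target
    return ((reconstructed - target)[mask] ** 2).mean()   # MSE on masked positions only
```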
4. Federated Optimization and Training Protocol
The federated training loop follows a classic FedAvg pattern:
- The server broadcasts global weights $w^t$ to all participating clients.
- Each client $k$ performs $E$ local epochs of SGD on its objective $F_k$ to obtain $w_k^{t+1}$, i.e. repeated updates of the form $w \leftarrow w - \eta \nabla F_k(w)$ over local mini-batches.
- The server aggregates client updates: $w^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_k^{t+1}$.
This approach assumes model parameter compatibility across clients, with careful synchronization of advanced backbone structures and associated normalization or attention modules (Raha et al., 20 Sep 2024).
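The round structure can be sketched as follows (a framework-agnostic PyTorch sketch of plain FedAvg; client sampling, learning-rate schedules, and the exact local training script are assumptions, not the paper's implementation):

```python
import copy
from typing import List, Tuple

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def fedavg_round(global_model: nn.Module,
                 clients: List[Tuple[DataLoader, int]],
                 local_epochs: int = 1, lr: float = 1e-3) -> None:
    """One FedAvg communication round: broadcast w^t, run E local epochs of SGD
    on each client k, then aggregate w^{t+1} = sum_k (n_k / n) * w_k^{t+1}."""
    n_total = sum(n_k for _, n_k in clients)
    global_state = global_model.state_dict()
    # Running sample-weighted average of the clients' updated parameters.
    avg_state = {k: torch.zeros_like(v, dtype=torch.float32)
                 for k, v in global_state.items()}

    for loader, n_k in clients:
        local_model = copy.deepcopy(global_model)           # broadcast w^t
        optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        for _ in range(local_epochs):                        # E local epochs
            for x, y in loader:
                optimizer.zero_grad()
                criterion(local_model(x), y).backward()
                optimizer.step()
        for k, v in local_model.state_dict().items():
            avg_state[k] += (n_k / n_total) * v.float()      # weighted accumulation

    global_model.load_state_dict(
        {k: v.to(global_state[k].dtype) for k, v in avg_state.items()})
```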
5. Empirical Insights and Performance Benchmarks
Key results under leave-one-domain-out evaluation on Office-Home (4 domains, 65 classes) and PACS (4 domains, 7 classes):
| Backbone / Pre-training | Office-Home acc. (%) | PACS acc. (%) |
|---|---|---|
| ResNet-50 (IN-1K) | 70.8 | 79.2 |
| ConvNeXt-Nano (4.3M, IN-22K) | >ResNet-18 | — |
| ConvNeXt-Base (89.1M, IN-22K) | 84.46 | 92.55 |
| Swin-T/S/B (IN-22K) | — | — |
| ViT-B (BEiT, IN-21K, SSL) | 78.1 | 88.0 |
- ConvNeXt-Base with IN-22K pre-training sets new benchmarks (84.46% Office-Home, 92.55% PACS).
- Small ConvNeXt models (Nano, Pico) outperform larger ResNet-18/50 architectures at a fraction of the parameter and compute cost.
- Self-supervised BEiT (ViT-B, IN-21K) surpasses prior contrastive SSL schemes, although in challenging mixed-domain regimes large-scale supervised pre-training retains an advantage.
These findings underscore that both pre-training data scale and architecture choice critically affect out-of-distribution generalization in federated heterogeneous environments (Raha et al., 20 Sep 2024).
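For reference, the leave-one-domain-out protocol behind these numbers amounts to the loop below (a schematic sketch; `train_federated` and `evaluate` are placeholders for the full federated training and test routines, not functions from the paper):

```python
from statistics import mean

def leave_one_domain_out(domains, train_federated, evaluate):
    """For each domain d: train federatedly on the remaining domains (one client
    per domain), then measure accuracy on the held-out domain d. The benchmark
    score is the average over all choices of the held-out domain."""
    scores = {}
    for held_out in domains:
        source_domains = [d for d in domains if d != held_out]
        global_model = train_federated(source_domains)  # e.g. repeated FedAvg rounds
        scores[held_out] = evaluate(global_model, held_out)
    return scores, mean(scores.values())

# Example for Office-Home (4 domains); train_federated and evaluate are placeholders:
# per_domain, avg = leave_one_domain_out(
#     ["Art", "Clipart", "Product", "Real_World"], train_federated, evaluate)
```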
6. Recommendations and Design Principles for Resource-Constrained FedDG
- Model selection: Under computational and communication constraints, small ConvNeXt or Swin-T architectures are optimal, surpassing ResNet-50 even at lower FLOPs. ViT-S with BEiT pre-training is a robust mid-sized alternative.
- Pre-training dataset: Larger, more diverse pre-training datasets (IN-22K, JFT-300M) provide tangible gains, especially in benchmarks with overlapping domains.
- Pre-training strategy: Prefer self-supervised strategies (MIM/BEiT) when target domains are visually distinct (as in PACS); rely on large-scale supervised pre-training when domain gaps are subtler and lower-contrast (as in Office-Home).
- Efficiency: Advanced models with fewer parameters can deliver state-of-the-art generalization, an effect amplified when resource and privacy constraints preclude ensemble or large-scale parameter communication.
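A simple way to operationalize the model-selection recommendation is to filter candidate backbones by a parameter (and hence communication) budget before ranking them by the empirical results above. The sketch below assumes `timm` identifiers for the candidates, and the 30M budget is an arbitrary illustrative constraint, not a value from the paper.

```python
import timm  # assumption: timm identifiers for the candidate backbones

# Illustrative candidate pool and budget.
CANDIDATES = ["resnet50", "convnext_nano", "convnext_tiny",
              "swin_tiny_patch4_window7_224", "vit_small_patch16_224"]
PARAM_BUDGET_M = 30.0  # e.g. a per-round communication constraint, in millions of params

def params_in_millions(name: str) -> float:
    model = timm.create_model(name, pretrained=False)
    return sum(p.numel() for p in model.parameters()) / 1e6

feasible = {}
for name in CANDIDATES:
    size = params_in_millions(name)
    if size <= PARAM_BUDGET_M:
        feasible[name] = round(size, 1)

# Among the backbones that fit the budget, the findings above favour
# small ConvNeXt / Swin-T variants over a similarly sized ResNet.
print(feasible)
```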
7. Broader Implications and Future Directions
This study demonstrates that the combination of sophisticated neural architectures and broad, high-quality pre-training is pivotal for achieving state-of-the-art federated domain generalization under practical resource constraints. Notably, the gap with classical FedAvg or ResNet-based models is substantial and robust to variations in compute, communication, and dataset shift regime. Future directions include scaling these insights to even more participants, exploring hierarchical federation settings, and developing dynamic selection mechanisms for optimal backbone and pre-training choices on the fly (Raha et al., 20 Sep 2024).