Federated Multimodal Learning

Updated 8 June 2026

Federated Multimodal Learning is a decentralized framework that integrates federated and multimodal machine learning to fuse diverse data (images, audio, text) while keeping raw data local.
It uses fusion strategies such as concatenation, attention-based fusion, and parameter-efficient adapters to handle modality heterogeneity and improve communication efficiency.
The approach enhances privacy and compliance in sectors like healthcare, IoT, and autonomous systems, enabling robust collaborative training under variable resource constraints.

Federated Multimodal Learning (FML) integrates federated learning (FL) and multimodal machine learning, enabling decentralized collaborative training of models that fuse information from diverse data types (such as images, audio, text, and sensor data) while keeping raw data localized on each client for privacy and regulatory compliance. FML addresses a landscape where data are deeply heterogeneous: different clients may collect different subsets of modalities, the quality and frequency of modality acquisition vary, and communication and computational resources are highly constrained or variable. This paradigm poses algorithmic, architectural, and system challenges beyond those encountered in unimodal or classical FL settings and is central to applications spanning healthcare, autonomous systems, IoT, safety management, and next-generation AI foundation models.

1. Foundations: Problem Formulations and Learning Objectives

FML is instantiated across three principal FL paradigms—horizontal, vertical, and hybrid FL—each tailored for distinct patterns of data and modality partitioning across clients (Peng et al., 27 May 2025):

Horizontal FL: Clients own different samples but share a (possibly unioned) modality space. The goal is to collaboratively optimize

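$\min_\theta \sum_{i=1}^{M} \frac{N_i}{N} F_i(\theta), \quad F_i(\theta) = \frac{1}{N_i} \sum_{(x^{(1:K)}, y) \in \mathcal{D}_i} \ell(\theta; x^{(1:K)}, y),$

where $M$ is the number of clients, $\mathcal{D}_i$ is the local dataset of client $i$ with $N_i$ samples, $N = \sum_{i=1}^{M} N_i$, and $x^{(1:K)}$ denotes the $K$ modality inputs of one sample.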

Each client typically runs modality-specific encoders followed by a fusion block (late, attention-based, or co-attention).
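
A common way to approximately solve this objective is FedAvg-style aggregation: each client trains locally and the server averages the returned parameters with weights $N_i/N$. The sketch below shows only that server-side averaging step in plain Python. The function name `weighted_average` and the toy parameter vectors are illustrative assumptions, not taken from any of the cited systems.

```python
from typing import List


def weighted_average(client_params: List[List[float]],
                     client_sizes: List[int]) -> List[float]:
    """Average client parameter vectors with weights N_i / N."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    aggregated = [0.0] * dim
    for params, n_i in zip(client_params, client_sizes):
        weight = n_i / total  # the N_i / N factor in the objective
        for j in range(dim):
            aggregated[j] += weight * params[j]
    return aggregated


# Toy example: three clients with 10, 30, and 60 local samples.
params = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [10, 30, 60]
global_params = weighted_average(params, sizes)
print(global_params)
```

Clients with more samples pull the global model harder, which mirrors the sample-proportional weighting in the objective.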

Vertical FL: Each client holds features (modalities) from a shared set of samples. The server coordinates fusion:

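$\min_{\{\theta_0,\theta_1,\ldots,\theta_K\}} \frac{1}{N} \sum_{i=1}^N \ell\bigl( \theta_0(h_1(x_1^i),\ldots,h_K(x_K^i)), y^i \bigr).$

Here $i$ indexes the $N$ shared samples, $h_k$ is the local encoder of client $k$ with parameters $\theta_k$, $x_k^i$ is the modality of sample $i$ held by client $k$, and $\theta_0$ denotes the server-side fusion model applied to the client embeddings.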
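
The forward pass implied by this objective can be sketched as follows, assuming linear encoders for $h_k$ and a linear server head over concatenated embeddings. All names (`client_embed`, `server_predict`) and weights are hypothetical. The point is only that each client sends an embedding, never its raw features.

```python
from typing import List, Sequence


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def client_embed(encoder_weights, local_features) -> List[float]:
    """h_k(x_k): runs on client k; only the embedding leaves the client."""
    return linear(encoder_weights, local_features)


def server_predict(head_weights, embeddings) -> List[float]:
    """Server-side fusion: concatenate client embeddings, then apply the head."""
    fused = [v for emb in embeddings for v in emb]
    return linear(head_weights, fused)


# One shared sample whose image features live on client 1 and audio on client 2.
image_feats = [1.0, 2.0]
audio_feats = [0.5, -1.0, 2.0]
enc_image = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 encoder
enc_audio = [[1.0, 1.0, 1.0]]         # toy 1x3 encoder
head = [[1.0, 1.0, 2.0]]              # toy server head over 3 fused dims

embeddings = [client_embed(enc_image, image_feats),
              client_embed(enc_audio, audio_feats)]
prediction = server_predict(head, embeddings)
print(prediction)
```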

Hybrid FL: Both samples and features are partitioned; aggregation proceeds hierarchically across silos and devices.

FML systems often operate under modality heterogeneity, meaning clients may lack some modalities or possess unique configurations, requiring adaptive architectures and fusion strategies (Thrasher et al., 2023).

2. Architectures and Fusion Strategies

FML architectures are generally modular, consisting of modality-specific encoders (e.g., CNNs for images, LSTMs or Transformers for text and audio), followed by a fusion mechanism and a shared or client-personalized prediction head (Thrasher et al., 2023, Su et al., 21 Jan 2026, Liu et al., 20 Feb 2025, Yu et al., 2023).

Key fusion methods:

Concatenation: Concatenate modality embeddings before dense layers or classifiers, as deployed in end-to-end FML benchmarks (Feng et al., 2023).
Parameter-efficient fusion: In large foundation models (e.g., BLIP3o), only lightweight adapters are fine-tuned per modality/client and aggregated, drastically reducing communication (Su et al., 21 Jan 2026, Li et al., 2023).
Attention-based fusion: Use learnable attention mechanisms over modality representations, often yielding robustness under heterogeneity (Feng et al., 2023).
Decision-level and ensemble fusion: Per-modality heads "vote" or are combined at output, sometimes with Shapley- or performance-informed weighting (Yuan et al., 2023).
Contrastive and cross-modal losses: Regularize the fused space via intra- and inter-modal contrastive losses to mitigate drift and modality gaps (Yu et al., 2023, Tan et al., 5 Mar 2026; Med-MMFL, Chhetri et al., 4 Feb 2026).
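
To make the difference between the first and third methods concrete, the sketch below contrasts concatenation with a simple dot-product attention fusion over modality embeddings. It assumes all embeddings share one dimension and uses a fixed `query` vector where a real system would learn one. Neither function is taken from the cited benchmarks.

```python
import math
from typing import List, Tuple


def concat_fusion(embeddings: List[List[float]]) -> List[float]:
    """Concatenation: output size grows with the number of modalities."""
    return [v for emb in embeddings for v in emb]


def attention_fusion(embeddings: List[List[float]],
                     query: List[float]) -> Tuple[List[float], List[float]]:
    """Softmax-weighted sum of modality embeddings (fixed output size)."""
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(embeddings[0])
    fused = [sum(w * emb[j] for w, emb in zip(weights, embeddings))
             for j in range(dim)]
    return fused, weights


image_emb = [1.0, 0.0]
audio_emb = [0.0, 1.0]
print(concat_fusion([image_emb, audio_emb]))
print(attention_fusion([image_emb, audio_emb], query=[2.0, 0.0]))
```

Because the attention output has a fixed size and its weights adapt per sample, a noisy or weak modality can be down-weighted rather than passed through unchanged, which is one plausible reading of the robustness noted above.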

Recent advances support flexible, client-heterogeneous backbones (FedUMM, CreamFL), representation-level aggregation (CreamFL, FedAFD, FedMobile), and custom pruning/personalization layers for computation and communication efficiency (Nguyen et al., 10 Mar 2025).

3. Modality Heterogeneity: Missing Data, Selection, and Robust Aggregation

Addressing missing modalities and heterogeneity in modality availability is fundamental in FML:

Masking and zeroing: Missing-modality entries are zeroed in encoders and fusion layers, with meta-learning or MAML-style training enabling robust adaptation to new or absent modalities (Tran et al., 2023).
Selective communication: Clients upload only high-value modality models based on Shapley value–cost trade-offs, drastically reducing bandwidth (Yuan et al., 2023).
Knowledge distillation and imputation: Shared latent spaces (via autoencoders or conditional generators) enable local nodes to reconstruct missing modalities, along with contribution-aware aggregation (Liu et al., 20 Feb 2025).
Phase-wise/chained updates: FedMChain reduces "modality competition" by updating modalities in sequence, using error-compensation to preserve complementarity (Zhang et al., 1 Jun 2026).
Importance-scheduling and resource allocation: FlexMod schedules per-modality training using reinforcement-learning–derived importance metrics (prototypes, Shapley values), aligning updates to resource and informativeness constraints (Bian et al., 2024).
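
As a minimal illustration of the masking-and-zeroing idea in the first item, the sketch below builds a fixed-size fused vector from whatever modalities a sample has, filling absent ones with zeros and returning a presence mask. The modality names and dimensions in `MODALITY_DIMS` are hypothetical, and the meta-learning step described above is not shown.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical embedding sizes per modality, fixed across all clients.
MODALITY_DIMS = {"image": 3, "audio": 2, "text": 2}


def masked_concat(sample: Dict[str, Optional[List[float]]]
                  ) -> Tuple[List[float], List[int]]:
    """Concatenate embeddings in a fixed order, zero-filling missing ones."""
    fused: List[float] = []
    mask: List[int] = []
    for name, dim in MODALITY_DIMS.items():
        emb = sample.get(name)
        if emb is None:
            fused.extend([0.0] * dim)  # zero out the missing modality
            mask.append(0)
        else:
            if len(emb) != dim:
                raise ValueError(f"{name} embedding must have length {dim}")
            fused.extend(emb)
            mask.append(1)
    return fused, mask


# A client that never records audio.
fused_vec, presence = masked_concat({"image": [1.0, 2.0, 3.0],
                                     "text": [4.0, 5.0]})
print(fused_vec, presence)
```

Passing the mask alongside the fused vector lets a downstream layer distinguish a truly zero embedding from an absent modality.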

Such strategies are reported to outperform uniform, modality-agnostic approaches in both accuracy and resource utilization when modality availability is heterogeneous and communication is constrained.
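
The Shapley-based selection mentioned above can be sketched for a small modality set, where exact Shapley values are cheap to compute. The accuracy table `ACCURACY`, the upload costs, and the value-per-cost ranking rule are all made-up assumptions for illustration. The cited work may define the trade-off differently.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(players: List[str],
                   utility: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley value of each player under the given utility function."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def rank_by_value_per_cost(values: Dict[str, float],
                           cost_mb: Dict[str, float]) -> List[str]:
    """Hypothetical rule: prefer modalities with high Shapley value per MB."""
    return sorted(values, key=lambda m: values[m] / cost_mb[m], reverse=True)


# Made-up validation accuracy for each subset of modality models.
ACCURACY = {
    frozenset(): 0.0,
    frozenset({"image"}): 0.6,
    frozenset({"audio"}): 0.4,
    frozenset({"image", "audio"}): 0.8,
}
values = shapley_values(["image", "audio"], lambda s: ACCURACY[s])
ranking = rank_by_value_per_cost(values, {"image": 10.0, "audio": 1.0})
print(values, ranking)
```

Exact enumeration is exponential in the number of modalities, so it is only practical for the handful of modalities typical of a single client.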

4. Communication, Computation, and Privacy Considerations

FML intensifies classical FL concerns:

Communication cost: Model size scales with the number and dimensionality of modalities. Solutions include adapter-based tuning (LoRA, dual-adapters), quantization, sparsification, and selective modality upload (Su et al., 21 Jan 2026, Nguyen et al., 10 Mar 2025, Yuan et al., 2023).
Computation: Training only adapters, low-rank heads, or a subset of encoder layers locally keeps resource usage feasible for edge and IoT devices.
Privacy: Modalities often encode semantically rich data, heightening privacy risks. Strategies span:
- Local differential privacy with dimension reduction and Laplace or Gaussian noise (MLDP) (Yuan et al., 14 Feb 2025).
- Secure aggregation, homomorphic encryption (HE), secret-sharing, and secure multiparty computation for parameter/embedding sharing (Thrasher et al., 2023, Li et al., 2023).
- Selective transmission of high-level representations or embeddings, never raw data (Le et al., 2023, Tan et al., 5 Mar 2026, Yu et al., 2023).
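
The local differential privacy item can be illustrated with the standard Laplace mechanism applied to an embedding before upload. This is a generic sketch, not the MLDP method itself: the dimension-reduction step is omitted, and the clipping bound and `epsilon` are assumed inputs. Clipping to L1 norm at most `clip_bound` means any two inputs differ by at most `2 * clip_bound` in L1, so per-coordinate Laplace noise of scale `2 * clip_bound / epsilon` gives epsilon-local-DP for one release.

```python
import random
from typing import List


def clip_l1(vec: List[float], bound: float) -> List[float]:
    """Scale the vector down so its L1 norm is at most `bound`."""
    norm = sum(abs(v) for v in vec)
    if norm <= bound:
        return list(vec)
    scale = bound / norm
    return [v * scale for v in vec]


def privatize(embedding: List[float], clip_bound: float,
              epsilon: float, rng: random.Random) -> List[float]:
    """Clip, then add Laplace noise calibrated to L1 sensitivity 2 * clip_bound."""
    clipped = clip_l1(embedding, clip_bound)
    scale = 2.0 * clip_bound / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return [v + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
            for v in clipped]


rng = random.Random(0)  # seeded only to make the demo repeatable
noisy = privatize([3.0, 4.0], clip_bound=3.5, epsilon=1.0, rng=rng)
print(noisy)
```

Smaller `epsilon` means larger noise, which is the privacy-utility trade-off: the fused representation degrades as the guarantee tightens.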

Empirical findings report communication reductions of up to 100× with negligible performance loss when resource-aware strategies are adopted (Su et al., 21 Jan 2026, Yuan et al., 2023, Nguyen et al., 10 Mar 2025).
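
A back-of-the-envelope calculation shows where savings of this order can come from with low-rank adapters. For one weight matrix of shape `d_out x d_in`, a LoRA update of rank `r` has `r * (d_out + d_in)` trainable parameters instead of `d_out * d_in`. The layer size and rank below are illustrative assumptions, not figures from the cited papers.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in a rank-`rank` LoRA update (two thin matrices)."""
    return rank * (d_out + d_in)


def lora_compression_ratio(d_out: int, d_in: int, rank: int) -> float:
    """Full-matrix parameters divided by LoRA parameters for one layer."""
    return (d_out * d_in) / lora_param_count(d_out, d_in, rank)


def upload_megabytes(num_params: int, bytes_per_param: int = 4) -> float:
    """Payload size assuming uncompressed 32-bit floats by default."""
    return num_params * bytes_per_param / 1e6


d_out, d_in, rank = 4096, 4096, 8
ratio = lora_compression_ratio(d_out, d_in, rank)
print(f"full layer: {upload_megabytes(d_out * d_in):.1f} MB")
print(f"LoRA update: {upload_megabytes(lora_param_count(d_out, d_in, rank)):.2f} MB")
print(f"per-layer reduction: {ratio:.0f}x")
```

This counts a single adapted matrix. The end-to-end ratio depends on how many layers are adapted and on whether any non-adapter parameters are also exchanged.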

5. Benchmarks, Metrics, and Empirical Insights

Evaluation of FML systems leverages newly established, domain-specific standardized benchmarks:

FedMultimodal (Feng et al., 2023): Ten datasets spanning emotion recognition, activity recognition, medical imaging, and social media; robustness evaluated under missing modalities, labels, and label noise.
Med-MMFL (Chhetri et al., 4 Feb 2026): Five medical datasets (up to 4 modalities, 10 unique types)—tasks include segmentation, retrieval, classification, and VQA. Explores both natural and synthetic (Dirichlet) partitions.
Empirical metrics: Accuracy and AUROC (classification), Dice score (segmentation), recall@K (retrieval), F1 (VQA), communication per round (MB), and latency (ms or s) or system-wide throughput.
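
Two of the less familiar metrics can be stated compactly in code. The sketch below assumes binary segmentation masks flattened to lists for the Dice score, and defines recall@K as the fraction of relevant items found in the top K. Some retrieval benchmarks instead report a hit rate (1 if any relevant item appears in the top K), so the definition used by a given benchmark should be checked.

```python
from typing import List, Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 |A and B| / (|A| + |B|) for binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # convention: two empty masks count as a perfect match
    return 2.0 * intersection / total


def recall_at_k(ranked_ids: List[int], relevant_ids: List[int], k: int) -> float:
    """Fraction of relevant items that appear among the top-k ranked items."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
recall = recall_at_k([5, 3, 9, 1], relevant_ids=[3, 1], k=2)
print(dice, recall)
```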

Key insights include:

Attention or adaptive fusion boosts robustness under heterogeneity. FedOpt and FedProx outperform classic FedAvg in highly non-IID or label-skewed settings (Feng et al., 2023, Chhetri et al., 4 Feb 2026).
Representation-based and knowledge-distillation methods (CreamFL, FedMEKT, FedAFD) consistently outperform parameter-averaging under model, task, and modality heterogeneity (Yu et al., 2023, Le et al., 2023, Tan et al., 5 Mar 2026).
Communication-efficient parameter tuning techniques (LoRA, dual adapters) combined with sparse aggregation can achieve more than two orders of magnitude compression with negligible performance loss (Su et al., 21 Jan 2026, Nguyen et al., 10 Mar 2025).
FML better preserves privacy and supports compliance than centralized training or unimodal FL, but it is sensitive to privacy-utility trade-offs and may require domain-specific privacy accounting (Yuan et al., 14 Feb 2025, Li et al., 2023).
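
For context on the optimizer comparison in the first insight, FedProx changes only the local objective relative to FedAvg: in round $t$, client $i$ minimizes its local loss plus a proximal term that penalizes drift from the current global model $\theta^{t}$. Using the same $F_i$ and $\theta$ as in Section 1, and writing $\mu \ge 0$ for the proximal coefficient (a hyperparameter not specified in this article):

$\min_{\theta} \; F_i(\theta) + \frac{\mu}{2} \lVert \theta - \theta^{t} \rVert_2^2.$

Setting $\mu = 0$ recovers the FedAvg local objective, which is one way to read why the proximal term tends to help when client data are highly non-IID.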

6. Open Challenges and Future Directions

FML faces open problems at both theoretical and practical levels:

Theory: Convergence under modality and system heterogeneity; tight generalization bounds for partial, delayed, or asynchronous modality updates (Thrasher et al., 2023, Chhetri et al., 4 Feb 2026).
Scalable, robust aggregation: New protocols are required for modality- and node-aware client sampling, knowledge contribution quantification (Clustered-Shapley), and fault tolerance under missing data (Liu et al., 20 Feb 2025).
Personalization and adaptation: Client-specific model heads, adaptive fusion weights, and prompt-based adaptation for scalable deployment across edge, mobile, and institutional silos (Nguyen et al., 10 Mar 2025, Li et al., 2023).
Cross-paradigm integration: Hybrid FL that combines horizontal and vertical partitioning, with privacy and efficiency guarantees for both feature- and sample-partitioned multimodal data (Peng et al., 27 May 2025).
Privacy and trust: Exploring the impact of different privacy mechanisms on multimodal representation fusion, membership inference vulnerability, and secure incentive mechanisms for collaborative training (Thrasher et al., 2023, Li et al., 2023).
Benchmarking: Expansion of standardized, large-scale, real-world benchmarks encompassing more sensor types, cross-domain verticals, and real resource constraints (Feng et al., 2023, Chhetri et al., 4 Feb 2026).

Emerging directions include federated pre-training for foundation models, dynamic fusion architectures, adversarial robustness, application-specific adaptation (e.g., healthcare, urban safety, UAV networks (Shaon et al., 2 Oct 2025)), and automated incentive assignment for federated contributors (Li et al., 2023).

References: