
Federated Unpaired Foundation Models

Updated 1 February 2026
  • Federated Unpaired Foundation Models are a class of federated learning approaches that use frozen, large-scale foundation models combined with minimal client-specific adaptations to handle unpaired, heterogeneous data.
  • They employ techniques like adapter-based fine-tuning, prompt methods, and geometric alignment to enhance efficiency, privacy, and robustness across diverse modalities.
  • Empirical benchmarks indicate reduced communication overhead, faster convergence, and improved accuracy in multi-modal and privacy-sensitive scenarios.

Federated Unpaired Foundation Models refer to a family of federated learning (FL) frameworks that exploit the representational power of large-scale foundation models—typically pretrained on massive, centralized data—in scenarios where data held by participating clients is unpaired, non-IID, fragmented, multi-modal, or otherwise fundamentally heterogeneous and siloed. Rather than relying on classical FL assumptions (e.g., paired, labeled, IID, or even modality-aligned local data), these approaches integrate universal feature extractors or foundation model modules into the distributed training pipeline, typically in a way that eliminates or minimizes the need for adaptation of large backbone parameters. By leveraging frozen backbones, adapter-based fine-tuning, or prompt techniques, federated unpaired foundation model approaches increase efficiency, privacy, robustness, and generalization—especially in modern, real-world domain-adaptation and multi-modal settings.

1. Foundations and Motivations

Conventional FL frameworks (FedAvg and its variants) assume that client distributions are similar or at least share common structure (label spaces, modalities). In real applications—such as cross-institution medical imaging, multi-modal sensor fusion, or federated graph analysis—these assumptions routinely break down: clients present unpaired, non-overlapping data, severe class imbalance, or missing modalities. Foundation models (VFMs, LLMs, multi-modal transformers) offer universally pre-trained feature representations that are robust to such heterogeneity. Federated unpaired foundation model pipelines are built on three principles:

  • Only lightweight, domain- or client-specific modules (heads, adapters, prompts) are adapted, with the large backbone kept fixed at all sites.
  • Adaptation and aggregation are performed using only model updates, relational statistics, or auxiliary public data, never sharing private samples or raw outputs.
  • Alignment across modalities or domains is often enforced via geometric, kernel, or graph-based mechanisms rather than end-to-end data pairing or centralization.

Empirically and theoretically, this architecture yields improved convergence, reduced communication and computation overhead, stronger privacy guarantees, and superior robustness to domain shift, class imbalance, and client heterogeneity (Kihara et al., 10 Sep 2025, Eklund, 25 Jan 2026, Che et al., 2024, Park et al., 2024, Tastan et al., 3 Feb 2025).

2. Core Methodological Paradigms

Feature Freezing and Lightweight Adapters

A dominant paradigm is to fix the backbone parameters $\theta_f$ of a vision, language, or multimodal foundation model and adapt only a small head (e.g., an MLP classifier), a bottleneck, or a set of prompt vectors. Clients update these lightweight components on their private data using unsupervised or source-free domain adaptation (SFDA) objectives, such as pseudo-labeling, entropy minimization, and consistency regularization. Aggregation is typically performed via FedAvg or its weighted/robust variants, omitting the large backbone and thereby minimizing bandwidth (Kihara et al., 10 Sep 2025, Chang et al., 23 Apr 2025).

Mathematically, for input $x$, the model is

$$h(x; \theta_b, \theta_g) = g\big(b(f(x; \theta_f); \theta_b); \theta_g\big),$$

with $f(\cdot\,;\theta_f)$ fixed and only $\theta_b, \theta_g$ updated locally and aggregated globally (Kihara et al., 10 Sep 2025).
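This composition can be sketched in a few lines. The scalar "layers" below are hypothetical stand-ins for the real networks; only the names $f$, $b$, $g$ and the $\theta$ parameters follow the equation above.

```python
# Minimal sketch of h(x) = g(b(f(x; theta_f); theta_b); theta_g).
# All "layers" are illustrative scalar stand-ins, not real networks.

def f(x, theta_f):
    """Frozen foundation backbone: theta_f is never updated by clients."""
    return [theta_f * xi for xi in x]

def b(z, theta_b):
    """Lightweight bottleneck adapter: updated locally, aggregated globally."""
    return [theta_b * zi for zi in z]

def g(z, theta_g):
    """Task head: updated locally, aggregated globally."""
    return sum(theta_g * zi for zi in z)

def h(x, theta_f, theta_b, theta_g):
    return g(b(f(x, theta_f), theta_b), theta_g)

out = h([1.0, 2.0], theta_f=0.5, theta_b=2.0, theta_g=1.0)  # -> 3.0
```

In a real system only $\theta_b$ and $\theta_g$ are serialized and sent to the server each round; $\theta_f$ stays resident at every site.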

Multi-modal and Unpaired Client Data

In settings where clients’ data is multi-modal and unpaired—some clients holding only images, some only text, and some missing one or more modalities—federated systems rely on large, frozen cross-modal encoders combined with modality “completion” modules (Che et al., 2024, Eklund, 25 Jan 2026). Missing modalities are completed via foundation-model-based generative networks (e.g., DALL·E for text→image or BLIP for image→text), typically kept frozen to avoid overfitting. Client-side contrastive objectives (InfoNCE) are combined with cross-entropy losses to optimize joint representations, allowing the global FL model to benefit from distributed, fragmentary, and even synthetic input data.
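The contrastive piece of these client objectives can be illustrated with a toy, pure-Python InfoNCE. The embedding values and temperature are illustrative; real systems operate on normalized encoder outputs and add the cross-entropy term on labeled classes.

```python
import math

def infonce(img_emb, txt_emb, tau=0.1):
    """In-batch InfoNCE: for each image embedding, the matching text row
    is the positive and every other row in the batch is a negative."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    losses = []
    for i, u in enumerate(img_emb):
        logits = [dot(u, v) / tau for v in txt_emb]
        log_z = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_z - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)

aligned = infonce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = infonce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
# correctly paired embeddings yield a much smaller loss than mismatched ones
```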

Geometric and Kernel Alignment

To ensure that unpaired, modality- or domain-disjoint features can be meaningfully aggregated, federated unpaired foundation model pipelines often employ geometric alignment objectives. A notable example is centered kernel alignment (CKA) on Gram matrices computed over a public anchor set, which forces the local representations at each node to be aligned geometrically (i.e., up to orthogonal transformation and scaling), without requiring sharing of private data or features (Eklund, 25 Jan 2026). Subspace-stabilized low-rank adaptation techniques (GeoLoRA) and magnitude–direction decompositions (GeoDoRA) further ensure that semantic structure is harmonized across clients.
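A minimal pure-Python sketch of linear CKA between two clients' Gram matrices over a shared anchor set (toy 2-D features; an illustration of the alignment criterion, not the cited implementation). Because CKA compares centered Gram matrices, representations that differ only by rotation and scaling score 1.

```python
def gram(X):
    """Gram matrix K = X X^T over the anchor set (rows of X)."""
    return [[sum(a * b for a, b in zip(r, s)) for s in X] for r in X]

def center(K):
    """Double-center a Gram matrix: K' = H K H with H = I - 11^T/n."""
    n = len(K)
    row = [sum(r) / n for r in K]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + tot for j in range(n)]
            for i in range(n)]

def cka(K, L):
    """Linear CKA between two (non-degenerate) Gram matrices."""
    Kc, Lc = center(K), center(L)
    def frob(A, B):
        n = len(A)
        return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n))
    return frob(Kc, Lc) / (frob(Kc, Kc) ** 0.5 * frob(Lc, Lc) ** 0.5)

# Features that differ only by a 90-degree rotation have identical
# Gram geometry, so CKA is exactly 1:
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[0.0, -1.0], [1.0, 0.0], [1.0, -1.0]]
score = cka(gram(X), gram(Y))
```

Only the Gram matrices over the public anchors are exchanged, never the private features themselves.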

3. Algorithms and Communication Protocols

A representative generic pipeline for federated unpaired foundation models is as follows:

  • Server Pre-trains or Loads Foundation Model: The server pre-trains a large backbone (e.g., ViT, BERT, CLIP) or loads a public checkpoint. Source-labeled data for initial adaptation may be used and then discarded.
  • Broadcast Backbone or Anchor Statistics: The frozen backbone, relevant lightweight heads, adapters, or public anchor-set representations (for geometric alignment) are shared with all clients. In privacy-sensitive settings, model parameters may be encrypted or shared only as necessary (Tastan et al., 3 Feb 2025).
  • Client-side Local Adaptation: Each client optimizes only local heads, adapters, or prompts, using local objectives such as pseudo-label-based SFDA losses, contrastive objectives, or other minimalistic adaptation protocols.
  • Aggregation: Local updates are aggregated by the server using computationally efficient protocols: (i) FedAvg weighted by data volume or performance, (ii) graph- or similarity-weighted aggregation built on public (synthetic) data, or (iii) precision-weighted averaging using uncertainty estimates (Eklund, 25 Jan 2026, Che et al., 2024).
  • Advanced Privacy or Security: When model or data confidentiality is required, homomorphic encryption (FHE), privacy-preserving permutation, and secure aggregation protocols (MPC) are deployed to block data or model extraction (Tastan et al., 3 Feb 2025).

Communication and computation cost is minimized as only small parameter or statistical updates are transmitted per round; for example, freezing ViT backbones reduces client-server communication by ≈90× and per-sample local computation by ≈25× compared to full-model fine-tuning (Kihara et al., 10 Sep 2025).
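The scale of the saving is easy to sanity-check with assumed parameter counts (ViT-B has roughly 86M parameters; a head or adapter around 1M; both figures are illustrative rather than taken from the cited paper):

```python
# Back-of-envelope check of the per-round communication saving,
# under assumed (illustrative) parameter counts.
backbone_params = 86_000_000   # ~ViT-B backbone
head_params = 1_000_000        # lightweight head/adapter

full_ft_payload = backbone_params + head_params  # full fine-tuning sends everything
frozen_payload = head_params                     # frozen backbone sends only the head
ratio = full_ft_payload / frozen_payload         # 87.0, same order as the ~90x reported
```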

4. Adaptations for Multi-Modal and Graph Data

The core methodology extends naturally to non-vision modalities and graph-structured data. In federated graph learning, the distinctive challenge is to construct a global codebook that combines mutual reinforcement intra-domain and diversity across domains. FedBook introduces two-phase aggregation: first, intra-domain collaboration through frequency-guided token semantic alignment; second, inter-domain integration by up-weighting clients with semantically distinctive codebooks. This dual-phase protocol increases intra-domain coherence and prevents the semantic collapse of the global model (Wu et al., 9 Oct 2025). Analogous techniques for modality completion and alignment are used in multi-modal FL scenarios (Che et al., 2024).

5. Privacy, Robustness, and Generalization

Federated unpaired foundation models provide inherent privacy and robustness advantages:

  • Privacy: Backbones are either transmitted just once, encrypted, or even kept server-side with only FHE-friendly blocks communicated. Local data or proprietary model weights are never exposed, and information exchanged (e.g., Gram matrices, synthetic data, anchor statistics) cannot be inverted to recover sensitive data (Tastan et al., 3 Feb 2025, Eklund, 25 Jan 2026).
  • Robustness to Heterogeneity: Strong, frozen feature extractors exhibit resilience against domain gap, class imbalance, and label skew. Empirical results show reductions of up to 50% in source-target performance drop, improved adaptation under extreme label shift, and gains in global accuracy and class recall (Kihara et al., 10 Sep 2025, Abacha et al., 2024).
  • Generalization and Scalability: Lightweight adaptation enables efficient onboarding of new domains, seamless extension to new clients, and direct inference for unseen data sources using the globally aggregated prompts or heads (Chang et al., 23 Apr 2025, Tastan et al., 3 Feb 2025).
  • Reduced Communication: Only small updates (adapters, prompts, codebooks, or Gram matrices) are exchanged, enabling scalability to large numbers of clients.

6. Empirical Performance and Benchmarks

Experimental studies demonstrate the superiority of federated unpaired foundation models across modalities, architectures, and problem settings.

| Method | Modality | Performance Gains | Key Features | Reference |
|---|---|---|---|---|
| Frozen VFM (ViT-S/B) | Vision | +10.5–16.6 MAR pts on Office-Home, +4.2 on VisDA | Reduces comm. by ≈90×, comp. by ≈25× | (Kihara et al., 10 Sep 2025) |
| FedMVP | Image–Text | <8 pt accuracy drop at 80% missing modality | Modality completion, similarity-graph agg. | (Che et al., 2024) |
| Homog. Transformer (GeoLoRA/GeoDoRA) | Multi-modal | SOTA on MIMIC, robust privacy | CKA, public anchors, LoRA adapters, precision weighting | (Eklund, 25 Jan 2026) |
| FedBook | Graph | +2.9–4.5% vs. best federated GFM baseline | Intra-/inter-domain dual-phase aggregation | (Wu et al., 9 Oct 2025) |
| DP²FL | Vision–Language | 82.1–84.2% mean ACC, generalizes to new sources | Dual prompt, performance-based aggregation | (Chang et al., 23 Apr 2025) |
| Double-Blind FHE-Adapters | Vision (ViT) | Acc. close to full FT; ~0.25M params updated | FHE distillation, permutation sharing, MPC agg. | (Tastan et al., 3 Feb 2025) |
| DPSDA-FL | Vision (Diffusion) | +9% ACC, +26% recall vs. FedProx (non-IID) | DP synthetic augmentation, no data sharing | (Abacha et al., 2024) |
| FedBaF | Vision, Language | Up to +37.5 pp ACC (non-IID ViT) | Foundation-model bias in server-side agg., privacy | (Park et al., 2024) |

Benchmarks span Office-Home, VisDA, CIFAR-10/100, SVHN, MIMIC, CUB-200, Oxford-102, and a range of graph datasets (Cora, PubMed, PCBA, HIV), consistently showing federated unpaired foundation approaches outperforming both naive FL and centralized, non-federated baselines.

7. Limitations, Open Problems, and Future Directions

While federated unpaired foundation models demonstrate strong performance and scalability, several open challenges persist:

  • Architectural Compatibility: Methods like FedBaF require matching backbone and FL model architectures for weight sharing, which may not generalize to all settings (Park et al., 2024).
  • Server-Side Complexity: Protocols involving full codebook similarity computation (e.g., FedBook) can become expensive as client or token numbers increase, motivating research into approximate or scalable computation schemes (Wu et al., 9 Oct 2025).
  • Synthetic Data Fidelity and Privacy: Approaches relying on generative augmentations (e.g., DPSDA-FL) require rigorous privacy analysis of differentially private synthetic sample distribution and influence on global model integrity (Abacha et al., 2024).
  • Inter-modality and Open-set Adaptation: Automatic clustering for domain grouping, modality completion in open-world scenarios, and extending to hierarchical or multi-granularity codebooks remain open research avenues (Wu et al., 9 Oct 2025, Che et al., 2024).
  • Secure Aggregation and Model Extraction: Implementing secure aggregation via MPC and preventing model or data inversion must be balanced against communication and computation efficiency (Tastan et al., 3 Feb 2025).

A plausible implication is that optimal design in federated unpaired foundation model systems will require tight integration between frozen universal representations, privacy-preserving adaptation, problem-specific lightweight modules, and communication-efficient aggregation protocols for continued progress at scale.
