Federated Representation Learning (FedRep)

Updated 1 July 2026

Federated Representation Learning (FedRep) is a framework that decouples models into a global shared representation and local personalized heads for efficient and private learning.
Sequential layer-expansion variants schedule which body layers are trained and shared, either shallow-to-deep ("vanilla") or deep-to-shallow ("anti"), to balance the sharing of generic features against client-specific adjustments.
The approach enhances privacy and communication efficiency by exchanging only the representation block while keeping client-specific parameters local.

Federated Representation Learning (FedRep) is a paradigm in federated learning (FL) that aims to learn effective, generalizable data representations across distributed, heterogeneous, and privacy-constrained environments. The core principle is to explicitly decouple a model into a shared representation (the "body", "base", or "backbone") and local personalized heads. Clients collaborate on feature extraction while tailoring predictions to their own data, which mitigates the pitfalls of naive parameter sharing under non-IID client distributions.

1. Core Principles and Model Partitioning

Federated Representation Learning partitions model parameters into a globally shared representation (feature extractor/body) and a set of client-specific heads (classifiers or task adapters). For client $i$, the parameter set is $\theta_i = (\theta_{i,b}, \theta_{i,h})$: $\theta_{i,b} \in \mathbb{R}^{d_b}$ encodes the representation that is shared and aggregated via the server, while $\theta_{i,h} \in \mathbb{R}^{d_h}$ remains private and local to client $i$.

This approach generalizes many settings:

Simple embeddings (e.g., FURL, user-specific vectors) (Bui et al., 2019)
Deep architectures partitioned into base (convolutional, encoder blocks) and head (classifier, regression layers) (Collins et al., 2021, Jang et al., 2024)
Language embedding models (e.g., federated Word2Vec) (Bernal et al., 2021)

The global optimization objective is typically:

$\min_{\theta_b,\,\{\theta_{i,h}\}} \frac{1}{K} \sum_{i=1}^K \mathbb{E}_{(x,y)\sim D_i} \ell(h_{\theta_{i,h}}(\phi_{\theta_b}(x)), y)$

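where $\phi_{\theta_b}$ is the shared representation, $h_{\theta_{i,h}}$ is the head of client $i$, $D_i$ is that client's data distribution, $\ell$ is the loss, and $K$ is the number of clients.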

This parameter split enables the server to aggregate only the representation block, maintaining privacy and personalization since client-specific heads remain on-device and are never communicated (Bui et al., 2019, Collins et al., 2021).
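The partition and objective above can be made concrete with a small NumPy sketch. Everything here is an illustrative assumption rather than a reference implementation: the body is a single linear map with a ReLU, the heads are linear, the loss is squared error, and the names `body`, `head`, `client_loss`, and `global_objective` are hypothetical.

```python
import numpy as np

def body(theta_b, x):
    """Shared representation phi_{theta_b}(x): a linear map followed by ReLU."""
    return np.maximum(x @ theta_b, 0.0)

def head(theta_h, z):
    """Client-specific linear head applied to the representation z."""
    return z @ theta_h

def client_loss(theta_b, theta_h, x, y):
    """Mean squared error of head(body(x)) on one client's data."""
    pred = head(theta_h, body(theta_b, x))
    return float(np.mean((pred - y) ** 2))

def global_objective(theta_b, heads, datasets):
    """Average of the per-client losses, i.e. (1/K) * sum_i loss_i."""
    losses = [client_loss(theta_b, h, x, y) for h, (x, y) in zip(heads, datasets)]
    return sum(losses) / len(losses)

# Toy setup: sizes and data are arbitrary placeholders.
rng = np.random.default_rng(0)
K, d_in, d_rep, n = 3, 5, 2, 20
theta_b = rng.normal(size=(d_in, d_rep))                   # one shared body
heads = [rng.normal(size=d_rep) for _ in range(K)]          # one head per client
datasets = [(rng.normal(size=(n, d_in)), rng.normal(size=n)) for _ in range(K)]

objective = global_objective(theta_b, heads, datasets)
server_payload = theta_b   # only the body would be sent; heads stay local
```

In this sketch the only array a client would ever hand to the server is `theta_b`; the list `heads` never leaves the simulated devices.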

2. Optimization Algorithms and Scheduling Strategies

Canonical FedRep employs a simple alternating minimization. In each round, every participating client first solves for its head parameters locally (potentially to completion) with the body held fixed, then updates the body; the server aggregates the updated bodies into a new $\theta_b$ and broadcasts it. The communication protocol sends only $\theta_b$, reducing bandwidth and guarding head privacy (Collins et al., 2021, Bui et al., 2019).
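A minimal sketch of one such round in the linear setting follows, assuming squared loss, an exact least-squares head solve, a single gradient step on the body, and plain averaging at the server. The QR re-orthonormalization, the step size, and all function names are assumptions made for illustration, not details taken from the cited papers.

```python
import numpy as np

def local_update(B, X, y, lr):
    """One client step: exact head solve, then one gradient step on the body."""
    Z = X @ B                                       # representation of local data
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)       # head solved to completion
    residual = Z @ w - y
    grad_B = X.T @ np.outer(residual, w) / len(y)   # gradient of 0.5 * MSE w.r.t. B
    return B - lr * grad_B, w

def server_round(B, clients, lr=0.5):
    """Average the clients' body updates; heads are returned only for inspection."""
    updates = [local_update(B, X, y, lr) for X, y in clients]
    B_avg = np.mean([B_i for B_i, _ in updates], axis=0)
    Q, _ = np.linalg.qr(B_avg)                      # re-orthonormalize (assumption)
    return Q, [w for _, w in updates]

def subspace_distance(B, B_star):
    """Spectral norm of the part of B_star outside the column span of B."""
    return np.linalg.norm(B_star - B @ (B.T @ B_star), 2)

# Synthetic noiseless problem; all sizes are illustrative assumptions.
rng = np.random.default_rng(0)
d, k, num_clients, n = 10, 2, 20, 100
B_star, _ = np.linalg.qr(rng.normal(size=(d, k)))   # ground-truth representation
clients = []
for _ in range(num_clients):
    X = rng.normal(size=(n, d))
    w_star = rng.normal(size=k)                     # client-specific true head
    clients.append((X, X @ B_star @ w_star))

B, _ = np.linalg.qr(rng.normal(size=(d, k)))        # random orthonormal start
initial_distance = subspace_distance(B, B_star)
for _ in range(200):
    B, heads = server_round(B, clients)
final_distance = subspace_distance(B, B_star)
```

Comparing `initial_distance` with `final_distance` gives a rough check that the shared body moves toward the ground-truth subspace even though each client fits a different head.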

Key innovations include:

Sequential Layer Expansion: Deep models allow tuning the granularity of representation sharing. Let $\{b_1,\dots,b_L\}$ be the base (body) blocks, ordered from shallowest to deepest:

Vanilla (Forward) Scheduling: Sequentially unfreeze from low-level feature layers upward: first train and aggregate shallow layers, then progressively deeper blocks. This curriculum-style approach first establishes global, generic features before permitting more specialized, higher-level blocks to adapt (Jang et al., 2024).
Anti (Backward) Scheduling: The reverse schedule unfreezes from deep (class-specific) to shallow layers, enabling rapid personalization under class heterogeneity (Jang et al., 2024).

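In outline, generic sequential expansion divides training into stages. At each stage one additional body block is unfrozen, in the order fixed by the chosen schedule (shallow-to-deep for vanilla, deep-to-shallow for anti); clients train the currently unfrozen blocks together with their local heads, and the server aggregates the shared body parameters.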

Both approaches have been shown to mitigate conflicting gradients and improve personalization or communication efficiency depending on the degree of data and label heterogeneity (Jang et al., 2024).
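The two unfreezing orders can be sketched as a schedule generator. This is a hypothetical helper, not the procedure of Jang et al.: it assumes training rounds are split into equal-length stages and that one more block becomes trainable at each stage, with block 0 taken to be the shallowest.

```python
def expansion_schedule(num_blocks, num_rounds, order="vanilla"):
    """Return, for each round, the sorted indices of trainable body blocks.

    Rounds are split into `num_blocks` equal stages (an assumption); stage s
    has s + 1 unfrozen blocks, growing from the shallow end ("vanilla") or
    from the deep end ("anti"). Block 0 is the shallowest.
    """
    if order not in ("vanilla", "anti"):
        raise ValueError("order must be 'vanilla' or 'anti'")
    blocks = list(range(num_blocks))
    if order == "anti":
        blocks.reverse()                      # start unfreezing from the deepest block
    stage_len = max(1, num_rounds // num_blocks)
    schedule = []
    for t in range(num_rounds):
        stage = min(t // stage_len, num_blocks - 1)
        schedule.append(sorted(blocks[: stage + 1]))
    return schedule

vanilla = expansion_schedule(num_blocks=3, num_rounds=6, order="vanilla")
anti = expansion_schedule(num_blocks=3, num_rounds=6, order="anti")
```

A training loop would consult `schedule[t]` in round `t` to decide which blocks receive gradient updates; whether frozen blocks are still aggregated is a design choice this sketch leaves open.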

Representation matching regularization, as in (Mostafa, 2019), further constrains local representations to remain mappable (by a small matching layer) to the broadcasted global representation, discouraging drift in feature space without additional communication overhead.
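One way to picture the "mappable by a small matching layer" constraint is as a regression residual. The sketch below is an assumption-laden simplification: it uses a linear matching layer fitted in closed form by least squares, whereas a practical implementation would more likely train the matching layer by gradient descent alongside the model. The function name `matching_penalty` is hypothetical.

```python
import numpy as np

def matching_penalty(z_local, z_global):
    """Residual of the best linear map from local to global representations.

    z_local, z_global: arrays of shape (n_samples, rep_dim), computed on the
    same local batch by the client's current body and by the broadcast body.
    """
    M, *_ = np.linalg.lstsq(z_local, z_global, rcond=None)   # linear matching layer
    return float(np.mean((z_local @ M - z_global) ** 2))

rng = np.random.default_rng(0)
z_global = rng.normal(size=(64, 4))
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
z_rotated = z_global @ rotation        # same features expressed in a rotated basis
z_drifted = rng.normal(size=(64, 4))   # features unrelated to the global ones

penalty_rotated = matching_penalty(z_rotated, z_global)
penalty_drifted = matching_penalty(z_drifted, z_global)
```

A rotated copy of the global features incurs essentially no penalty because a linear layer can undo the rotation, while unrelated features are penalized; this is the sense in which the regularizer tolerates harmless reparameterization but discourages drift.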

3. Theoretical Properties and Convergence

The mathematical structure of FedRep has enabled sharp theoretical analyses:

Linear Setting: Alternating minimization (exact or approximate head solves, followed by one global gradient step on $\theta_b$) yields linear convergence to the ground-truth global representation under mild assumptions. The accompanying sample-complexity bound for reaching a target error is stated in terms of the number of clients, the number of samples held by each client, the body dimension, and the client participation ratio (Collins et al., 2021).
Under-Parameterization: When the global representation dimension $k$ is smaller than the rank of the collection of ground-truth models, averaging local representations can fail due to misalignment. The FLUTE algorithm addresses this by coupling standard data-fitting with regularizers that promote extraction of the top-$k$ subspace across clients, attaining provable convergence and a multiplicative gain in sample efficiency over centralized baselines in that regime (Liu et al., 2024).
Maximal Coding Rate Reduction (MCR²): The FLOW algorithm replaces cross-entropy with an information-theoretic objective that promotes discriminative, class-orthogonal and within-class compressible representations; it achieves first-order stationarity guarantees and experimentally exhibits near-centralized performance (Cervino et al., 2022).
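For the MCR² item, the quantity being maximized can be computed directly. The sketch below implements the standard centralized rate-reduction formula, with features stored as rows and a distortion parameter `eps` chosen arbitrarily; it does not show how FLOW distributes this objective across clients.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z) for Z of shape (n, d)."""
    n, d = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z.T @ Z)
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    """Delta R: rate of all features minus the size-weighted per-class rates."""
    n = Z.shape[0]
    within = sum(
        (np.sum(labels == c) / n) * coding_rate(Z[labels == c], eps)
        for c in np.unique(labels)
    )
    return coding_rate(Z, eps) - within

labels = np.array([0] * 10 + [1] * 10)
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
Z_orthogonal = np.vstack([np.tile(e1, (10, 1)), np.tile(e2, (10, 1))])  # classes on orthogonal axes
Z_collapsed = np.tile(e1, (20, 1))                                      # both classes on one axis

delta_orthogonal = rate_reduction(Z_orthogonal, labels)
delta_collapsed = rate_reduction(Z_collapsed, labels)
```

Features whose classes occupy orthogonal directions obtain a larger rate reduction than features whose classes collapse onto the same direction, which is the behavior the objective is designed to reward.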

Table: Summary of Theoretical Results

Setting	Result	Reference
Linear body + heads	Linear convergence, fast rates	(Collins et al., 2021)
Under-parameterized FRL	Provable convergence, sample efficiency	(Liu et al., 2024)
Info-theoretic (FLOW)	First-order stationarity	(Cervino et al., 2022)

4. Empirical Performance and Benchmarks

Reported evaluations show Federated Representation Learning outperforming classical FL (e.g., FedAvg) and naive personalized methods across a range of heterogeneity levels and domains:

Image classification: On CIFAR-100, sequential-layer-expansion FedRep variants achieve average accuracies of 59.52% (Vanilla) and 60.06% (Anti), compared to 41.24% for FedRep and 52.75% for FedBABU; Tiny-ImageNet results mirror this hierarchy (Jang et al., 2024).
Communication/Compute Efficiency: Freezing base blocks initially reduces total FLOPs by up to 64% for forward scheduling (Jang et al., 2024).
NLP: Federated Word2Vec delivers embedding quality and convergence time on par with centralized versions, with improved domain generalization of learned word vectors (Bernal et al., 2021).
Metric Learning: FLOW consistently provides better inter-class orthogonality and within-class diversity than federated or centralized cross-entropy objectives (Cervino et al., 2022).
User Embedding and Personalization: FURL demonstrates no performance drop versus centralized training (+8.39pp for FL vs. +7.85pp centralized on CTR AUC) and nearly identical user-embedding structure (Bui et al., 2019).

5. Extensions and Generalizations

Recent work has extended FedRep to new regimes:

Decentralized Collaboration: The diffusion-based algorithm Dif-AltGDmin adapts FedRep’s alternating gradient/minimization to fully decentralized networks, matching centralized convergence up to logarithmic factors in communication steps and removing the server as a single point of failure (Kang et al., 29 Dec 2025).
Clustered and Evolving Data: Fed-REACT combines self-supervised representation learning (causal dilated CNN encoder, contrastive SSL) with dynamic evolutionary clustering of clients, enabling personalized task learning that adapts to non-stationary, heterogeneous data (Chen et al., 8 Sep 2025).
Robustness to Feature Skew: FedCiR leverages mutual information regularization to enforce informative but client-invariant representations across highly non-IID feature distributions. This is implemented without raw data exchange by server-side variational distribution learning and per-client regularizers (Li et al., 2023).
Online Monitoring and Bandits: FCOM leverages federated ALS to jointly learn low-rank representations and client loadings under budgeted multi-armed bandit settings, tightly controlling regret and communication (Kosolwattana et al., 2024).
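To illustrate the decentralized extension, the server's averaging step can be replaced by repeated neighbor averaging over a communication graph. The sketch below shows only this generic consensus primitive on a ring topology with uniform weights, both of which are assumptions; it is not the Dif-AltGDmin algorithm itself.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic weights: each node averages itself and its two ring neighbours."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W

def gossip(bodies, W, steps):
    """Repeatedly replace each node's body by a weighted average of its neighbours'."""
    stacked = np.stack(bodies)                          # shape (n_nodes, d, k)
    for _ in range(steps):
        stacked = np.einsum("ij,jdk->idk", W, stacked)  # one round of neighbour mixing
    return stacked

rng = np.random.default_rng(0)
num_nodes = 5
bodies = [rng.normal(size=(4, 2)) for _ in range(num_nodes)]   # each node's local body
W = ring_mixing_matrix(num_nodes)

mixed = gossip(bodies, W, steps=50)
network_mean = np.mean(np.stack(bodies), axis=0)
max_deviation = float(np.max(np.abs(mixed - network_mean)))
```

After enough mixing steps every node holds approximately the network-wide average body, so the aggregation role of the server is reproduced without any central coordinator.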

6. Privacy, Personalization, and System-Efficiency

FedRep architectures provide a parameter-level form of privacy: client heads or embeddings are never transmitted, which reduces the attack surface and keeps personalization in local memory (Bui et al., 2019). This property is intrinsic to any personalized FL method that employs local head or adapter blocks (e.g., user embeddings, task heads).

Additionally, communication cost is reduced because only the representation block is exchanged. System efficiency is further enhanced by event-triggered communication (e.g., determinant-triggered aggregation in FCOM), and by postponing costly deep-layer updates until later in training (Kosolwattana et al., 2024, Jang et al., 2024). In extreme under-parameterized settings, FedRep augmented with subspace-alignment regularization remains efficient and empirically superior (Liu et al., 2024).
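The communication saving from exchanging only the body is simple arithmetic. The helper below is a hypothetical back-of-the-envelope calculation: the parameter counts are made up, parameters are assumed to be 4-byte floats, and each round is assumed to involve one upload and one download per client.

```python
def bytes_per_round(num_body_params, num_head_params, bytes_per_param=4, share_head=False):
    """Upload plus download payload for one client in one round."""
    shared = num_body_params + (num_head_params if share_head else 0)
    return 2 * shared * bytes_per_param

body_params = 1_000_000   # hypothetical backbone size
head_params = 50_000      # hypothetical classifier size

full_model = bytes_per_round(body_params, head_params, share_head=True)
body_only = bytes_per_round(body_params, head_params, share_head=False)
saving = 1 - body_only / full_model
```

The relative saving equals the head's share of the total parameter count, so it is modest when the head is small compared with the body; in that case the main benefit of keeping the head local is personalization and privacy rather than bandwidth.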

7. Open Problems and Future Directions

Open problems and limitations include:

Current theory covers linear and certain deep-model cases; rigorous guarantees for general nonlinear settings remain open (Collins et al., 2021).
Adaptive and decentralized clustering in representations, fully asynchronous participation, and combination with differential privacy remain active research topics (Chen et al., 8 Sep 2025).
Extending FedRep to other modalities (graphs, tabular, sequential) and more expressive, compositional heads (e.g., adapters, meta-representations) is ongoing.

A plausible implication is that hybridized methods, which combine representation sharing, robust regularization, dynamic clustering, and event-driven communication, will become dominant in large-scale, heterogeneous, privacy-sensitive FL deployments.