Federated Fine-tuning

Updated 28 May 2026

Federated fine-tuning adapts pre-trained models using decentralized data, improving privacy by removing the need to centralize raw data.
Methods use single-round or multi-round aggregation protocols, keeping communication low enough for large models to be fine-tuned at scale.
Recent work reports that models exceeding 1B parameters retain accuracy with a single communication round, sharply reducing communication cost.

Federated fine-tuning is a distributed machine learning paradigm designed to adapt large pre-trained neural network models—such as foundation models (FMs), LLMs, and multimodal transformers—to new tasks or domains using private, decentralized datasets. This approach leverages federated learning (FL) protocols to coordinate multiple clients (devices, organizations, or silos), enabling model adaptation without the need to centralize locally held data. Federated fine-tuning is technically challenging due to the resource demands of state-of-the-art models, the statistical heterogeneity of client data, and strict communication or privacy constraints. It encompasses both full-parameter fine-tuning and modern parameter-efficient methods (adapters, LoRA, tensorization), and supports a variety of modalities, architectures, and optimization schemes. Recent work demonstrates that, for sufficiently large foundation models, even a single communication round can suffice to match classic multi-round federated aggregation in accuracy, drastically improving scalability and accessibility (Wang et al., 2024). The following sections survey the mathematical foundations, algorithmic developments, efficiency techniques, practical considerations, and open challenges in federated fine-tuning.

1. Mathematical Formulation and Problem Setup

Federated fine-tuning aims to adapt a pre-trained model $w^{(0)} \in \mathbb{R}^d$ across $K$ clients, each holding a private dataset $D_k$ of size $n_k$. The canonical objective is to minimize the weighted sum of local empirical losses: $\min_{w \in \mathbb{R}^d} F(w) \equiv \sum_{k=1}^K p_k L_k(w)\,, \quad p_k = n_k / n\,, \quad n = \sum_{k=1}^K n_k\,,$ where

$L_k(w) = \frac{1}{n_k} \sum_{(x, y) \in D_k} \ell(w; x, y)$

and $\ell$ is a per-example loss such as cross-entropy (Wang et al., 2024). Common assumptions are that each $L_k$ is $L$-smooth (its gradient is $L$-Lipschitz) and that the weight change during fine-tuning is small relative to the pre-trained initialization: $\|\nabla L_k(w) - \nabla L_k(w')\| \leq L \|w - w'\|\,; \quad \|w - w^{(0)}\| \leq \tau \|w^{(0)}\|,\; \tau < 1\,.$ In multi-modal contexts, the loss may further mix distinct modalities and tasks, leading to more complex objective decompositions (Zheng et al., 11 Jun 2025).
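The objective and the small-drift assumption above can be made concrete with a minimal sketch. It assumes a linear model with squared-error loss as a stand-in for $\ell$ (the source mentions cross-entropy), and the helper names `local_loss`, `global_objective`, and `within_drift_budget` are hypothetical, introduced only for illustration.

```python
import math

def local_loss(w, data):
    """L_k(w): mean squared error of a linear model on one client's (x, y) pairs.
    Squared error is an illustrative stand-in for the per-example loss."""
    total = 0.0
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += (pred - y) ** 2
    return total / len(data)

def global_objective(w, client_data):
    """F(w) = sum_k p_k L_k(w) with p_k = n_k / n and n = sum_k n_k."""
    n = sum(len(d) for d in client_data)
    return sum(len(d) / n * local_loss(w, d) for d in client_data)

def within_drift_budget(w, w0, tau):
    """Check the small-drift assumption ||w - w0|| <= tau * ||w0||."""
    drift = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w0)))
    return drift <= tau * math.sqrt(sum(b ** 2 for b in w0))

# Two toy clients with n_1 = 2 and n_2 = 1 examples, so p = (2/3, 1/3).
clients = [
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],
    [([1.0, 1.0], 3.0)],
]
w0 = [1.0, 2.0]   # plays the role of the pre-trained weights w^(0)
w = [1.1, 1.9]    # plays the role of the fine-tuned weights

objective_at_w = global_objective(w, clients)
drift_ok = within_drift_budget(w, w0, tau=0.1)
```

The weights $p_k$ are computed from the dataset sizes rather than passed in, which keeps the sketch aligned with the definition $p_k = n_k / n$.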

Clients train models locally, either updating all parameters or using parameter-efficient modules. Communication with a central server aggregates updates, usually by weighted averaging. This process repeats for $T$ rounds, unless optimized for one-shot settings. Variants include hierarchy-based aggregation (Liu et al., 27 Mar 2025), clustered or asynchronous aggregation (Ni et al., 27 Mar 2025), and client personalization techniques (Zhang et al., 2024).

2. Core Algorithms: From Multi-Round to One-Shot Aggregation

The algorithmic foundation remains federated averaging (FedAvg), where the server distributes the global model $w^{(t)}$, clients perform local training, and the server aggregates client-updated parameters: $w^{(t+1)} = \sum_{k=1}^K p_k\, w_k^{(t+1)}$, with $w_k^{(t+1)}$ obtained from local SGD/Adam runs initialized at $w^{(t)}$ (Wang et al., 2024).
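A FedAvg loop of this kind can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the setup of the cited work: the model is linear with squared-error loss, local training is full-batch gradient descent rather than SGD/Adam, and the names `local_sgd`, `aggregate`, and `fedavg` are hypothetical.

```python
def local_sgd(w, data, lr, epochs):
    """Client step: full-batch gradient descent on squared error (toy stand-in
    for the local SGD/Adam runs), starting from the received global weights."""
    w = list(w)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / len(data)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def aggregate(client_weights, sizes):
    """Server step: weighted average sum_k p_k w_k with p_k = n_k / n."""
    n = sum(sizes)
    dim = len(client_weights[0])
    return [sum(s / n * wk[j] for wk, s in zip(client_weights, sizes))
            for j in range(dim)]

def fedavg(w0, client_data, rounds, lr=0.1, local_epochs=1):
    """Repeat: broadcast the global weights, train locally, aggregate."""
    w = list(w0)
    sizes = [len(d) for d in client_data]
    for _ in range(rounds):
        updates = [local_sgd(w, d, lr, local_epochs) for d in client_data]
        w = aggregate(updates, sizes)
    return w

# Toy clients whose data are all consistent with the weights [1, 2].
clients = [
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],
    [([1.0, 1.0], 3.0)],
]
w_final = fedavg([0.0, 0.0], clients, rounds=200)
```

With one local epoch per round, each round here reduces to a gradient step on the weighted objective, so the toy run approaches the shared minimizer; with more local epochs or heterogeneous data the client models drift apart between aggregations.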

A key advance is the realization that for large, smooth foundation models, a single communication round (each client runs all of its local epochs before sending one upload) produces a global model nearly indistinguishable from that obtained via multi-round FedAvg. Theorem 1 of Wang et al. (2024) bounds the discrepancy between the one-shot and multi-round models in terms of the number of communication rounds $T$ used by the multi-round baseline and the amount of local training per round.

Empirically, for models exceeding 1B parameters, the error contribution from the missing inter-round aggregation collapses: the flatness of the loss landscape and the tiny weight drift render the one-shot error negligible. This reduces the number of communication rounds by a factor of $T$ while preserving task accuracy, a finding supported across NLP and generative vision benchmarks (Wang et al., 2024).
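The link between small weight drift and a small one-shot error can be illustrated on a toy problem. The sketch below is not a reproduction of the cited theorem or experiments: it assumes a one-parameter linear model with squared-error loss, two clients with different curvature, and uses the step size as a proxy for how far the weights drift. All function names are hypothetical.

```python
import math

def local_gd(w, data, lr, steps):
    """Full-batch gradient descent on squared error for a linear model."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / len(data)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def average(models, sizes):
    """Weighted average of client models with p_k = n_k / n."""
    n = sum(sizes)
    return [sum(s / n * m[j] for m, s in zip(models, sizes))
            for j in range(len(models[0]))]

def multi_round(w0, clients, rounds, local_steps, lr):
    """Aggregate after every round of local training."""
    w, sizes = list(w0), [len(d) for d in clients]
    for _ in range(rounds):
        w = average([local_gd(w, d, lr, local_steps) for d in clients], sizes)
    return w

def one_shot(w0, clients, rounds, local_steps, lr):
    """Same total local work per client, but a single aggregation at the end."""
    sizes = [len(d) for d in clients]
    total_steps = rounds * local_steps
    return average([local_gd(w0, d, lr, total_steps) for d in clients], sizes)

def gap(a, b):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two heterogeneous toy clients (different curvature and different targets).
clients = [[([1.0], 1.0)], [([2.0], 0.0)]]
w0 = [0.0]

def one_shot_gap(lr, rounds=2, local_steps=1):
    return gap(multi_round(w0, clients, rounds, local_steps, lr),
               one_shot(w0, clients, rounds, local_steps, lr))

gap_large_step = one_shot_gap(lr=0.1)    # larger drift: gap of about 0.03
gap_small_step = one_shot_gap(lr=0.01)   # smaller drift: gap of about 0.0003
```

In this toy setting, shrinking the step size by a factor of 10 shrinks the gap between the one-shot and multi-round models by roughly a factor of 100, and the gap vanishes entirely when the clients are identical.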

More broadly, federated fine-tuning frameworks now encompass:

Classic FedAvg / Multi-round fine-tuning: Robust to moderate heterogeneity but high communication (Wang et al., 2024).
One-shot or "one communication round" protocols: Extremely communication-efficient for large models if local update drift remains small (Wang et al., 2024).
Clustered, hierarchical, and asynchronous schemes: Group clients for local aggregation before a global step, mitigating straggler and heterogeneity effects (Ni et al., 27 Mar 2025, Liu et al., 27 Mar 2025).
Personalization and expert-matching: Dynamic mixture-of-experts and fine-grained adaptation per client (Zhang et al., 2024).

3. Parameter-Efficient and Modular Fine-Tuning Strategies

Due to the prohibitive resource requirements for full-model fine-tuning, federated fine-tuning has widely adopted parameter-efficient techniques:

Adapter/Prompt/BitFit Approaches: Only a small adapter, prompt, or bias vector is trained on each client; remaining weights are frozen. CLIP-bias and ViT-adapter can reach near-centralized performance in federated settings (Chen et al., 2022).
LoRA (Low-Rank Adaptation): Each trainable weight $W$ in the backbone is decomposed as $W = W_0 + BA$, where $W_0$ is the frozen pre-trained weight and $B$, $A$ are low-rank factors that are separately optimized and aggregated. LoRA achieves strong efficiency gains and is now the default for LLMs in FL (Liu et al., 2024, Ghiasvand et al., 2024).
Tensorized Adapters (FedTT/FedTT+): Instead of full or LoRA matrices, adapters are tensor-train (TT) decomposed, further reducing communication by up to 10$\times$ versus LoRA and improving robustness in non-IID splits (Ghiasvand et al., 2024).
Representation Fine-Tuning (FedReFT): Direct intervention layers are trained on internal activations rather than weights, with aggregation schemes like All-but-Me to control semantic drift (Siddika et al., 27 Aug 2025).
Mixture-of-Experts, Masking, Clustered Modules: Approaches such as FedHFT use mixture-of-masked adapters, and FedAMoLE introduces adaptive mixtures and dynamic expert assignments for heterogeneity (Ilhan et al., 15 Oct 2025, Zhang et al., 2024).
Proxy and Compressed FMs: Sub-model or compressed FM versions fine-tune all layers, with careful alignment (