Federated Fine-Tuning
- Federated fine-tuning is a distributed adaptation technique where edge clients collaboratively fine-tune large pre-trained models locally without sharing raw data.
- It employs methods like FedAvg and parameter-efficient tuning (e.g., LoRA) to overcome challenges posed by non-IID data and resource heterogeneity.
- The approach prioritizes communication efficiency, privacy, and practical deployment across devices, making it key to scalable foundation-model adaptation.
Federated fine-tuning is a model adaptation paradigm that enables distributed edge clients to collaboratively tailor large pre-trained models—such as transformers, LLMs, or multimodal architectures—to downstream tasks without sharing raw data. This workflow is motivated by strict privacy requirements, resource constraints, and increasing model parameterization. Over the past several years, federated fine-tuning (or "FedFT") and its variants have become central to privacy-preserving ML, supporting deployment of foundation models across heterogeneous environments such as mobile devices, enterprises, and wireless sensor networks.
1. Methodological Foundations and Objectives
Federated fine-tuning seeks to adapt large pre-trained models on client data that remains local, typically through rounds of collaborative optimization coordinated by a centralized or hierarchical server. The general objective is

$$\min_{w} \; F(w) = \sum_{k=1}^{K} \frac{n_k}{n}\, F_k(w), \qquad F_k(w) = \frac{1}{n_k} \sum_{(x, y) \in \mathcal{D}_k} \ell(w; x, y),$$

where each client $k$ with private dataset $\mathcal{D}_k$ of size $n_k$ (and $n = \sum_k n_k$) minimizes its own local loss $F_k$, and clients are weighted by data size. This structure is preserved for both full-model and parameter-efficient fine-tuning schemes (e.g., LoRA).
The defining constraints in federated fine-tuning are:
- Data privacy: No raw data or feature traces are uploaded to any server.
- Resource heterogeneity: Clients may possess widely varying compute/storage capabilities.
- Non-IID data distributions: Client datasets may be highly skewed, impeding naive aggregation.
- Communication efficiency: Large foundation models have billions of parameters, making direct exchange impractical.
Fine-tuning thus demands algorithms that (i) personalize global models to account for local data, (ii) are lightweight enough for resource-constrained devices, and (iii) cope with both statistical and system heterogeneity (Ni et al., 27 Mar 2025, Liu et al., 28 Dec 2024, Yan et al., 8 Jan 2025).
2. Core Algorithms and System Architectures
A variety of communication and aggregation schemes have been proposed:
Standard (FedAvg-based) Federated Fine-Tuning
- Each round $t$: the server broadcasts the current model $w^t$; clients perform local optimization (typically SGD or Adam) for $E$ epochs and send parameter/model deltas $\Delta_k^t$ to the server.
- The server aggregates the deltas, usually as a data-size-weighted average, $w^{t+1} = w^t + \sum_k \frac{n_k}{n} \Delta_k^t$ (a minimal sketch follows this list).
- Limitation: Cross-client interference under data heterogeneity causes model drift and slow convergence.
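As a concrete reference point, the server-side step can be sketched as follows (plain NumPy; the dict-of-arrays representation and names are illustrative assumptions, not taken from any cited system):

```python
import numpy as np

def fedavg_aggregate(global_params, client_deltas, client_sizes):
    """One FedAvg update: w^{t+1} = w^t + sum_k (n_k / n) * delta_k.

    global_params : dict[str, np.ndarray]        current global model w^t
    client_deltas : list[dict[str, np.ndarray]]  per-client parameter deltas
    client_sizes  : list[int]                    local dataset sizes n_k
    """
    total = float(sum(client_sizes))
    return {
        name: value + sum((n_k / total) * delta[name]
                          for delta, n_k in zip(client_deltas, client_sizes))
        for name, value in global_params.items()
    }
```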
Parameter-Efficient Federated Fine-Tuning (PEFT)
- Federated LoRA: Only low-rank matrices (adapters) are trainable. FedAvg aggregation is performed on the LoRA blocks, while the main backbone remains frozen (Babakniya et al., 2023, Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025, Zhao et al., 13 Oct 2025).
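Concretely, each client uploads only its adapter tensors and the server averages just those entries; a hedged sketch follows (the `"lora_"` key convention mirrors common PEFT naming and is an assumption, not a requirement of the cited methods):

```python
def extract_lora_update(state_dict):
    """Keep only trainable LoRA tensors; the frozen backbone never leaves the client."""
    return {k: v for k, v in state_dict.items() if "lora_" in k}

def aggregate_lora(adapter_updates, client_sizes):
    """FedAvg restricted to the adapter subspace (works for NumPy arrays or torch tensors)."""
    total = float(sum(client_sizes))
    keys = adapter_updates[0].keys()
    return {
        k: sum((n / total) * upd[k] for upd, n in zip(adapter_updates, client_sizes))
        for k in keys
    }
```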
Hierarchical and Clustered Aggregations
- Devices are grouped by similarity/data statistics. Intra-group averaging reduces local non-IID effects; group heads then synchronize at coarser intervals (Ni et al., 27 Mar 2025, Liu et al., 27 Mar 2025).
- Multi-level tree topologies accommodate network tiers: device → edge aggregator → central node.
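The same weighted average can simply be applied once per tier; a schematic sketch (group membership is assumed to be given by the clustering step):

```python
def weighted_average(updates, sizes):
    total = float(sum(sizes))
    keys = updates[0].keys()
    return {k: sum((n / total) * u[k] for u, n in zip(updates, sizes)) for k in keys}

def hierarchical_aggregate(group_updates, group_sizes):
    """Two-level aggregation: average within each group, then across group heads,
    weighting each group by its total data volume."""
    group_models = [weighted_average(u, s) for u, s in zip(group_updates, group_sizes)]
    group_totals = [sum(s) for s in group_sizes]
    return weighted_average(group_models, group_totals)
```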
Asynchronous Federated Fine-Tuning
- Clients operate on arbitrary schedules, with server-side "staleness-aware" aggregation rules (e.g., weighting updates by age) (Ni et al., 27 Mar 2025).
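The exact staleness rule differs across systems; the polynomial age discount below is a generic illustration, not the specific rule from the cited survey:

```python
def staleness_weight(age_in_rounds, alpha=0.5):
    """Down-weight an update that arrives `age_in_rounds` rounds late."""
    return (1.0 + age_in_rounds) ** (-alpha)

def async_apply(global_params, delta, age_in_rounds, server_lr=1.0):
    """Apply a single stale client delta to the global model as soon as it arrives."""
    w = server_lr * staleness_weight(age_in_rounds)
    return {k: v + w * delta[k] for k, v in global_params.items()}
```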
Split Federated Fine-Tuning
- Model is partitioned, e.g., embedding/trunk at client and encoder/head at server. This reduces local memory/computation and bandwidth (only intermediate activations/gradients exchanged) (Cao et al., 24 Jul 2024, Wang et al., 3 Jul 2024, Yan et al., 8 Jan 2025).
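Mechanically, a split round stitches two autograd graphs together by exchanging activations (downstream) and their gradients (upstream). The PyTorch sketch below uses an illustrative toy partition and layer sizes, not the exact split of the cited systems:

```python
import torch
import torch.nn as nn

# Hypothetical partition: client keeps the embedding trunk, server keeps encoder + head.
client_part = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(start_dim=1))
server_part = nn.Sequential(nn.Linear(16 * 64, 128), nn.ReLU(), nn.Linear(128, 10))
client_opt = torch.optim.SGD(client_part.parameters(), lr=0.1)
server_opt = torch.optim.SGD(server_part.parameters(), lr=0.1)

def split_step(tokens, labels):
    """One forward/backward pass split across client and server."""
    acts = client_part(tokens)                   # client forward
    sent = acts.detach().requires_grad_(True)    # only activations cross the network

    server_opt.zero_grad()
    loss = nn.functional.cross_entropy(server_part(sent), labels)
    loss.backward()                              # server backward on its half
    server_opt.step()

    client_opt.zero_grad()
    acts.backward(sent.grad)                     # activation gradient returned to client
    client_opt.step()
    return loss.item()

# tokens: (batch, 16) token ids; labels: (batch,) class ids
loss = split_step(torch.randint(0, 1000, (4, 16)), torch.randint(0, 10, (4,)))
```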
Personalization and Model Mixtures
- Mechanisms such as bi-level adapters (Ilhan et al., 15 Oct 2025), client-specific gate/mixer networks (Bian et al., 14 Mar 2025), and mixture-of-experts or expert assignment (Hu et al., 26 Aug 2025, Zhang et al., 28 Nov 2024) further decouple local and global knowledge for fine-grained adaptation.
Emergent One-Shot Aggregation
- For large foundation models, one communication round of federated fine-tuning suffices to match multi-round convergence, due to smooth loss landscapes and small update magnitudes (Wang et al., 5 Dec 2024).
3. Parameter-Efficient Federated Fine-Tuning Techniques
Low-Rank Adaptation (LoRA) and Its Variants
- LoRA injects trainable low-rank adapters into weight matrices: $W = W_0 + BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Only the $r(d + k)$ adapter parameters are updated, enabling dramatic bandwidth and compute savings (Babakniya et al., 2023, Liu et al., 28 Dec 2024); a minimal adapter sketch follows this list.
- Advanced LoRA-based systems disentangle the role of "direction" (global knowledge; averaged) and "magnitude" (local adaptation; personalized) in adapters for structured aggregation (Zhao et al., 13 Oct 2025).
- SLoRA stagewise initialization improves convergence under severe non-IID data (Babakniya et al., 2023).
- HierFedLoRA and LEGEND allocate adapter layers and rank based on group heterogeneity and device capabilities, incorporating multi-armed bandit algorithms or resource-aware scheduling (Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025).
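For reference, a minimal LoRA layer in PyTorch (the alpha/r scaling follows the original LoRA formulation; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                            # backbone stays frozen
        d_out, d_in = base.weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))   # zero-init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Only the adapters are trainable, hence only they are communicated each round.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]  # ['lora_A', 'lora_B']
```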
Federated LoRA versus Prompt/Adapter Tuning
- Prompt tuning (learned input tokens) and small adapter layers are also widely used; these minimize communication further but can be less effective in few-shot, highly non-IID scenarios (Chen et al., 2022).
- Adapter masking and resource-adaptive depth control: Clients select which adapters/layers to activate depending on their resource and statistical context, as in FedHFT and LEGEND (Ilhan et al., 15 Oct 2025, Liu et al., 28 Dec 2024).
MoE-based Federated Fine-Tuning
- Sparse Mixture-of-Experts adapters (FFT-MoE, FLUX) generalize LoRA by enabling client-specific expert routing, adaptive per-client capacity, and heterogeneity-aware auxiliary losses for balanced expert utilization (Hu et al., 26 Aug 2025, Chen et al., 26 Aug 2025).
- Personalization is achieved via client-specific gating, dynamic expert selection, and personalized expert fusion.
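At the core of these designs is a learned gate that routes each input to a few low-rank experts. The Top-K gate below is a generic sketch; the expert count, rank, and routing details are illustrative, not the exact FFT-MoE/FLUX design:

```python
import torch
import torch.nn as nn

class TopKMoEAdapter(nn.Module):
    """Sparse mixture of low-rank adapter experts with per-example Top-K routing."""
    def __init__(self, dim=768, rank=8, num_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, rank, bias=False), nn.Linear(rank, dim, bias=False))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x):                              # x: (batch, dim)
        scores = self.gate(x)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topk_vals, dim=-1)     # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return x + out                                 # residual adapter output
```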
Representation Fine-Tuning
- FedReFT enables direct intervention at the hidden representation level via low-rank edit subspaces, with personalized aggregation (All-But-Me) for extreme parameter/comm efficiency (Siddika et al., 27 Aug 2025).
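The All-But-Me idea reduces to each client receiving an aggregate of everyone else's update; the uniform-mean sketch below omits FedReFT's geometric-median robustness step, so treat it as a simplification:

```python
def all_but_me(updates):
    """For each client i, average all other clients' updates (uniform mean;
    FedReFT additionally applies a geometric-median-style robust aggregate)."""
    n = len(updates)
    keys = updates[0].keys()
    return [
        {k: sum(u[k] for j, u in enumerate(updates) if j != i) / (n - 1) for k in keys}
        for i in range(n)
    ]
```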
Proxy/Compressed Model Fine-Tuning
- FedPFT constructs a highly compressed “sub-FM” via layer-wise saliency pruning, combined with pre-/in-FL distillation to maintain alignment of gradients with the full model (Peng et al., 17 Apr 2024).
4. Addressing Heterogeneity: Personalization, System, and Statistical
Personalized Federated Fine-Tuning
- Algorithms such as FedALT and FedAMoLE explicitly separate local ("individual") and global ("rest-of-world") model components, with dynamic mixing and assignment based on input, client clusters, or task relevancy. This mitigates harmful interference and optimizes for client-specific objectives (Bian et al., 14 Mar 2025, Zhang et al., 28 Nov 2024).
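One way to picture the individual/rest-of-world split is two parallel adapters blended by an input-dependent gate; the sigmoid gate below is an illustrative assumption rather than the exact FedALT mixer:

```python
import torch
import torch.nn as nn

class LocalGlobalAdapter(nn.Module):
    """Blend a client-private ('individual') adapter with an aggregated
    ('rest-of-world') adapter via a learned, input-dependent gate."""
    def __init__(self, dim=768, rank=8):
        super().__init__()
        def lowrank():
            return nn.Sequential(nn.Linear(dim, rank, bias=False),
                                 nn.Linear(rank, dim, bias=False))
        self.individual = lowrank()       # trained locally, never aggregated
        self.rest_of_world = lowrank()    # received from the server, frozen during local training
        for p in self.rest_of_world.parameters():
            p.requires_grad_(False)
        self.gate = nn.Linear(dim, 1)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))   # per-example mixing weight in (0, 1)
        return x + g * self.individual(x) + (1 - g) * self.rest_of_world(x)
```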
Bi-Level or Mixture Model Aggregation
- Masked adapters, mixture-of-cluster adapters, and expert selection (Ilhan et al., 15 Oct 2025, Zhang et al., 28 Nov 2024, Hu et al., 26 Aug 2025) support fine-grained adaptation by partitioning aggregation into subspaces aligned to client clusters or data modes.
Resource Adaptivity
- LEGEND and HierFedLoRA propose dynamic assignment of adapter depth/rank/grouping based on node capability and system constraints, optimizing depth–rank tradeoff for efficient convergence (Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025).
Wireless and Edge-Aware Fine-Tuning
- Practical deployments must address downlink/uplink scheduling, quantization, client selection, bandwidth allocation, and dynamic power management (Wang et al., 3 Jul 2024, Wang et al., 5 Sep 2025, Ni et al., 27 Mar 2025).
- Online learning and optimization offer tractable solutions to non-convex scheduling and resource allocation while maintaining robust convergence bounds (Wang et al., 5 Sep 2025).
5. Communication, Convergence, and Empirical Benchmarks
Communication cost is the central systems bottleneck:
- For adapter/PEFT-based FT, round cost is dominated by the (reduced) adapter size and quantization bits (Ni et al., 27 Mar 2025).
- Using LoRA plus quantization, communication per round is reduced by orders of magnitude relative to exchanging full model weights, as corroborated in practical benchmarks (Babakniya et al., 2023, Wang et al., 5 Dec 2024); see the back-of-envelope estimate after this list.
- Hierarchical and clustered strategies further amortize cost, with cluster leads forwarding only aggregate deltas.
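As a back-of-envelope illustration (model size, rank, layer count, and bit-widths below are assumptions for the sake of arithmetic, not figures from the cited papers):

```python
def comm_bytes(num_params, bits):
    return num_params * bits / 8

# Assumed example: 7B-parameter backbone vs. rank-8 LoRA on the four attention
# projections of 32 transformer layers with hidden size 4096.
full_model = comm_bytes(7e9, 16)                    # fp16 full-model exchange
lora_params = 32 * 4 * 2 * 8 * 4096                 # layers * projections * (A and B) * r * d
lora_4bit = comm_bytes(lora_params, 4)              # 4-bit quantized adapter exchange

print(f"full model  : {full_model / 1e9:.1f} GB per round")   # ~14 GB
print(f"LoRA (4-bit): {lora_4bit / 1e6:.1f} MB per round")     # ~4.2 MB
print(f"reduction   : {full_model / lora_4bit:,.0f}x")         # roughly three orders of magnitude
```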
A representative selection of empirical findings:
| Method | Acc. Gain Over Baseline | Comm. Reduction | Convergence Speedup |
|---|---|---|---|
| FedHFT | up to +2.7% | 3–122× | 2–3× faster |
| LEGEND | up to +42% | ~42% | 1.5–2.8× |
| HierFedLoRA | +1.6–4.2% | up to 2.2× | 2.1–4.3× |
| SLoRA | matches full FT | ~10–20× | up to 90% less time |
| MoE/FFT-MoE | up to +38% (severe non-IID) | case-dependent, ~2×+ | up to 12× over LoRA |
| DevFT | +1.3–3.3% | 10.7× | 4.6× |
(Hu et al., 26 Aug 2025, Yan et al., 8 Jan 2025, Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025, Babakniya et al., 2023)
One-shot federated fine-tuning achieves equivalent convergence to multi-round FL for >1B-parameter models, with lower communication (Wang et al., 5 Dec 2024).
6. Challenges, Limitations, and Emerging Directions
Statistical Heterogeneity
- Performance degrades sharply for standard FedAvg or naive LoRA+FedAvg under extreme non-IID data.
- Fine-grained aggregation (e.g., by directional/magnitude separation (Zhao et al., 13 Oct 2025), MoE (Hu et al., 26 Aug 2025), or All-But-Me (Siddika et al., 27 Aug 2025)) is essential for generalization.
System Heterogeneity
- Adaptation to mixed fleets (IoT, mobile, server-class) demands robust scheduling, adaptive algorithmic depth, and continuous adjustment of resource allocation.
Scaling to Ultra-Large Models
- Split models and "sub-FM" proxies reduce compute/memory but encounter drift and loss of fine-grained capacity. Multi-level distillation, adaptive pruning, and staged curriculum approaches (DevFT) show promise (Peng et al., 17 Apr 2024, Wu et al., 31 Jul 2025).
Multi-Modal and Cross-Device Collaboration
- Extension to vision-language and cross-modal FMs requires benchmarking pipelines (e.g., FedVLMBench), dataset diversity, and novel aggregation strategies (Zheng et al., 11 Jun 2025).
Privacy, Security, and Unlearning
- Further research is required on federated unlearning, differentially private optimizer variants, and attack resistance for token-level and adapter-level federated communication (Ni et al., 27 Mar 2025, Zhang et al., 28 Nov 2024).
Automation and Meta-Learning
- Automated hyperparameter optimization (e.g., adapter rank, depth, expert count), meta-learning for client clustering/routing, and dynamic task models remain open, as does large-scale standardization.
7. Summary Table of Key Federated Fine-Tuning Algorithmic Variants
| Variant | Local Update | Aggregation | Personalization | Heterogeneity Mitigation |
|---|---|---|---|---|
| FedAvg | Full/PEFT-SGD | Weighted Average | None | None |
| FedLoRA | LoRA adapters | FedAvg on adapters | None | — |
| HierFedLoRA | LoRA | Two-level (group & global) | Device-aware group config | Grouping, dynamic frequency |
| LEGEND | LoRA, adaptive rank | FedAvg on variable adapters | Device-adaptive load | Layer-rank assignment |
| FedHFT | Masked adapters | Mixture-of-clusters + SVD | Cluster soft-assignment | Clustering, Fisher masking |
| FFT-MoE | Sparse MoE adapters | FedAvg + expert balancing loss | Per-client Top-K gating | Input-dependent routing |
| DevFT | Layer fusion | Stagewise FedAvg | Staged knowledge transfer | Progressive curriculum |
| FedALT | Disjoint LoRAs | Per-client RoW + Indiv. mix | Input-specific gating | Separate aggregation/mixing |
| FedAMoLE | LoRA MoE modules | Shared/router/expert avg | Data-driven expert assign | Reverse expert selection |
| SFPrompt/Split-FL | Prompt + split model | Activation/grad exchange | Local prompt only | Trunk offloaded to server |
| FedPFT | Compressed sub-FM | Layer/Neuron distillation | Sub-FM personalization | 2-stage knowledge alignment |
| FedReFT | Representation edit | All-But-Me aggregation | Geometric median + blend | Semantics-aware fusion |
This table structures principal methods found in (Hu et al., 26 Aug 2025, Ilhan et al., 15 Oct 2025, Bian et al., 14 Mar 2025, Babakniya et al., 2023, Wang et al., 5 Dec 2024, Liu et al., 27 Mar 2025, Liu et al., 28 Dec 2024, Yan et al., 8 Jan 2025, Zhang et al., 28 Nov 2024, Cao et al., 24 Jul 2024, Peng et al., 17 Apr 2024, Siddika et al., 27 Aug 2025).
Federated fine-tuning is a critical enabler of private, large-scale foundation model adaptation. Modern approaches span from simple FedAvg on PEFT modules to sophisticated mixtures of experts, masked adapters, client clustering, asynchronous aggregation, dynamic scheduling, staged curricula, and representation-level optimization, reflecting the evolving landscape of distributed AI under both practical and statistical heterogeneity constraints.