
Federated Fine-Tuning

Updated 22 November 2025
  • Federated fine-tuning is a distributed adaptation technique where edge clients collaboratively fine-tune large pre-trained models locally without sharing raw data.
  • It employs methods like FedAvg and parameter-efficient tuning (e.g., LoRA) to overcome challenges posed by non-IID data and resource heterogeneity.
  • The approach prioritizes communication efficiency, privacy, and practical deployment across devices, making it key to scalable adaptation of foundation models.

Federated fine-tuning is a model adaptation paradigm that enables distributed edge clients to collaboratively tailor large pre-trained models—such as transformers, LLMs, or multimodal architectures—to downstream tasks without sharing raw data. This workflow is motivated by strict privacy requirements, resource constraints, and increasing model parameterization. Over the past several years, federated fine-tuning (or "FedFT") and its variants have become central to privacy-preserving ML, supporting deployment of foundation models across heterogeneous environments such as mobile devices, enterprises, and wireless sensor networks.

1. Methodological Foundations and Objectives

Federated fine-tuning seeks to adapt large pre-trained models on client data that remains local, typically through rounds of collaborative optimization coordinated by a centralized or hierarchical server. The general objective is

$$\min_{\theta}\; F(\theta) = \sum_{k=1}^{K} p_k\, L_k(\theta)$$

where each client $k$ with private data $D_k$ minimizes its own loss $L_k(\theta) = \frac{1}{n_k} \sum_{i=1}^{n_k} \ell(\theta; x_i^k, y_i^k)$ and $p_k$ weights client $k$ by its data size. This structure is preserved for both full-model and parameter-efficient fine-tuning schemes (e.g., LoRA).
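Under the standard data-size weighting, assuming $p_k = n_k / n$ with $n = \sum_j n_j$ (the usual FedAvg choice), the federated objective coincides with the empirical risk over the pooled data, even though that data is never centralized:

$$p_k = \frac{n_k}{n} \;\Longrightarrow\; F(\theta) = \sum_{k=1}^{K} \frac{n_k}{n}\cdot\frac{1}{n_k}\sum_{i=1}^{n_k} \ell(\theta; x_i^k, y_i^k) = \frac{1}{n}\sum_{k=1}^{K}\sum_{i=1}^{n_k} \ell(\theta; x_i^k, y_i^k).$$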

The defining constraints in federated fine-tuning are:

  • Data privacy: No raw data or feature traces are uploaded to any server.
  • Resource heterogeneity: Clients may possess widely varying compute/storage capabilities.
  • Non-IID data distributions: Client datasets may be highly skewed, impeding naive aggregation.
  • Communication efficiency: Large foundation models have billions of parameters, making direct exchange impractical.

Fine-tuning thus demands algorithms that (i) personalize global models to account for local data, (ii) are lightweight enough for resource-constrained devices, and (iii) control statistical and system heterogeneity (Ni et al., 27 Mar 2025, Liu et al., 28 Dec 2024, Yan et al., 8 Jan 2025).

2. Core Algorithms and System Architectures

A variety of communication and aggregation schemas have been proposed:

Standard (FedAvg-based) Federated Fine-Tuning

  • Each round $t$: the server broadcasts the current model $\theta^t$, clients run local optimization (typically SGD or Adam) for $E$ epochs, and send parameter/model deltas $\Delta\theta_k$ to the server.
  • The server aggregates the deltas, usually via a weighted average; a minimal sketch of one round follows this list.
  • Limitation: cross-client interference under data heterogeneity causes model drift and slow convergence.
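A minimal sketch of one such round, assuming clients expose a hypothetical `local_train(params, epochs)` routine that returns updated parameters and the local sample count (names and structure here are illustrative, not from any cited system):

```python
import numpy as np

def fedavg_round(global_params, clients, local_epochs=1):
    """One FedAvg-style round: broadcast, local training, weighted averaging.

    global_params : dict[str, np.ndarray] of model parameters
    clients       : objects with .local_train(params, epochs) returning
                    (updated_params, num_local_samples)
    """
    updates, counts = [], []
    for client in clients:
        # Broadcast the current global model and run E local epochs.
        local_params, n_k = client.local_train(
            {name: w.copy() for name, w in global_params.items()},
            epochs=local_epochs,
        )
        updates.append(local_params)
        counts.append(n_k)

    # Aggregate: weighted average with p_k = n_k / sum_j n_j.
    total = float(sum(counts))
    return {
        name: sum((n_k / total) * upd[name] for upd, n_k in zip(updates, counts))
        for name in global_params
    }
```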

Parameter-Efficient Federated Fine-Tuning (PEFT)

  • Only a small set of added or selected parameters (e.g., low-rank adapters, prompts, or bias terms) is trained and exchanged while the pre-trained backbone stays frozen; Section 3 surveys these techniques in detail.

Hierarchical and Clustered Aggregations

  • Devices are grouped by similarity/data statistics. Intra-group averaging reduces local non-IID effects; group heads then synchronize at coarser intervals (Ni et al., 27 Mar 2025, Liu et al., 27 Mar 2025).
  • Multi-level tree topologies accommodate network tiers: device → edge aggregator → central node.

Asynchronous Federated Fine-Tuning

  • Clients operate on arbitrary schedules, with server-side "staleness-aware" aggregation rules (e.g., weighting updates by their age) (Ni et al., 27 Mar 2025); a toy staleness-discount rule is sketched below.
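A common family of staleness discounts scales an update's weight by its age before mixing it into the server model; the polynomial decay below is an illustrative choice, not the rule of any specific cited system:

```python
def staleness_weight(current_round, update_round, alpha=0.5):
    """Discount for an update computed at `update_round` and arriving at
    `current_round`; fresher updates get a weight closer to 1."""
    staleness = max(current_round - update_round, 0)
    return 1.0 / (1.0 + staleness) ** alpha

def apply_async_update(server_params, client_params, current_round, update_round):
    """Server-side mixing rule: new = (1 - w) * server + w * client."""
    w = staleness_weight(current_round, update_round)
    return {
        name: (1.0 - w) * server_params[name] + w * client_params[name]
        for name in server_params
    }
```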

Split Federated Fine-Tuning

  • The model is partitioned at a cut layer: clients train only a lightweight front portion (e.g., prompts or early layers) while the heavy trunk is offloaded to the server, and the two sides exchange intermediate activations and gradients instead of full parameter sets.

Personalization and Model Mixtures

  • Each client maintains or mixes a personalized component alongside the globally shared model (e.g., separate local and global adapters with learned mixing); Section 4 discusses these designs in detail.

Emergent One-Shot Aggregation

  • For large foundation models, one communication round of federated fine-tuning suffices to match multi-round convergence, due to smooth loss landscapes and small update magnitudes (Wang et al., 5 Dec 2024).

3. Parameter-Efficient Federated Fine-Tuning Techniques

Low-Rank Adaptation (LoRA) and Its Variants

  • LoRA injects trainable low-rank adapters $(A, B)$ into weight matrices, $W = W_0 + BA$, so that only $O(r(d+k))$ parameters are updated, enabling dramatic bandwidth and compute savings (Babakniya et al., 2023, Liu et al., 28 Dec 2024); a minimal layer sketch follows this list.
  • Advanced LoRA-based systems disentangle the role of "direction" (global knowledge; averaged) and "magnitude" (local adaptation; personalized) in adapters for structured aggregation (Zhao et al., 13 Oct 2025).
  • SLoRA stagewise initialization improves convergence under severe non-IID data (Babakniya et al., 2023).
  • HierFedLoRA and LEGEND allocate adapter layers and rank based on group heterogeneity and device capabilities, incorporating multi-armed bandit algorithms or resource-aware scheduling (Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025).
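A minimal LoRA layer sketch, assuming a frozen base `nn.Linear` and trainable low-rank factors `A` and `B` (the class and hyperparameters below are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # W0 (and bias) stay frozen
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # d_out x r
        self.scaling = alpha / rank

    def forward(self, x):
        # Effective weight is W0 + scaling * B @ A; only A and B are trained
        # and communicated, i.e. O(r * (d_in + d_out)) parameters per layer.
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T
```

In a federated round, only `A` and `B` would be uploaded and aggregated, which is what drives the communication savings.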

Federated LoRA versus Prompt/Adapter Tuning

  • Prompt tuning (learned input tokens) and lightweight adapter layers are also widely used; these reduce communication further but can be less effective in few-shot, highly non-IID scenarios (Chen et al., 2022).
  • Adapter masking and resource-adaptive depth control: Clients select which adapters/layers to activate depending on their resource and statistical context, as in FedHFT and LEGEND (Ilhan et al., 15 Oct 2025, Liu et al., 28 Dec 2024).

MoE-based Federated Fine-Tuning

  • Sparse Mixture-of-Experts adapters (FFT-MoE, FLUX) generalize LoRA by enabling client-specific expert routing, adaptive per-client capacity, and heterogeneity-aware auxiliary losses for balanced expert utilization (Hu et al., 26 Aug 2025, Chen et al., 26 Aug 2025).
  • Personalization is achieved via client-specific gating, dynamic expert selection, and personalized expert fusion; a toy top-k routing sketch follows this list.
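A toy sparse-MoE adapter with per-input top-k gating, meant only to illustrate the routing pattern; the expert structure, gating, and load-balancing losses in FFT-MoE and FLUX differ in their details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEAdapter(nn.Module):
    """Mixture of small adapter experts with top-k routing per input."""

    def __init__(self, d_model: int, num_experts: int = 4, rank: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, rank), nn.ReLU(), nn.Linear(rank, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (batch, d_model)
        scores = self.gate(x)                    # (batch, num_experts)
        topv, topi = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topv, dim=-1)        # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e        # rows routed to expert e at this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return x + out                           # residual adapter output
```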

Representation Fine-Tuning

  • FedReFT enables direct intervention at the hidden-representation level via low-rank edit subspaces, with personalized aggregation (All-But-Me) for extreme parameter and communication efficiency (Siddika et al., 27 Aug 2025).

Proxy/Compressed Model Fine-Tuning

  • FedPFT constructs a highly compressed “sub-FM” via layer-wise saliency pruning, combined with pre-/in-FL distillation to maintain alignment of gradients with the full model (Peng et al., 17 Apr 2024).

4. Addressing Heterogeneity: Personalization, System, and Statistical

Personalized Federated Fine-Tuning

  • Algorithms such as FedALT and FedAMoLE explicitly separate local ("individual") and global ("rest-of-world") model components, with dynamic mixing and assignment based on the input, client clusters, or task relevancy. This mitigates harmful interference and optimizes for client-specific objectives (Bian et al., 14 Mar 2025, Zhang et al., 28 Nov 2024); a generic local/global mixing sketch follows this list.
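A generic sketch of the local/global separation idea: a frozen base layer, a globally aggregated adapter, a client-private adapter, and an input-dependent gate. It illustrates the pattern only and is not the exact FedALT or FedAMoLE update rule:

```python
import torch
import torch.nn as nn

class MixedAdapterLinear(nn.Module):
    """Frozen base layer plus a globally averaged adapter ("rest-of-world")
    and a purely local adapter, blended by an input-dependent gate."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.global_adapter = nn.Linear(d_in, d_out, bias=False)  # aggregated each round
        self.local_adapter = nn.Linear(d_in, d_out, bias=False)   # never leaves the client
        self.gate = nn.Linear(d_in, 1)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))          # per-input mixing coefficient
        delta = g * self.local_adapter(x) + (1 - g) * self.global_adapter(x)
        return self.base(x) + delta
```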

Bi-Level or Mixture Model Aggregation

  • Aggregation can operate at two levels, averaging globally shared components while cluster- or client-level components (e.g., expert or adapter mixtures) are combined separately, as in the mixture-of-clusters and expert-aggregation schemes above.

Resource Adaptivity

  • LEGEND and HierFedLoRA propose dynamic assignment of adapter depth/rank/grouping based on node capability and system constraints, optimizing depth–rank tradeoff for efficient convergence (Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025).

Wireless and Edge-Aware Fine-Tuning

  • Deployments over wireless and edge networks must additionally budget for bandwidth, latency, and device energy when selecting participants and scheduling update transmission.

5. Communication, Convergence, and Empirical Benchmarks

Communication cost is the central systems bottleneck:

  • For adapter/PEFT-based fine-tuning, the per-round communication cost $C_{\text{round}} = K \times S \times b$ is dominated by the (reduced) adapter size $S$ and quantization bit-width $b$, with $K$ participating clients (Ni et al., 27 Mar 2025).
  • Using LoRA plus quantization, communication per round is reduced by orders of magnitude (e.g., $O(10^6)$ parameters vs. $O(10^9)$), as corroborated in practical benchmarks (Babakniya et al., 2023, Wang et al., 5 Dec 2024); a back-of-the-envelope comparison follows this list.
  • Hierarchical and clustered strategies further amortize cost, with cluster leads forwarding only aggregate deltas.
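A back-of-the-envelope comparison of per-round upload volume for full-model versus LoRA-style exchange; the client count, parameter counts, and bit-widths below are illustrative assumptions, not measurements from the cited papers:

```python
def round_cost_bits(num_clients, params_per_client, bits_per_param):
    """C_round = K * S * b, in bits uploaded per round."""
    return num_clients * params_per_client * bits_per_param

K = 100                        # participating clients per round (assumed)
full_model = 7_000_000_000     # ~7B parameters if the full model were exchanged
lora_adapters = 4_000_000      # ~4M adapter parameters (rank- and layer-dependent)

full_gb = round_cost_bits(K, full_model, 16) / 8 / 1e9    # 16-bit full weights
lora_gb = round_cost_bits(K, lora_adapters, 8) / 8 / 1e9  # 8-bit quantized adapters

print(f"full-model round : {full_gb:,.1f} GB uploaded")
print(f"LoRA round       : {lora_gb:,.1f} GB uploaded "
      f"(~{full_gb / lora_gb:,.0f}x reduction)")
```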

A representative selection of empirical findings:

| Method | Acc. Gain Over Baseline | Comm. Reduction | Convergence Speedup |
|---|---|---|---|
| FedHFT | up to +2.7% | 3–122× | 2–3× faster |
| LEGEND | up to +42% | ~42% | 1.5–2.8× |
| HierFedLoRA | +1.6–4.2% | up to 2.2× | 2.1–4.3× |
| SLoRA | matches full FT | ~10–20× | up to 90% less time |
| MoE/FFT-MoE | up to +38% (severe non-IID) | case-dependent, ~2×+ | up to 12× over LoRA |
| DevFT | +1.3–3.3% | 10.7× | 4.6× |

(Hu et al., 26 Aug 2025, Yan et al., 8 Jan 2025, Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025, Babakniya et al., 2023)

One-shot federated fine-tuning achieves convergence equivalent to multi-round FL for >1B-parameter models, with $T\times$ lower communication, where $T$ is the number of rounds a multi-round schedule would require (Wang et al., 5 Dec 2024).

6. Challenges, Limitations, and Emerging Directions

Statistical Heterogeneity

  • Non-IID client distributions remain the dominant source of accuracy loss and model drift, motivating the clustering, personalization, and staged-initialization schemes surveyed above.

System Heterogeneity

  • Adaptation to mixed fleets (IoT, mobile, server-class) demands robust scheduling, adaptive algorithmic depth, and continuous adjustment of resource allocation.

Scaling to Ultra-Large Models

  • Split models and "sub-FM" proxies reduce compute/memory but encounter drift and loss of fine-grained capacity. Multi-level distillation, adaptive pruning, and staged curriculum approaches (DevFT) show promise (Peng et al., 17 Apr 2024, Wu et al., 31 Jul 2025).

Multi-Modal and Cross-Device Collaboration

  • Extension to vision-language and cross-modal FMs requires benchmarking pipelines (e.g., FedVLMBench), dataset diversity, and novel aggregation strategies (Zheng et al., 11 Jun 2025).

Privacy, Security, and Unlearning

  • Keeping raw data local does not by itself prevent leakage through shared updates; robust aggregation, formal privacy guarantees, and efficient removal of a client's contribution (federated unlearning) remain open problems.

Automation and Meta-Learning

  • Automated hyperparameter optimization (e.g., adapter rank, depth, expert count), meta-learning for client clustering/routing, and dynamic task models remain open, as does large-scale standardization.

7. Summary Table of Key Federated Fine-Tuning Algorithmic Variants

| Variant | Local Update | Aggregation | Personalization | Heterogeneity Mitigation |
|---|---|---|---|---|
| FedAvg | Full/PEFT-SGD | Weighted average | None | None |
| FedLoRA | LoRA adapters | FedAvg on adapters | None | None |
| HierFedLoRA | LoRA | Two-level (group & global) | Device-aware group config | Grouping, dynamic frequency |
| LEGEND | LoRA, adaptive rank | FedAvg on variable adapters | Device-adaptive load | Layer-rank assignment |
| FedHFT | Masked adapters | Mixture-of-clusters + SVD | Cluster soft-assignment | Clustering, Fisher masking |
| FFT-MoE | Sparse MoE adapters | FedAvg + expert balancing loss | Per-client top-k gating | Input-dependent routing |
| DevFT | Layer fusion | Stagewise FedAvg | Staged knowledge transfer | Progressive curriculum |
| FedALT | Disjoint LoRAs | Per-client RoW + individual mix | Input-specific gating | Separate aggregation/mixing |
| FedAMoLE | LoRA MoE modules | Shared/router/expert averaging | Data-driven expert assignment | Reverse expert selection |
| SFPrompt/Split-FL | Prompt + split model | Activation/gradient exchange | Local prompt only | Trunk offloaded to server |
| FedPFT | Compressed sub-FM | Layer/neuron distillation | Sub-FM personalization | Two-stage knowledge alignment |
| FedReFT | Representation edits | All-But-Me aggregation | Geometric median + blend | Semantics-aware fusion |

This table structures principal methods found in (Hu et al., 26 Aug 2025, Ilhan et al., 15 Oct 2025, Bian et al., 14 Mar 2025, Babakniya et al., 2023, Wang et al., 5 Dec 2024, Liu et al., 27 Mar 2025, Liu et al., 28 Dec 2024, Yan et al., 8 Jan 2025, Zhang et al., 28 Nov 2024, Cao et al., 24 Jul 2024, Peng et al., 17 Apr 2024, Siddika et al., 27 Aug 2025).


Federated fine-tuning is a critical enabler of private, large-scale foundation model adaptation. Modern approaches span from simple FedAvg on PEFT modules to sophisticated mixtures of experts, masked adapters, client clustering, asynchronous aggregation, dynamic scheduling, staged curricula, and representation-level optimization, reflecting the evolving landscape of distributed AI under both practical and statistical heterogeneity constraints.
