Federated Fine-Tuning
- Federated fine-tuning is a distributed adaptation technique where edge clients collaboratively fine-tune large pre-trained models locally without sharing raw data.
- It employs methods like FedAvg and parameter-efficient tuning (e.g., LoRA) to overcome challenges posed by non-IID data and resource heterogeneity.
- The approach prioritizes communication efficiency, privacy, and practical deployment across devices, making it key to scalable foundation-model adaptation.
Federated fine-tuning is a model adaptation paradigm that enables distributed edge clients to collaboratively tailor large pre-trained models—such as transformers, LLMs, or multimodal architectures—to downstream tasks without sharing raw data. This workflow is motivated by strict privacy requirements, resource constraints, and increasing model parameterization. Over the past several years, federated fine-tuning (or "FedFT") and its variants have become central to privacy-preserving ML, supporting deployment of foundation models across heterogeneous environments such as mobile devices, enterprises, and wireless sensor networks.
1. Methodological Foundations and Objectives
Federated fine-tuning seeks to adapt large pre-trained models on client data that remains local, typically through rounds of collaborative optimization coordinated by a centralized or hierarchical server. The general objective is

$$\min_{w} \; F(w) = \sum_{k=1}^{K} \frac{n_k}{n}\, F_k(w), \qquad F_k(w) = \frac{1}{n_k} \sum_{(x, y) \in \mathcal{D}_k} \ell(w; x, y),$$

where each client $k$ with private dataset $\mathcal{D}_k$ of size $n_k$ (and $n = \sum_k n_k$) minimizes its own local loss $F_k$, and clients are weighted by data size. This structure is preserved for both full-model and parameter-efficient fine-tuning schemes (e.g., LoRA).
The defining constraints in federated fine-tuning are:
- Data privacy: No raw data or feature traces are uploaded to any server.
- Resource heterogeneity: Clients may possess widely varying compute/storage capabilities.
- Non-IID data distributions: Client datasets may be highly skewed, impeding naive aggregation.
- Communication efficiency: Large foundation models have billions of parameters, making direct exchange impractical.
Fine-tuning thus demands algorithms that (i) personalize global models to account for local data, (ii) are lightweight enough for resource-constrained devices, and (iii) cope with both statistical and system heterogeneity (Ni et al., 27 Mar 2025, Liu et al., 28 Dec 2024, Yan et al., 8 Jan 2025).
2. Core Algorithms and System Architectures
A variety of communication and aggregation schemes have been proposed:
Standard (FedAvg-based) Federated Fine-Tuning
- Each round $t$: the server broadcasts the current model $w^t$; clients perform local optimization (typically SGD or Adam) for $E$ epochs and send parameter/model deltas $\Delta_k^t$ to the server.
- The server aggregates the deltas, usually as a data-size-weighted average, $w^{t+1} = w^t + \sum_k \frac{n_k}{n} \Delta_k^t$ (a minimal sketch follows this list).
- Limitation: Cross-client interference under data heterogeneity causes model drift and slow convergence.
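As a concrete reference point, the server-side step can be sketched as follows (plain NumPy; the dict-of-arrays representation and names are illustrative assumptions, not taken from any cited system):

```python
import numpy as np

def fedavg_aggregate(global_params, client_deltas, client_sizes):
    """One FedAvg update: w^{t+1} = w^t + sum_k (n_k / n) * delta_k.

    global_params : dict[str, np.ndarray]        current global model w^t
    client_deltas : list[dict[str, np.ndarray]]  per-client parameter deltas
    client_sizes  : list[int]                    local dataset sizes n_k
    """
    total = float(sum(client_sizes))
    return {
        name: value + sum((n_k / total) * delta[name]
                          for delta, n_k in zip(client_deltas, client_sizes))
        for name, value in global_params.items()
    }
```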
Parameter-Efficient Federated Fine-Tuning (PEFT)
- Federated LoRA: Only low-rank matrices (adapters) are trainable. FedAvg aggregation is performed on the LoRA blocks, while the main backbone remains frozen (Babakniya et al., 2023, Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025, Zhao et al., 13 Oct 2025).
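Concretely, each client uploads only its adapter tensors and the server averages just those entries; a hedged sketch follows (the `"lora_"` key convention mirrors common PEFT naming and is an assumption, not a requirement of the cited methods):

```python
def extract_lora_update(state_dict):
    """Keep only trainable LoRA tensors; the frozen backbone never leaves the client."""
    return {k: v for k, v in state_dict.items() if "lora_" in k}

def aggregate_lora(adapter_updates, client_sizes):
    """FedAvg restricted to the adapter subspace (works for NumPy arrays or torch tensors)."""
    total = float(sum(client_sizes))
    keys = adapter_updates[0].keys()
    return {
        k: sum((n / total) * upd[k] for upd, n in zip(adapter_updates, client_sizes))
        for k in keys
    }
```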
Hierarchical and Clustered Aggregations
- Devices are grouped by similarity/data statistics. Intra-group averaging reduces local non-IID effects; group heads then synchronize at coarser intervals (Ni et al., 27 Mar 2025, Liu et al., 27 Mar 2025).
- Multi-level tree topologies accommodate network tiers: device → edge aggregator → central node.
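The same weighted average can simply be applied once per tier; a schematic sketch (group membership is assumed to be given by the clustering step):

```python
def weighted_average(updates, sizes):
    total = float(sum(sizes))
    keys = updates[0].keys()
    return {k: sum((n / total) * u[k] for u, n in zip(updates, sizes)) for k in keys}

def hierarchical_aggregate(group_updates, group_sizes):
    """Two-level aggregation: average within each group, then across group heads,
    weighting each group by its total data volume."""
    group_models = [weighted_average(u, s) for u, s in zip(group_updates, group_sizes)]
    group_totals = [sum(s) for s in group_sizes]
    return weighted_average(group_models, group_totals)
```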
Asynchronous Federated Fine-Tuning
- Clients operate on arbitrary schedules, with server-side "staleness-aware" aggregation rules (e.g., weighting updates by age) (Ni et al., 27 Mar 2025).
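The exact staleness rule differs across systems; the polynomial age discount below is a generic illustration, not the specific rule from the cited survey:

```python
def staleness_weight(age_in_rounds, alpha=0.5):
    """Down-weight an update that arrives `age_in_rounds` rounds late."""
    return (1.0 + age_in_rounds) ** (-alpha)

def async_apply(global_params, delta, age_in_rounds, server_lr=1.0):
    """Apply a single stale client delta to the global model as soon as it arrives."""
    w = server_lr * staleness_weight(age_in_rounds)
    return {k: v + w * delta[k] for k, v in global_params.items()}
```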
Split Federated Fine-Tuning
- Model is partitioned, e.g., embedding/trunk at client and encoder/head at server. This reduces local memory/computation and bandwidth (only intermediate activations/gradients exchanged) (Cao et al., 24 Jul 2024, Wang et al., 3 Jul 2024, Yan et al., 8 Jan 2025).
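Mechanically, a split round stitches two autograd graphs together by exchanging activations (downstream) and their gradients (upstream). The PyTorch sketch below uses an illustrative toy partition and layer sizes, not the exact split of the cited systems:

```python
import torch
import torch.nn as nn

# Hypothetical partition: client keeps the embedding trunk, server keeps encoder + head.
client_part = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(start_dim=1))
server_part = nn.Sequential(nn.Linear(16 * 64, 128), nn.ReLU(), nn.Linear(128, 10))
client_opt = torch.optim.SGD(client_part.parameters(), lr=0.1)
server_opt = torch.optim.SGD(server_part.parameters(), lr=0.1)

def split_step(tokens, labels):
    """One forward/backward pass split across client and server."""
    acts = client_part(tokens)                   # client forward
    sent = acts.detach().requires_grad_(True)    # only activations cross the network

    server_opt.zero_grad()
    loss = nn.functional.cross_entropy(server_part(sent), labels)
    loss.backward()                              # server backward on its half
    server_opt.step()

    client_opt.zero_grad()
    acts.backward(sent.grad)                     # activation gradient returned to client
    client_opt.step()
    return loss.item()

# tokens: (batch, 16) token ids; labels: (batch,) class ids
loss = split_step(torch.randint(0, 1000, (4, 16)), torch.randint(0, 10, (4,)))
```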
Personalization and Model Mixtures
- Mechanisms such as bi-level adapters (Ilhan et al., 15 Oct 2025), client-specific gate/mixer networks (Bian et al., 14 Mar 2025), and mixture-of-experts or expert assignment (Hu et al., 26 Aug 2025, Zhang et al., 28 Nov 2024) further decouple local and global knowledge for fine-grained adaptation.
Emergent One-Shot Aggregation
- For large foundation models, one communication round of federated fine-tuning suffices to match multi-round convergence, due to smooth loss landscapes and small update magnitudes (Wang et al., 5 Dec 2024).
3. Parameter-Efficient Federated Fine-Tuning Techniques
Low-Rank Adaptation (LoRA) and Its Variants
- LoRA injects trainable low-rank adapters into weight matrices: $W = W_0 + BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Only the $r(d + k)$ adapter parameters are updated, enabling dramatic bandwidth and compute savings (Babakniya et al., 2023, Liu et al., 28 Dec 2024); a minimal adapter sketch follows this list.
- Advanced LoRA-based systems disentangle the role of "direction" (global knowledge; averaged) and "magnitude" (local adaptation; personalized) in adapters for structured aggregation (Zhao et al., 13 Oct 2025).
- SLoRA stagewise initialization improves convergence under severe non-IID data (Babakniya et al., 2023).
- HierFedLoRA and LEGEND allocate adapter layers and rank based on group heterogeneity and device capabilities, incorporating multi-armed bandit algorithms or resource-aware scheduling (Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025).
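For reference, a minimal LoRA layer in PyTorch (the alpha/r scaling follows the original LoRA formulation; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                            # backbone stays frozen
        d_out, d_in = base.weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))   # zero-init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Only the adapters are trainable, hence only they are communicated each round.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]  # ['lora_A', 'lora_B']
```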
Federated LoRA versus Prompt/Adapter Tuning
- Prompt tuning (learned input tokens) and small adapter layers are also widely used; these minimize communication further but can be less effective in few-shot, highly non-IID scenarios (Chen et al., 2022).
- Adapter masking and resource-adaptive depth control: Clients select which adapters/layers to activate depending on their resource and statistical context, as in FedHFT and LEGEND (Ilhan et al., 15 Oct 2025, Liu et al., 28 Dec 2024).
MoE-based Federated Fine-Tuning
- Sparse Mixture-of-Experts adapters (FFT-MoE, FLUX) generalize LoRA by enabling client-specific expert routing, adaptive per-client capacity, and heterogeneity-aware auxiliary losses for balanced expert utilization (Hu et al., 26 Aug 2025, Chen et al., 26 Aug 2025).
- Personalization is achieved via client-specific gating, dynamic expert selection, and personalized expert fusion.
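At the core of these designs is a learned gate that routes each input to a few low-rank experts. The Top-K gate below is a generic sketch; the expert count, rank, and routing details are illustrative, not the exact FFT-MoE/FLUX design:

```python
import torch
import torch.nn as nn

class TopKMoEAdapter(nn.Module):
    """Sparse mixture of low-rank adapter experts with per-example Top-K routing."""
    def __init__(self, dim=768, rank=8, num_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, rank, bias=False), nn.Linear(rank, dim, bias=False))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x):                              # x: (batch, dim)
        scores = self.gate(x)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topk_vals, dim=-1)     # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return x + out                                 # residual adapter output
```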
Representation Fine-Tuning
- FedReFT enables direct intervention at the hidden representation level via low-rank edit subspaces, with personalized aggregation (All-But-Me) for extreme parameter/comm efficiency (Siddika et al., 27 Aug 2025).
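The All-But-Me idea reduces to each client receiving an aggregate of everyone else's update; the uniform-mean sketch below omits FedReFT's geometric-median robustness step, so treat it as a simplification:

```python
def all_but_me(updates):
    """For each client i, average all other clients' updates (uniform mean;
    FedReFT additionally applies a geometric-median-style robust aggregate)."""
    n = len(updates)
    keys = updates[0].keys()
    return [
        {k: sum(u[k] for j, u in enumerate(updates) if j != i) / (n - 1) for k in keys}
        for i in range(n)
    ]
```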
Proxy/Compressed Model Fine-Tuning
- FedPFT constructs a highly compressed “sub-FM” via layer-wise saliency pruning, combined with pre-/in-FL distillation to maintain alignment of gradients with the full model (Peng et al., 17 Apr 2024).
4. Addressing Heterogeneity: Personalization, System, and Statistical
Personalized Federated Fine-Tuning
- Algorithms such as FedALT and FedAMoLE explicitly separate local ("individual") and global ("rest-of-world") model components, with dynamic mixing and assignment based on input, client clusters, or task relevancy. This mitigates harmful interference and optimizes for client-specific objectives (Bian et al., 14 Mar 2025, Zhang et al., 28 Nov 2024).
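One way to picture the individual/rest-of-world split is two parallel adapters blended by an input-dependent gate; the sigmoid gate below is an illustrative assumption rather than the exact FedALT mixer:

```python
import torch
import torch.nn as nn

class LocalGlobalAdapter(nn.Module):
    """Blend a client-private ('individual') adapter with an aggregated
    ('rest-of-world') adapter via a learned, input-dependent gate."""
    def __init__(self, dim=768, rank=8):
        super().__init__()
        def lowrank():
            return nn.Sequential(nn.Linear(dim, rank, bias=False),
                                 nn.Linear(rank, dim, bias=False))
        self.individual = lowrank()       # trained locally, never aggregated
        self.rest_of_world = lowrank()    # received from the server, frozen during local training
        for p in self.rest_of_world.parameters():
            p.requires_grad_(False)
        self.gate = nn.Linear(dim, 1)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))   # per-example mixing weight in (0, 1)
        return x + g * self.individual(x) + (1 - g) * self.rest_of_world(x)
```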
Bi-Level or Mixture Model Aggregation
- Masked adapters, mixture-of-cluster adapters, and expert selection (Ilhan et al., 15 Oct 2025, Zhang et al., 28 Nov 2024, Hu et al., 26 Aug 2025) support fine-grained adaptation by partitioning aggregation into subspaces aligned to client clusters or data modes.
Resource Adaptivity
- LEGEND and HierFedLoRA propose dynamic assignment of adapter depth/rank/grouping based on node capability and system constraints, optimizing depth–rank tradeoff for efficient convergence (Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025).
Wireless and Edge-Aware Fine-Tuning
- Practical deployments must address downlink/uplink scheduling, quantization, client selection, bandwidth allocation, and dynamic power management (Wang et al., 3 Jul 2024, Wang et al., 5 Sep 2025, Ni et al., 27 Mar 2025).
- Online learning and optimization offer tractable solutions to non-convex scheduling and resource allocation while maintaining robust convergence bounds (Wang et al., 5 Sep 2025).
5. Communication, Convergence, and Empirical Benchmarks
Communication cost is the central systems bottleneck:
- For adapter/PEFT-based FT, round cost is dominated by the (reduced) adapter size and quantization bits (Ni et al., 27 Mar 2025).
- Using LoRA plus quantization, communication per round is reduced by orders of magnitude relative to exchanging full model weights, as corroborated in practical benchmarks (Babakniya et al., 2023, Wang et al., 5 Dec 2024); see the back-of-envelope estimate after this list.
- Hierarchical and clustered strategies further amortize cost, with cluster leads forwarding only aggregate deltas.
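As a back-of-envelope illustration (model size, rank, layer count, and bit-widths below are assumptions for the sake of arithmetic, not figures from the cited papers):

```python
def comm_bytes(num_params, bits):
    return num_params * bits / 8

# Assumed example: 7B-parameter backbone vs. rank-8 LoRA on the four attention
# projections of 32 transformer layers with hidden size 4096.
full_model = comm_bytes(7e9, 16)                    # fp16 full-model exchange
lora_params = 32 * 4 * 2 * 8 * 4096                 # layers * projections * (A and B) * r * d
lora_4bit = comm_bytes(lora_params, 4)              # 4-bit quantized adapter exchange

print(f"full model  : {full_model / 1e9:.1f} GB per round")   # ~14 GB
print(f"LoRA (4-bit): {lora_4bit / 1e6:.1f} MB per round")     # ~4.2 MB
print(f"reduction   : {full_model / lora_4bit:,.0f}x")         # roughly three orders of magnitude
```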
A representative selection of empirical findings:
| Method | Acc. Gain Over Baseline | Comm. Reduction | Convergence Speedup |
|---|---|---|---|
| FedHFT | up to +2.7% | 3–122× | 2–3× faster |
| LEGEND | up to +42% | ~42% | 1.5–2.8× |
| HierFedLoRA | +1.6–4.2% | up to 2.2× | 2.1–4.3× |
| SLoRA | matches full FT | ~10–20× | up to 90% less time |
| MoE/FFT-MoE | up to +38% (severe non-IID) | case-dependent, ~2×+ | up to 12× over LoRA |
| DevFT | +1.3–3.3% | 10.7× | 4.6× |
(Hu et al., 26 Aug 2025, Yan et al., 8 Jan 2025, Liu et al., 28 Dec 2024, Liu et al., 27 Mar 2025, Babakniya et al., 2023)
One-shot federated fine-tuning achieves equivalent convergence to multi-round FL for >1B-parameter models, with lower communication (Wang et al., 5 Dec 2024).
6. Challenges, Limitations, and Emerging Directions
Statistical Heterogeneity
- Performance degrades sharply for standard FedAvg or naive LoRA+FedAvg under extreme non-IID data.
- Fine-grained aggregation (e.g., by directional/magnitude separation (Zhao et al., 13 Oct 2025), MoE (Hu et al., 26 Aug 2025), or All-But-Me (Siddika et al., 27 Aug 2025)) is essential for generalization.
System Heterogeneity
- Adaptation to mixed fleets (IoT, mobile, server-class) demands robust scheduling, adaptive algorithmic depth, and continuous adjustment of resource allocation.
Scaling to Ultra-Large Models
- Split models and "sub-FM" proxies reduce compute/memory but encounter drift and loss of fine-grained capacity. Multi-level distillation, adaptive pruning, and staged curriculum approaches (DevFT) show promise (Peng et al., 17 Apr 2024, Wu et al., 31 Jul 2025).
Multi-Modal and Cross-Device Collaboration
- Extension to vision-language and cross-modal FMs requires benchmarking pipelines (e.g., FedVLMBench), dataset diversity, and novel aggregation strategies (Zheng et al., 11 Jun 2025).
Privacy, Security, and Unlearning
- Further research is required on federated unlearning, differentially private optimizer variants, and attack resistance for token-level and adapter-level federated communication (Ni et al., 27 Mar 2025, Zhang et al., 28 Nov 2024).
Automation and Meta-Learning
- Automated hyperparameter optimization (e.g., adapter rank, depth, expert count), meta-learning for client clustering/routing, and dynamic task models remain open, as does large-scale standardization.
7. Summary Table of Key Federated Fine-Tuning Algorithmic Variants
| Variant | Local Update | Aggregation | Personalization | Heterogeneity Mitigation |
|---|---|---|---|---|
| FedAvg | Full/PEFT-SGD | Weighted Average | None | None |
| FedLoRA | LoRA adapters | FedAvg on adapters | None | — |
| HierFedLoRA | LoRA | Two-level (group & global) | Device-aware group config | Grouping, dynamic frequency |
| LEGEND | LoRA, adaptive rank | FedAvg on variable adapters | Device-adaptive load | Layer-rank assignment |
| FedHFT | Masked adapters | Mixture-of-clusters + SVD | Cluster soft-assignment | Clustering, Fisher masking |
| FFT-MoE | Sparse MoE adapters | FedAvg + expert balancing loss | Per-client Top-K gating | Input-dependent routing |
| DevFT | Layer fusion | Stagewise FedAvg | Staged knowledge transfer | Progressive curriculum |
| FedALT | Disjoint LoRAs | Per-client RoW + Indiv. mix | Input-specific gating | Separate aggregation/mixing |
| FedAMoLE | LoRA MoE modules | Shared/router/expert avg | Data-driven expert assign | Reverse expert selection |
| SFPrompt/Split-FL | Prompt + split model | Activation/grad exchange | Local prompt only | Trunk offloaded to server |
| FedPFT | Compressed sub-FM | Layer/Neuron distillation | Sub-FM personalization | 2-stage knowledge alignment |
| FedReFT | Representation edit | All-But-Me aggregation | Geometric median + blend | Semantics-aware fusion |
This table structures principal methods found in (Hu et al., 26 Aug 2025, Ilhan et al., 15 Oct 2025, Bian et al., 14 Mar 2025, Babakniya et al., 2023, Wang et al., 5 Dec 2024, Liu et al., 27 Mar 2025, Liu et al., 28 Dec 2024, Yan et al., 8 Jan 2025, Zhang et al., 28 Nov 2024, Cao et al., 24 Jul 2024, Peng et al., 17 Apr 2024, Siddika et al., 27 Aug 2025).
Federated fine-tuning is a critical enabler of private, large-scale foundation model adaptation. Modern approaches span from simple FedAvg on PEFT modules to sophisticated mixtures of experts, masked adapters, client clustering, asynchronous aggregation, dynamic scheduling, staged curricula, and representation-level optimization, reflecting the evolving landscape of distributed AI under both practical and statistical heterogeneity constraints.