- The paper introduces FLAME, which leverages Sparse Mixture-of-Experts to achieve true resource adaptivity by reducing per-client FLOPs without degrading model capacity.
- It pairs this with a learnable rescaling mechanism and activation-aware aggregation to align outputs and stabilize federated updates across heterogeneous clients.
- Empirical results demonstrate that FLAME consistently outperforms traditional LoRA-based methods, especially in low-resource regimes and scenarios with client sampling variations.
FLAME: Federated Fine-Tuning of LLMs via Adaptive Sparse Mixture-of-Experts
FLAME introduces a federated learning framework for fine-tuning LLMs that overcomes key limitations of existing resource-adaptive LoRA-based federated algorithms. The framework leverages a Sparse Mixture-of-Experts (SMoE) architecture and introduces mechanisms for adapting client computation, addressing output heterogeneity due to expert sparsity, and ensuring quality aggregation of expert updates—a combination not previously achieved in federated LLM adaptation.
Core Contributions and Motivation
The authors identify that current federated LoRA approaches, designed for client heterogeneity, compress global LoRA matrices (e.g., by reducing rank via SVD) before distribution, thereby introducing information loss and limiting downstream performance. Critically, empirical analysis shows that LoRA rank reduction does not meaningfully reduce client FLOPs, because the dominant cost remains in the frozen base model forward pass, not the adaptation modules. Thus, these approaches fail to provide true resource adaptivity.
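To see why rank reduction barely moves the compute needle, consider a rough per-token FLOPs estimate for a single projection with a LoRA adapter. The sketch below is purely illustrative, with hypothetical dimensions rather than figures from the paper:

```python
# Rough, illustrative per-token FLOPs for one weight matrix W (d_model x d_model)
# with a LoRA adapter of rank r. Dimensions are hypothetical, not from the paper.

def matmul_flops(m: int, n: int) -> int:
    """Approximate multiply-accumulate FLOPs for a vector-matrix product (1 x m) @ (m x n)."""
    return 2 * m * n

d_model = 4096                                    # hypothetical hidden size
base = matmul_flops(d_model, d_model)             # frozen base projection: x @ W

for r in (64, 8):                                 # full LoRA rank vs. aggressively compressed rank
    lora = matmul_flops(d_model, r) + matmul_flops(r, d_model)   # x @ A @ B
    total = base + lora
    print(f"rank={r}: LoRA share of projection FLOPs = {lora / total:.2%}")

# The frozen base projection dominates, so shrinking r changes total FLOPs by only
# a few percent -- consistent with the paper's observation that rank compression
# does not deliver real resource adaptivity.
```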
FLAME addresses this with three key innovations:
- Resource Adaptivity via SMoE: Instead of compressing LoRA matrices, FLAME keeps the global LoRA rank unchanged and achieves workload adaptation by varying the number of activated experts per client in SMoE layers. This approach enables direct, substantial reductions in per-client computation without degrading representation capacity.
- Lightweight Rescaling Mechanism: Since activating fewer experts than in full-capacity mode changes output magnitude, FLAME includes a learnable affine rescaler for each client, trained from scratch, to realign partial expert outputs. This mechanism aligns the outputs irrespective of variable expert participation and generalizes better than static multiplicative scaling.
- Activation-Aware Aggregation: Expert frequency imbalance across clients (in both data distribution and compute allocation) can result in poorly aggregated global experts. FLAME introduces an aggregation scheme in which each client's contribution to an expert's LoRA update is weighted by the product of its dataset size and the expert’s activation frequency (with a temperature parameter). This design prevents updates from rarely activated local experts from contaminating the global model and properly reflects individual expert quality.
Detailed Method
- Federated SMoE Fine-Tuning: Each client receives the full-rank global LoRA matrices for all SMoE experts and locally fine-tunes using k_i (client-specific) activated experts per forward pass, chosen by a routing mechanism (e.g., TopK over routing probabilities). The remaining experts are frozen for that client (see the client-side sketch after this list).
- Rescaling: Outputs from SMoE layers running fewer active experts are corrected by a learnable scalar s_i, trained per client during local fine-tuning.
- Aggregation: For each expert j and client i, the aggregation weight is γ_ij = (a_ij / S_i)^t · |D_i|, where a_ij is the number of times expert j was activated over the S_i local steps, |D_i| is the client's dataset size, and t is a tunable temperature. This variant of federated averaging stabilizes the global model and aligns updates with actual expert usage and training quality (see the aggregation sketch below).
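A minimal client-side sketch of this procedure, assuming a standard TopK softmax router and treating each expert as a LoRA-augmented feed-forward module; class and argument names (`SparseMoEClientLayer`, `num_active_experts`) are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEClientLayer(nn.Module):
    """Illustrative SMoE layer for one client: only k_i experts are activated per
    token, and the combined output is realigned by a learnable scalar s_i."""

    def __init__(self, experts: nn.ModuleList, router: nn.Linear, num_active_experts: int):
        super().__init__()
        self.experts = experts                      # LoRA-augmented experts, full global rank
        self.router = router                        # shared routing layer (d_model -> num_experts)
        self.k = num_active_experts                 # client-specific compute budget k_i
        self.rescale = nn.Parameter(torch.ones(1))  # learnable rescaler s_i, trained from scratch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)            # routing probabilities over all experts
        topk_p, topk_idx = probs.topk(self.k, dim=-1)        # keep only the k_i best experts per token
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)   # renormalize over the active experts

        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            for e in idx.unique().tolist():                  # dispatch tokens to their chosen experts
                mask = idx == e
                out[mask] += topk_p[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])

        return self.rescale * out                            # realign magnitude despite fewer experts
```

Under this reading, only the LoRA parameters of the experts a client actually activates, plus its scalar rescaler, receive gradients during local fine-tuning; unselected experts are neither computed nor updated, which is where the FLOPs reduction comes from.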
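And a correspondingly minimal server-side sketch of the activation-aware aggregation for a single expert, using the weight γ_ij = (a_ij / S_i)^t · |D_i| defined above and normalizing it across clients (the normalization is my assumption; function and variable names are illustrative):

```python
import numpy as np

def aggregate_expert_lora(client_updates, activations, steps, dataset_sizes, temperature=1.0):
    """Aggregate one expert's LoRA update across clients.

    client_updates : list of np.ndarray, client i's LoRA delta for this expert
    activations    : list of int, a_ij = times this expert was activated on client i
    steps          : list of int, S_i = local training steps on client i
    dataset_sizes  : list of int, |D_i|
    temperature    : t, sharpens the activation-frequency term
    """
    weights = np.array([
        (a / s) ** temperature * d
        for a, s, d in zip(activations, steps, dataset_sizes)
    ])
    if weights.sum() == 0:          # expert never activated anywhere this round
        return None                 # keep the previous global expert unchanged
    weights = weights / weights.sum()
    return sum(w * upd for w, upd in zip(weights, client_updates))
```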
Empirical Evidence
Comprehensive experiments across LLM architectures (dense and SMoE-based), datasets (AlpaGasus and Dolly), and simulation of diverse client budgets show several robust findings:
- True Resource Adaptation: Halving the number of active experts per client in SMoE cuts FLOPs by >50%, whereas LoRA rank compression yields only a 1–2% reduction.
- Superior Downstream Performance: Under all resource budgets and data heterogeneity scenarios, FLAME consistently outperforms standard LoRA, rank-compressed federated LoRA (e.g., HLoRA, FlexLoRA), and trivial baselines. Notably, in low-resource regimes, FLAME's margin over the baselines is substantial (often more than 10 points on the reported scores).
- Client Population Scalability: With 40 clients, FLAME's advantage remains or widens, indicating robust scaling and stability in practical federated environments.
- Resilience to Client Sampling: When only 25–50% of clients participate per round (to mimic real-world asynchrony), FLAME's performance degrades gracefully and is impacted less than other methods.
- Ablations: Both the learnable rescaler and the activation-aware aggregation scheme provide measurable benefits over simpler alternatives (static scaling, vanilla averaging).
- Aggregation Temperature Effect: Higher temperature values (up to t=4 or t=8) further improve performance under high data/expert heterogeneity, validating the aggregation approach.
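As a worked (purely illustrative) reading of the temperature's role in the aggregation weight above: an expert activated on 80% of one client's local steps and on 20% of another's contributes activation terms of 0.8 versus 0.2 at t = 1, but 0.8^4 ≈ 0.41 versus 0.2^4 ≈ 0.0016 at t = 4, so higher temperatures sharply suppress local expert copies that were rarely trained before dataset-size weighting is applied.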
Practical and Theoretical Implications
Practical implications of FLAME include:
- Realistic Device Adaptation: Mobile and resource-constrained devices can participate in federated LLM fine-tuning without the need to compress adaptation matrices or be excluded due to computational budget mismatches. Reducing the number of active experts gives direct control over local FLOPs.
- Data-Activity Aligned Aggregation: Updates reflect actual training intensity per expert—critical in non-i.i.d., skewed, or stratified federated regimes.
- Improved Deployment Efficiency: At inference, the sparse expert configuration can also be leveraged, retaining deployment gains obtained during training.
Theoretical implications are:
- Federated Averaging Generalization: The activation-aware weighting scheme extends classical FedAvg, introducing aggregation that reflects both compute and data heterogeneity, and draws connections to importance sampling and dynamic federated aggregation.
- SMoE Federated Optimization: Demonstrates that federated methods specifically designed for sparse, modular foundation architectures are essential, as naive per-parameter or per-module strategies are insufficient.
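As a sanity check on the FedAvg connection (my normalization, derived from the formula stated above rather than quoted from the paper, for experts that every contributing client activated at least once): setting t = 0 removes the activation term and recovers dataset-size-proportional FedAvg.

```latex
\tilde{\gamma}_{ij}
  = \frac{\left(a_{ij}/S_i\right)^{t}\,|D_i|}{\sum_{i'}\left(a_{i'j}/S_{i'}\right)^{t}\,|D_{i'}|}
\;\;\xrightarrow{\,t\,=\,0\,}\;\;
  \frac{|D_i|}{\sum_{i'}|D_{i'}|}
\qquad \text{(classical FedAvg weighting)}
```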
Limitations and Future Directions
Limitations include a focus on SMoE architectures, though the trend toward sparse expert architectures (as seen in recent LLMs) makes this choice timely. Evaluation is limited to medium-scale SMoE models due to resource constraints; scaling to trillion-parameter models is an open engineering frontier.
Potential avenues for future work:
- Broader SMoE Architectures: Extending to more hierarchical or dynamic routing mechanisms.
- Other Adaptation Modules: Generalizing to parameter-efficient adaptation strategies beyond LoRA.
- Client Scheduling and Privacy: Integrating with privacy-preserving mechanisms or adapting to highly dynamic/ephemeral edge clients.
Conclusion
FLAME establishes that resource adaptation in federated fine-tuning of LLMs is better addressed by architecture-aware strategies (expert selection) than by lossy compression (LoRA rank reduction). Its robust empirical superiority across challenging scenarios and its modular, scalable implementation make it a strong candidate for practical deployment in privacy-minded, resource-diverse environments as LLMs continue their shift to SMoE backbones.