Stacking-Based Aggregation (FLoRA)

Updated 24 November 2025
  • Stacking-based aggregation is a method that precisely concatenates block-disjoint low-rank client updates to eliminate cross-term noise in federated learning.
  • It forms stacked matrices from individual client updates, supporting heterogeneous adapter ranks without the need for zero-padding or rigid constraints.
  • Empirical results show FLoRA improves fine-tuning and hyperparameter optimization efficiency while reducing communication overhead and scaling to many clients.

Stacking-based aggregation, as instantiated in the FLoRA method, refers to mathematically precise matrix stacking strategies for federated aggregation of heterogeneous low-rank model updates, particularly in LLM fine-tuning and federated hyperparameter optimization. This approach addresses key aggregation noise and rank heterogeneity issues in previous federated learning (FL) protocols, providing noise-free, scalable, and efficient update composition across diverse clients. The core principle is the elimination of cross-term aggregation error by concatenating and summing block-disjoint client updates, ensuring faithful and resource-appropriate federated model improvement. The stacking-based paradigm underpins two prominent works: FLoRA for federated LLM fine-tuning with arbitrary low-rank adapters (Wang et al., 9 Sep 2024) and FLoRA for single-shot federated hyperparameter optimization via surrogate regression stacking (Zhou et al., 2021).

1. Federated Fine-Tuning and the Aggregation Challenge

Federated fine-tuning of LLMs involves $K$ clients, each accessing a shared frozen model $W \in \mathbb{R}^{m \times n}$. Clients train local low-rank adapters $\Delta W_k = B_k A_k$ (with $A_k \in \mathbb{R}^{r_k \times n}$, $B_k \in \mathbb{R}^{m \times r_k}$, $r_k \ll \min(m, n)$), reflecting individual data and resource profiles. The server’s objective is to aggregate these $\Delta W_k$ into a unified global update $\Delta W$. Traditional approaches (notably FedAvg-LoRA/FedIT) average $A_k$ and $B_k$ independently and compute the product, leading to

$$\Delta W = \Big(\sum_{k} p_k B_k\Big)\Big(\sum_{j} p_j A_j\Big)$$

which expands to include cross-terms $p_k p_j B_k A_j$ ($k \ne j$), introducing "aggregation noise." This noise corrupts the desired weighted sum $\sum_k p_k \Delta W_k$, and the element-wise averaging additionally forces all $r_k$ to be identical, which is a poor fit for heterogeneous client capability (Wang et al., 9 Sep 2024).
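
To make the cross-term issue concrete, the following minimal NumPy sketch (illustrative only; toy dimensions and uniform weights are assumptions, not values from the paper) compares the FedAvg-style product of averaged adapters with the intended weighted sum of per-client updates:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, K = 6, 5, 2, 3            # toy dimensions; FedAvg-LoRA needs one common rank r
p = np.full(K, 1.0 / K)            # uniform client weights p_k (assumed for illustration)
A = [rng.normal(size=(r, n)) for _ in range(K)]   # A_k in R^{r x n}
B = [rng.normal(size=(m, r)) for _ in range(K)]   # B_k in R^{m x r}

# Desired aggregate: sum_k p_k B_k A_k
target = sum(p[k] * B[k] @ A[k] for k in range(K))

# FedAvg-LoRA / FedIT: average A_k and B_k separately, then multiply
averaged = sum(p[k] * B[k] for k in range(K)) @ sum(p[k] * A[k] for k in range(K))

# The averaged product equals sum_k p_k^2 B_k A_k plus the cross-terms p_k p_j B_k A_j (k != j)
diag = sum(p[k] ** 2 * B[k] @ A[k] for k in range(K))
cross = sum(p[k] * p[j] * B[k] @ A[j]
            for k in range(K) for j in range(K) if k != j)

print(np.linalg.norm(averaged - target))      # nonzero: "aggregation noise"
print(np.allclose(averaged, diag + cross))    # True: mis-scaled diagonal terms plus cross-terms
```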

2. Stacking-Based Aggregation: Mathematical Principles

Stacking-based aggregation avoids cross-terms and supports arbitrary per-client ranks via direct blockwise concatenation. Given local adapters $\{(A_k, B_k)\}_{k=1,\dots,K}$, construct

  • $A_{\mathrm{stack}} = [A_1; A_2; \dots; A_K] \in \mathbb{R}^{(\sum_k r_k) \times n}$ (vertical stack of the $A_k$)
  • $B_{\mathrm{stack}} = [B_1, B_2, \dots, B_K] \in \mathbb{R}^{m \times (\sum_k r_k)}$ (horizontal stack of the $B_k$)

The global update is then

$$\Delta W_{\mathrm{global}} = B_{\mathrm{stack}} A_{\mathrm{stack}}$$

which, due to the block-disjoint structure, reduces precisely to $\sum_k B_k A_k$. Weighting of client contributions is handled by scaling the matrices before stacking, i.e., $A_{\mathrm{stack}} = [p_1 A_1; \dots; p_K A_K]$ yields $\Delta W_{\mathrm{global}} = \sum_k p_k B_k A_k$. No zero-padding or block-diagonal encoding is necessary, and heterogeneous adapter ranks are accommodated seamlessly.
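
As a sanity check, the short NumPy sketch below (toy shapes and hypothetical weights, not taken from the paper) builds $A_{\mathrm{stack}}$ and $B_{\mathrm{stack}}$ from clients with heterogeneous ranks and verifies that the product equals the weighted sum of the individual updates:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 6
ranks = [4, 2, 1]                  # heterogeneous per-client ranks r_k (illustrative)
p = [0.5, 0.3, 0.2]                # client weights p_k (illustrative)
A = [rng.normal(size=(r, n)) for r in ranks]   # A_k in R^{r_k x n}
B = [rng.normal(size=(m, r)) for r in ranks]   # B_k in R^{m x r_k}

# Scale each A_k by p_k, stack A vertically and B horizontally
A_stack = np.vstack([p_k * A_k for p_k, A_k in zip(p, A)])   # (sum_k r_k) x n
B_stack = np.hstack(B)                                       # m x (sum_k r_k)

delta_W_global = B_stack @ A_stack

# Block-disjoint structure: exactly the weighted sum, with no cross-terms
expected = sum(p_k * B_k @ A_k for p_k, B_k, A_k in zip(p, B, A))
print(np.allclose(delta_W_global, expected))   # True
print(delta_W_global.shape)                    # (8, 6): full-size update from rank-4/2/1 adapters
```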

3. Aggregation Algorithm and Workflow

A single FLoRA round is characterized by the following protocol:

Server:

  1. Broadcasts the frozen global $W$ to clients.
  2. Receives $\{A_k, B_k\}$ from each client $k$.
  3. Forms $A_{\mathrm{stack}}$ (vertically concatenated, with client scaling $p_k$) and $B_{\mathrm{stack}}$ (concatenated horizontally).
  4. Computes $\Delta W_{\mathrm{global}} = B_{\mathrm{stack}} A_{\mathrm{stack}}$.
  5. Distributes $(A_{\mathrm{stack}}, B_{\mathrm{stack}})$ to all clients for update integration.

Client $k$:

  1. Initializes a LoRA module $(A_k, B_k)$ with local rank $r_k$.
  2. Fine-tunes locally for $E$ epochs (keeping $W$ frozen).
  3. Sends $(A_k, B_k)$ to the server, awaits $(A_{\mathrm{stack}}, B_{\mathrm{stack}})$.
  4. Updates the local model by adding $\Delta W_{\mathrm{global}}$ to $W$.

This workflow is preserved across rounds, supports arbitrary client configuration, and is communication- and computation-efficient (Wang et al., 9 Sep 2024).
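
The following schematic Python sketch mirrors this round structure under simplifying assumptions: local training is replaced by randomly initialized placeholder adapters (the `client_update` function is a hypothetical stand-in, not the paper's training code), while the stacking and update steps follow the protocol above.

```python
import numpy as np

def client_update(W, r_k, epochs=1, rng=None):
    """Hypothetical client step: initialize a rank-r_k LoRA pair for frozen W.
    A real FLoRA client would fine-tune (A_k, B_k) on local data for `epochs`
    epochs; placeholder adapters of the right shapes stand in for that here."""
    rng = rng or np.random.default_rng()
    m, n = W.shape
    A_k = rng.normal(scale=0.01, size=(r_k, n))
    B_k = rng.normal(scale=0.01, size=(m, r_k))
    return A_k, B_k

def server_aggregate(adapters, weights):
    """Stack p_k-scaled A_k vertically and B_k horizontally (no zero-padding)."""
    A_stack = np.vstack([p_k * A_k for p_k, (A_k, _) in zip(weights, adapters)])
    B_stack = np.hstack([B_k for _, B_k in adapters])
    return A_stack, B_stack

# One round with K = 3 clients and heterogeneous ranks (illustrative values)
rng = np.random.default_rng(2)
W = rng.normal(size=(16, 12))                      # frozen global weight matrix
ranks, weights = [8, 4, 2], [0.5, 0.3, 0.2]

adapters = [client_update(W, r, rng=rng) for r in ranks]   # clients train locally, send (A_k, B_k)
A_stack, B_stack = server_aggregate(adapters, weights)     # server stacks and redistributes
W = W + B_stack @ A_stack                                  # each client applies Delta W_global to W
```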

4. Theoretical Properties and Correctness

The stacking method’s correctness follows from linearity and the disjointness of the block partitions of $A_{\mathrm{stack}}$ and $B_{\mathrm{stack}}$. Specifically:

  • Each $A_k$ occupies a distinct block of rows in $A_{\mathrm{stack}}$ and each $B_k$ a distinct block of columns in $B_{\mathrm{stack}}$, so no cross-block products $B_k A_j$ ($k \ne j$) arise.
  • Weighted stacking (scaling by $p_k$) produces exactly the intended aggregation $\sum_k p_k B_k A_k$.
  • No quadratic terms ($p_k^2$) or aggregation noise from cross-terms appears.
  • No information from any client is lost; each contribution is embedded in a unique submatrix.

The block-matrix view formalizes that $B_{\mathrm{stack}} A_{\mathrm{stack}}$ sums only the correct local updates, bypassing the constraints and inaccuracies inherent in previous federated LoRA aggregation schemes.
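
A compact block-partitioned product, written here in LaTeX using the notation from Section 2, spells out why no spurious terms can appear:

```latex
% Block-partitioned product with client scaling p_k (notation as in Section 2).
\[
B_{\mathrm{stack}} A_{\mathrm{stack}}
  = \begin{bmatrix} B_1 & B_2 & \cdots & B_K \end{bmatrix}
    \begin{bmatrix} p_1 A_1 \\ p_2 A_2 \\ \vdots \\ p_K A_K \end{bmatrix}
  = \sum_{k=1}^{K} B_k \,(p_k A_k)
  = \sum_{k=1}^{K} p_k B_k A_k .
\]
% Each B_k multiplies only the row block of A_stack contributed by client k,
% so no cross-terms B_k A_j (k \neq j) and no p_k^2 factors can arise.
```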

5. Empirical Evaluation and Results

Experiments on MMLU (QA), MT-bench (chat), and standard LLM backbones (TinyLlama-1.1B, Llama-7B) demonstrate that stacking-based FLoRA outperforms baseline FedIT (FedAvg-LoRA) in both homogeneous and heterogeneous rank configurations:

  • On MMLU-Dolly with TinyLlama-1.1B: FedIT achieves $16.35\%$, FLoRA reaches $30.80\%$.
  • On TinyLlama (MT-bench): FedIT $2.92$, FLoRA $3.13$.
  • Llama-7B shows consistent improvements of $+1$–$2$ points.
  • Heterogeneous ranks ($[64, 32, 16, 8, 4, \dots]$): FedIT with zero-padding degrades (MMLU-Alpaca $\approx 8\%$), while FLoRA maintains high performance (high 20s to low 30s on MMLU, $3.1$–$4.2$ on MT-bench).
  • FLoRA combined with AdaLoRA further reduces the total rank budget (from $160$ to $120$) with negligible accuracy loss.
  • The scaling factor $p_k$ has no universal optimum; the best value is dataset- and model-dependent ($0.01$–$0.2$ explored) (Wang et al., 9 Sep 2024).

Global models consistently outperform any constituent local model across all ablation studies, and in some tasks stacking-based aggregation even slightly outperforms centralized LoRA, plausibly due to decreased overfitting from better-regularized aggregation.

6. Communication, Computation, and Scalability Considerations

FLoRA's stacking-based setup imposes only marginal overhead:

  • Each round transmits $O(\sum_k r_k (m+n))$ elements, a fraction of the $O(|W|)$ cost of full model transfer.
  • Over three rounds, FLoRA sends $5$–$8\times$ fewer bytes than full fine-tuning, and only $10$–$20\%$ more than FedIT.
  • The stacking operation costs $O(m \sum_k r_k + n \sum_k r_k)$ and is negligible in the context of LLM computation.
  • FLoRA scales to $K > 100$ clients and arbitrary $r_k$ values without modification, and is compatible with secure aggregation, encryption, and differential privacy, as only adapters are transmitted (Wang et al., 9 Sep 2024). A back-of-the-envelope cost sketch follows this list.
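
The following back-of-the-envelope calculation (with illustrative dimensions and ranks, not values reported in the paper) shows how adapter traffic compares to shipping a full weight matrix $W$:

```python
# Back-of-the-envelope traffic for one weight matrix, using illustrative
# dimensions and ranks (assumptions, not paper-reported values).
m, n = 4096, 4096                        # hypothetical W in R^{m x n}
ranks = [64, 32, 16, 8, 4]               # hypothetical heterogeneous client ranks r_k

full_model = m * n                                   # elements to ship the full matrix
adapter_uplink = sum(r * (m + n) for r in ranks)     # clients upload A_k, B_k: sum_k r_k (m + n)
stacked_downlink = sum(ranks) * (m + n)              # server broadcasts (A_stack, B_stack)

print(f"full matrix:       {full_model:,} elements")
print(f"adapter uplink:    {adapter_uplink:,} elements "
      f"({adapter_uplink / full_model:.1%} of full transfer)")
print(f"stacked downlink:  {stacked_downlink:,} elements")
```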

7. Stacking-Based Aggregation in Federated Hyperparameter Optimization

A parallel application of stacking-based aggregation appears in "FLoRA: Single-shot Hyper-parameter Optimization for Federated Learning" (Zhou et al., 2021). Here, the stacking construct is used for surrogate loss surface aggregation in federated HPO:

  • Each client $i$ locally fits a regressor $f^{(i)}: \Lambda \to \mathbb{R}$ (e.g., random forest, GP) to observed $(\lambda, \text{loss})$ pairs.
  • The aggregator combines these via four possible strategies, one of which, APLM ("average of per-client models"), is a stacking-style ensemble: $\ell_{\mathrm{agg}}(\lambda) = \frac{1}{p} \sum_{i=1}^{p} f^{(i)}(\lambda)$; a minimal sketch follows this list.
  • The aggregated surrogate guides a global hyperparameter choice in a single communication round, minimizing overhead and achieving low regret and robust performance as $p$ grows.
  • Empirical results on gradient-boosted trees and neural networks validate stacking’s effectiveness and communication efficiency in federated HPO (Zhou et al., 2021).
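
A minimal sketch of the APLM-style aggregation, assuming scikit-learn random-forest surrogates, a 2-D hyperparameter space, and synthetic per-client loss observations (none of these specifics come from the paper), could look as follows:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Hypothetical per-client surrogates: each client fits a regressor mapping a
# 2-D hyperparameter vector lambda to an observed (synthetic) validation loss.
surrogates = []
for _ in range(5):                                        # p = 5 clients
    lam = rng.uniform(0.0, 1.0, size=(40, 2))             # locally evaluated configs
    loss = (lam[:, 0] - 0.3) ** 2 + 0.5 * lam[:, 1] + rng.normal(0.0, 0.05, size=40)
    surrogates.append(RandomForestRegressor(n_estimators=50, random_state=0).fit(lam, loss))

def aggregated_loss(lam_grid):
    """APLM-style aggregation: average the per-client surrogate predictions."""
    return np.mean([f.predict(lam_grid) for f in surrogates], axis=0)

# Single-shot choice: pick the configuration minimizing the aggregated surrogate
grid = rng.uniform(0.0, 1.0, size=(2000, 2))
best = grid[np.argmin(aggregated_loss(grid))]
print("selected hyperparameters:", best)
```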

Summary

Stacking-based aggregation, as developed in FLoRA, constitutes a mathematically rigorous and resource-aware solution to federated aggregation of heterogeneous low-rank adaptations and local surrogate models. By precisely partitioning and summing blockwise contributions, stacking eliminates aggregation noise, enables flexible per-client participation, and achieves superior communication and computational efficiency. Its principles are central both to federated LLM fine-tuning with LoRA adapters (Wang et al., 9 Sep 2024) and to efficient single-shot federated HPO via ensemble surrogates (Zhou et al., 2021), marking a significant advancement in scalable and heterogeneous federated learning.
