Stacking-Based Aggregation (FLoRA)

Updated 24 November 2025
  • Stacking-based aggregation is a method that precisely concatenates block-disjoint low-rank client updates to eliminate cross-term noise in federated learning.
  • It forms stacked matrices from individual client updates, supporting heterogeneous adapter ranks without the need for zero-padding or rigid constraints.
  • Empirical results show FLoRA improves fine-tuning and hyperparameter optimization efficiency while reducing communication overhead and scaling to many clients.

Stacking-based aggregation, as instantiated in the FLoRA method, refers to mathematically precise matrix stacking strategies for federated aggregation of heterogeneous low-rank model updates, particularly in LLM fine-tuning and federated hyperparameter optimization. This approach addresses key aggregation noise and rank heterogeneity issues in previous federated learning (FL) protocols, providing noise-free, scalable, and efficient update composition across diverse clients. The core principle is the elimination of cross-term aggregation error by concatenating and summing block-disjoint client updates, ensuring faithful and resource-appropriate federated model improvement. The stacking-based paradigm underpins two prominent works: FLoRA for federated LLM fine-tuning with arbitrary low-rank adapters (Wang et al., 9 Sep 2024) and FLoRA for single-shot federated hyperparameter optimization via surrogate regression stacking (Zhou et al., 2021).

1. Federated Fine-Tuning and the Aggregation Challenge

Federated fine-tuning of LLMs involves $K$ clients, each accessing a shared frozen model $W \in \mathbb{R}^{m \times n}$. Clients train local low-rank adapters $\Delta W_k = B_k A_k$ (with $A_k \in \mathbb{R}^{r_k \times n}$, $B_k \in \mathbb{R}^{m \times r_k}$, $r_k \ll \min(m, n)$), reflecting individual data and resource profiles. The server’s objective is to aggregate these $\Delta W_k$ into a unified global update $\Delta W$. Traditional approaches (notably FedAvg-LoRA/FedIT) average $A_k$ and $B_k$ independently and compute the product, leading to

$$\Delta W = \Big(\sum_{k} p_k B_k\Big)\Big(\sum_{j} p_j A_j\Big)$$

which expands to include cross-terms $p_k p_j B_k A_j$ ($k \ne j$), introducing "aggregation noise." This noise corrupts the desired weighted sum $\sum_k p_k \Delta W_k$, and the element-wise averaging additionally forces all $r_k$ to be identical, which is a poor fit for heterogeneous client capability (Wang et al., 9 Sep 2024).
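
To make the cross-term issue concrete, the following minimal NumPy sketch (illustrative only; toy dimensions and uniform weights are assumptions, not values from the paper) compares the FedAvg-style product of averaged adapters with the intended weighted sum of per-client updates:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, K = 6, 5, 2, 3            # toy dimensions; FedAvg-LoRA needs one common rank r
p = np.full(K, 1.0 / K)            # uniform client weights p_k (assumed for illustration)
A = [rng.normal(size=(r, n)) for _ in range(K)]   # A_k in R^{r x n}
B = [rng.normal(size=(m, r)) for _ in range(K)]   # B_k in R^{m x r}

# Desired aggregate: sum_k p_k B_k A_k
target = sum(p[k] * B[k] @ A[k] for k in range(K))

# FedAvg-LoRA / FedIT: average A_k and B_k separately, then multiply
averaged = sum(p[k] * B[k] for k in range(K)) @ sum(p[k] * A[k] for k in range(K))

# The averaged product equals sum_k p_k^2 B_k A_k plus the cross-terms p_k p_j B_k A_j (k != j)
diag = sum(p[k] ** 2 * B[k] @ A[k] for k in range(K))
cross = sum(p[k] * p[j] * B[k] @ A[j]
            for k in range(K) for j in range(K) if k != j)

print(np.linalg.norm(averaged - target))      # nonzero: "aggregation noise"
print(np.allclose(averaged, diag + cross))    # True: mis-scaled diagonal terms plus cross-terms
```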

2. Stacking-Based Aggregation: Mathematical Principles

Stacking-based aggregation avoids cross-terms and supports arbitrary per-client ranks via direct blockwise concatenation. Given local adapters $\{(A_k, B_k)\}_{k=1,\dots,K}$, construct

  • $A_{\mathrm{stack}} = [A_1; A_2; \dots; A_K] \in \mathbb{R}^{(\sum_k r_k) \times n}$ (vertical stack of the $A_k$)
  • $B_{\mathrm{stack}} = [B_1, B_2, \dots, B_K] \in \mathbb{R}^{m \times (\sum_k r_k)}$ (horizontal stack of the $B_k$)

The global update is then

$$\Delta W_{\mathrm{global}} = B_{\mathrm{stack}} A_{\mathrm{stack}}$$

which, due to the block-disjoint structure, reduces precisely to $\sum_k B_k A_k$. Weighting of client contributions is handled by scaling the matrices before stacking, i.e., $A_{\mathrm{stack}} = [p_1 A_1; \dots; p_K A_K]$ yields $\Delta W_{\mathrm{global}} = \sum_k p_k B_k A_k$. No zero-padding or block-diagonal encoding is necessary, and heterogeneous adapter ranks are accommodated seamlessly.
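
As a sanity check, the short NumPy sketch below (toy shapes and hypothetical weights, not taken from the paper) builds $A_{\mathrm{stack}}$ and $B_{\mathrm{stack}}$ from clients with heterogeneous ranks and verifies that the product equals the weighted sum of the individual updates:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 6
ranks = [4, 2, 1]                  # heterogeneous per-client ranks r_k (illustrative)
p = [0.5, 0.3, 0.2]                # client weights p_k (illustrative)
A = [rng.normal(size=(r, n)) for r in ranks]   # A_k in R^{r_k x n}
B = [rng.normal(size=(m, r)) for r in ranks]   # B_k in R^{m x r_k}

# Scale each A_k by p_k, stack A vertically and B horizontally
A_stack = np.vstack([p_k * A_k for p_k, A_k in zip(p, A)])   # (sum_k r_k) x n
B_stack = np.hstack(B)                                       # m x (sum_k r_k)

delta_W_global = B_stack @ A_stack

# Block-disjoint structure: exactly the weighted sum, with no cross-terms
expected = sum(p_k * B_k @ A_k for p_k, B_k, A_k in zip(p, B, A))
print(np.allclose(delta_W_global, expected))   # True
print(delta_W_global.shape)                    # (8, 6): full-size update from rank-4/2/1 adapters
```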

3. Aggregation Algorithm and Workflow

A single FLoRA round is characterized by the following protocol:

Server:

  1. Broadcasts the frozen global $W$ to clients.
  2. Receives $\{A_k, B_k\}$ from each client $k$.
  3. Forms $A_{\mathrm{stack}}$ (vertically concatenated, with client scaling $p_k$) and $B_{\mathrm{stack}}$ (concatenated horizontally).
  4. Computes $\Delta W_{\mathrm{global}} = B_{\mathrm{stack}} A_{\mathrm{stack}}$.
  5. Distributes $(A_{\mathrm{stack}}, B_{\mathrm{stack}})$ to all clients for update integration.

Client $k$:

  1. Initializes a LoRA module $(A_k, B_k)$ with local rank $r_k$.
  2. Fine-tunes locally for $E$ epochs (keeping $W$ frozen).
  3. Sends $(A_k, B_k)$ to the server, awaits $(A_{\mathrm{stack}}, B_{\mathrm{stack}})$.
  4. Updates the local model by adding $\Delta W_{\mathrm{global}}$ to $W$.

This workflow is preserved across rounds, supports arbitrary client configuration, and is communication- and computation-efficient (Wang et al., 9 Sep 2024).
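
The following schematic Python sketch mirrors this round structure under simplifying assumptions: local training is replaced by randomly initialized placeholder adapters (the `client_update` function is a hypothetical stand-in, not the paper's training code), while the stacking and update steps follow the protocol above.

```python
import numpy as np

def client_update(W, r_k, epochs=1, rng=None):
    """Hypothetical client step: initialize a rank-r_k LoRA pair for frozen W.
    A real FLoRA client would fine-tune (A_k, B_k) on local data for `epochs`
    epochs; placeholder adapters of the right shapes stand in for that here."""
    rng = rng or np.random.default_rng()
    m, n = W.shape
    A_k = rng.normal(scale=0.01, size=(r_k, n))
    B_k = rng.normal(scale=0.01, size=(m, r_k))
    return A_k, B_k

def server_aggregate(adapters, weights):
    """Stack p_k-scaled A_k vertically and B_k horizontally (no zero-padding)."""
    A_stack = np.vstack([p_k * A_k for p_k, (A_k, _) in zip(weights, adapters)])
    B_stack = np.hstack([B_k for _, B_k in adapters])
    return A_stack, B_stack

# One round with K = 3 clients and heterogeneous ranks (illustrative values)
rng = np.random.default_rng(2)
W = rng.normal(size=(16, 12))                      # frozen global weight matrix
ranks, weights = [8, 4, 2], [0.5, 0.3, 0.2]

adapters = [client_update(W, r, rng=rng) for r in ranks]   # clients train locally, send (A_k, B_k)
A_stack, B_stack = server_aggregate(adapters, weights)     # server stacks and redistributes
W = W + B_stack @ A_stack                                  # each client applies Delta W_global to W
```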

4. Theoretical Properties and Correctness

The stacking method’s correctness follows from linearity and the disjointness of the block partitions of $A_{\mathrm{stack}}$ and $B_{\mathrm{stack}}$. Specifically:

  • Each $A_k$ occupies a distinct block of rows in $A_{\mathrm{stack}}$ and each $B_k$ a distinct block of columns in $B_{\mathrm{stack}}$, so no cross-block products $B_k A_j$ ($k \ne j$) arise.
  • Weighted stacking (scaling by $p_k$) produces exactly the intended aggregation $\sum_k p_k B_k A_k$.
  • No quadratic terms ($p_k^2$) or aggregation noise from cross-terms appears.
  • No information from any client is lost; each contribution is embedded in a unique submatrix.

The block-matrix view formalizes that $B_{\mathrm{stack}} A_{\mathrm{stack}}$ sums only the correct local updates, bypassing the constraints and inaccuracies inherent in previous federated LoRA aggregation schemes.
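
A compact block-partitioned product, written here in LaTeX using the notation from Section 2, spells out why no spurious terms can appear:

```latex
% Block-partitioned product with client scaling p_k (notation as in Section 2).
\[
B_{\mathrm{stack}} A_{\mathrm{stack}}
  = \begin{bmatrix} B_1 & B_2 & \cdots & B_K \end{bmatrix}
    \begin{bmatrix} p_1 A_1 \\ p_2 A_2 \\ \vdots \\ p_K A_K \end{bmatrix}
  = \sum_{k=1}^{K} B_k \,(p_k A_k)
  = \sum_{k=1}^{K} p_k B_k A_k .
\]
% Each B_k multiplies only the row block of A_stack contributed by client k,
% so no cross-terms B_k A_j (k \neq j) and no p_k^2 factors can arise.
```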

5. Empirical Evaluation and Results

Experiments on MMLU (QA), MT-bench (chat), and standard LLM backbones (TinyLlama-1.1B, Llama-7B) demonstrate that stacking-based FLoRA outperforms baseline FedIT (FedAvg-LoRA) in both homogeneous and heterogeneous rank configurations:

  • On MMLU-Dolly with TinyLlama-1.1B: FedIT achieves $16.35\%$, FLoRA reaches $30.80\%$.
  • On TinyLlama (MT-bench): FedIT $2.92$, FLoRA $3.13$.
  • Llama-7B shows consistent improvements of $+1$–$2$ points.
  • Heterogeneous ranks ($[64, 32, 16, 8, 4, \dots]$): FedIT with zero-padding degrades (MMLU-Alpaca $\approx 8\%$), while FLoRA maintains high performance (high 20s to low 30s on MMLU, $3.1$–$4.2$ on MT-bench).
  • FLoRA combined with AdaLoRA further reduces the total rank budget (from $160$ to $120$) with negligible accuracy loss.
  • The scaling factor $p_k$ has no universal optimum; the best value is dataset- and model-dependent ($0.01$–$0.2$ explored) (Wang et al., 9 Sep 2024).

Global models consistently outperform any constituent local model across all ablation studies, and in some tasks stacking-based aggregation even slightly outperforms centralized LoRA, plausibly due to decreased overfitting from better-regularized aggregation.

6. Communication, Computation, and Scalability Considerations

FLoRA's stacking-based setup imposes only marginal overhead:

  • Each round transmits $O(\sum_k r_k (m+n))$ elements, a fraction of the $O(|W|)$ cost of full model transfer.
  • Over three rounds, FLoRA sends $5$–$8\times$ fewer bytes than full fine-tuning, and only $10$–$20\%$ more than FedIT.
  • The stacking operation costs $O(m \sum_k r_k + n \sum_k r_k)$ and is negligible in the context of LLM computation.
  • FLoRA scales to $K > 100$ clients and arbitrary $r_k$ values without modification, and is compatible with secure aggregation, encryption, and differential privacy, as only adapters are transmitted (Wang et al., 9 Sep 2024). A back-of-the-envelope cost sketch follows this list.
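
The following back-of-the-envelope calculation (with illustrative dimensions and ranks, not values reported in the paper) shows how adapter traffic compares to shipping a full weight matrix $W$:

```python
# Back-of-the-envelope traffic for one weight matrix, using illustrative
# dimensions and ranks (assumptions, not paper-reported values).
m, n = 4096, 4096                        # hypothetical W in R^{m x n}
ranks = [64, 32, 16, 8, 4]               # hypothetical heterogeneous client ranks r_k

full_model = m * n                                   # elements to ship the full matrix
adapter_uplink = sum(r * (m + n) for r in ranks)     # clients upload A_k, B_k: sum_k r_k (m + n)
stacked_downlink = sum(ranks) * (m + n)              # server broadcasts (A_stack, B_stack)

print(f"full matrix:       {full_model:,} elements")
print(f"adapter uplink:    {adapter_uplink:,} elements "
      f"({adapter_uplink / full_model:.1%} of full transfer)")
print(f"stacked downlink:  {stacked_downlink:,} elements")
```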

7. Stacking-Based Aggregation in Federated Hyperparameter Optimization

A parallel application of stacking-based aggregation appears in "FLoRA: Single-shot Hyper-parameter Optimization for Federated Learning" (Zhou et al., 2021). Here, the stacking construct is used for surrogate loss surface aggregation in federated HPO:

  • Each client $i$ locally fits a regressor $f^{(i)}: \Lambda \to \mathbb{R}$ (e.g., random forest, GP) to observed $(\lambda, \text{loss})$ pairs.
  • The aggregator combines these via four possible strategies, one of which, APLM ("average of per-client models"), is a stacking-style ensemble: $\ell_{\mathrm{agg}}(\lambda) = \frac{1}{p} \sum_{i=1}^{p} f^{(i)}(\lambda)$; a minimal sketch follows this list.
  • The aggregated surrogate guides a global hyperparameter choice in a single communication round, minimizing overhead and achieving low regret and robust performance as $p$ grows.
  • Empirical results on gradient-boosted trees and neural networks validate stacking’s effectiveness and communication efficiency in federated HPO (Zhou et al., 2021).
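
A minimal sketch of the APLM-style aggregation, assuming scikit-learn random-forest surrogates, a 2-D hyperparameter space, and synthetic per-client loss observations (none of these specifics come from the paper), could look as follows:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Hypothetical per-client surrogates: each client fits a regressor mapping a
# 2-D hyperparameter vector lambda to an observed (synthetic) validation loss.
surrogates = []
for _ in range(5):                                        # p = 5 clients
    lam = rng.uniform(0.0, 1.0, size=(40, 2))             # locally evaluated configs
    loss = (lam[:, 0] - 0.3) ** 2 + 0.5 * lam[:, 1] + rng.normal(0.0, 0.05, size=40)
    surrogates.append(RandomForestRegressor(n_estimators=50, random_state=0).fit(lam, loss))

def aggregated_loss(lam_grid):
    """APLM-style aggregation: average the per-client surrogate predictions."""
    return np.mean([f.predict(lam_grid) for f in surrogates], axis=0)

# Single-shot choice: pick the configuration minimizing the aggregated surrogate
grid = rng.uniform(0.0, 1.0, size=(2000, 2))
best = grid[np.argmin(aggregated_loss(grid))]
print("selected hyperparameters:", best)
```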

Summary

Stacking-based aggregation, as developed in FLoRA, constitutes a mathematically rigorous and resource-aware solution to federated aggregation of heterogeneous low-rank adaptations and local surrogate models. By precisely partitioning and summing blockwise contributions, stacking eliminates aggregation noise, enables flexible per-client participation, and achieves superior communication and computational efficiency. Its principles are central both to federated LLM fine-tuning with LoRA adapters (Wang et al., 9 Sep 2024) and to efficient single-shot federated HPO via ensemble surrogates (Zhou et al., 2021), marking a significant advancement in scalable and heterogeneous federated learning.
