
FedGaLore: Adaptive Federated Optimization

Updated 9 February 2026
  • FedGaLore is a federated optimization framework that adapts client-side gradient subspaces to manage non-IID data distributions.
  • It employs a novel GaLore-style local optimization combined with server-side AJIVE for robust second-moment synchronization.
  • Empirical benchmarks show that FedGaLore achieves near full-tuning accuracy with significantly lower communication and storage overhead.

FedGaLore is a federated optimization framework that addresses the critical failure modes of Low-Rank Adaptation (LoRA) in data-heterogeneous federated learning, particularly under non-IID (not independent and identically distributed) client data. FedGaLore combines adaptive, gradient-subspace optimization ("GaLore"-style) on each client with a drift-robust server-side synchronization protocol for projected second-moment optimizer states, leveraging spectral shared-signal extraction via Angle-based JIVE (AJIVE). The methodological innovations and theoretical robustness guarantees of FedGaLore enable it to achieve near-full-tuning accuracy and stability with substantially reduced communication and storage overhead compared to full-parameter fine-tuning and existing LoRA-based approaches (Peng et al., 2 Feb 2026).

1. Federated Learning with Data Heterogeneity

In the federated setting, $M$ clients are each associated with local datasets $\mathcal{D}_i \sim \mathcal{P}_i$, leading to individual optimization objectives

$F_i(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{P}_i}\big[\ell(\theta; x, y)\big]$

and a global weighted-sum objective

$f(\theta) = \sum_{i=1}^M p_i F_i(\theta), \quad \sum_i p_i = 1.$

Data heterogeneity ($\mathcal{P}_i \ne \mathcal{P}_j$ for $i \ne j$) drives divergence between the client gradients $\nabla F_i(\theta)$ and the global gradient $\nabla f(\theta)$, resulting in "client drift." This drift is a central challenge in federated optimization, especially for low-rank adapter methods, where parameter constraints and optimizer-state misalignments exacerbate aggregation instability and loss of robustness.
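Client drift under the weighted objective above can be made concrete with a toy example. The quadratic losses, per-client optima `c`, and weights `p` below are illustrative values, not from the paper:

```python
import numpy as np

# Toy non-IID setting: two clients with hypothetical quadratic losses
# F_i(theta) = 0.5 * ||theta - c_i||^2 whose optima c_i differ.
c = np.array([[1.0, 0.0], [-1.0, 2.0]])  # per-client optima (illustrative)
p = np.array([0.6, 0.4])                 # aggregation weights, sum_i p_i = 1

def client_grad(i, theta):
    return theta - c[i]                  # grad F_i(theta)

def global_grad(theta):
    # grad f(theta) = sum_i p_i * grad F_i(theta)
    return sum(p[i] * client_grad(i, theta) for i in range(len(p)))

theta = np.zeros(2)
drifts = [np.linalg.norm(client_grad(i, theta) - global_grad(theta))
          for i in range(2)]
# Both drifts are strictly positive: each client's gradient deviates from
# the global descent direction, which is exactly the "client drift" effect.
```

Because the clients' optima differ, neither local gradient coincides with the global one at any shared iterate.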

2. Limitations of Federated LoRA: Mismatches in Subspace and Optimizer State

2.1 Update-Space Mismatch

Traditional federated LoRA methods restrict each client’s update to a low-rank manifold

$\mathcal{M}_{\le r} = \{ \Delta W : \operatorname{rank}(\Delta W) \le r \} \subset \mathbb{R}^{d_{\rm out}\times d_{\rm in}}$

by updating only the LoRA factors $(B, A)$ with $W = W^{(0)} + B A$ for $r \ll \min(d_{\rm out}, d_{\rm in})$. Client updates $\Delta W_i \in \mathcal{M}_{\le r}$ are aggregated in the server's full space, leading to

$\Delta\bar W = \sum_i \tilde p_i \Delta W_i,$

which can have rank up to $Mr \gg r$. This aggregation often places the result far from any "stable tube" around $\mathcal{M}_{\le r}$, an effect magnified by the large codimension dictated by Weyl's tube formula. Even small misalignments thus escalate rapidly, breaking the intended low-rank structure and compromising robustness.
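The rank-inflation effect of full-space averaging is easy to verify numerically. The sizes below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, M = 64, 64, 4, 8   # illustrative sizes

# Each client produces a rank-r LoRA-style update Delta W_i = B_i @ A_i.
updates = [rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in))
           for _ in range(M)]

# Server-side averaging happens in the full d_out x d_in space.
avg = sum(updates) / M

ranks = [np.linalg.matrix_rank(u) for u in updates]  # each is r = 4
agg_rank = np.linalg.matrix_rank(avg)                # generically M*r = 32
```

Generic rank-$r$ summands have independent column spaces, so the average lands at rank $Mr$, far outside the rank-$r$ manifold each client was constrained to.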

2.2 Optimizer-State Mismatch

Federated LoRA implementations leveraging adaptive optimizers (e.g., AdamW) accumulate optimizer state, namely momentum $(m_t)$ and second-moment $(v_t)$ buffers, on each client. Due to heterogeneous gradients, these states drift across clients and rounds. The optimizer-state misalignment is characterized analytically as $\|\theta^{i,k}_t - \theta^{\star,k}_t\|_2 \le R_{\rm loc}(\delta) = R_{\rm drift}(\delta) + R_{\rm state}$, where $R_{\rm drift}(\delta)$ arises from data drift and $R_{\rm state}$ from initial state discrepancies. The amplification of the second-moment mismatch $B_v$ by the adaptive preconditioner $(v+\epsilon)^{-1/2}$ underscores the need to synchronize optimizer statistics.
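A one-coordinate example shows how the preconditioner $(v+\epsilon)^{-1/2}$ amplifies a small second-moment mismatch; the numbers below are illustrative:

```python
import numpy as np

eps = 1e-8
g = 1.0                   # identical gradient coordinate on both clients
v_a, v_b = 1e-6, 4e-6     # mismatched second moments, B_v = |v_a - v_b| = 3e-6

step_a = g / np.sqrt(v_a + eps)
step_b = g / np.sqrt(v_b + eps)
ratio = step_a / step_b
# Although the absolute mismatch B_v is tiny, the resulting Adam-style steps
# differ by roughly 2x, because x -> (x + eps)^(-1/2) is steep near zero.
```

Two clients can therefore take very different parameter steps from the same gradient, purely because their second-moment buffers have drifted apart.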

3. FedGaLore: Client-Side Adaptive GaLore Optimization

FedGaLore replaces LoRA's fixed parameter subspaces with client-side, GaLore-style adaptive gradient subspaces. At each local step, each client constructs a rank-$r$ orthonormal projector $P$ (refreshed periodically with SVD or via a synchronized random seed) and projects the full local blockwise gradient $G_t$:

$\tilde G_t = G_t P^\top \in \mathbb{R}^{n\times r}$

Projected first and second moments $(\tilde m_t, \tilde v_t)$ are maintained and updated in the reduced space via

$\tilde m_{t} = \beta_1 \tilde m_{t-1} + (1-\beta_1)\tilde G_t, \qquad \tilde v_t = \beta_2 \tilde v_{t-1} + (1-\beta_2)(\tilde G_t \odot \tilde G_t).$

The update, preconditioned and mapped back to ambient space,

$U_t = (\tilde m_t \oslash \sqrt{\tilde v_t + \epsilon})\, P$

enables federated optimization in an adaptively chosen (not fixed) low-rank subspace aligned to current gradients. After initial local SVD-based projector refreshes, efficient seed-based random orthonormal projectors are used to avoid communication overhead.
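A minimal NumPy sketch of one projected step, under stated assumptions: the helper name is hypothetical, the transpose convention is chosen so that $P$ has orthonormal columns, and Adam bias correction is omitted for brevity:

```python
import numpy as np

def galore_adamw_step(W, grad_fn, P, m, v, lr=1e-3,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One GaLore-style step: project the gradient into the rank-r subspace,
    update the Adam moments there, precondition, and map back to ambient
    space. Simplified sketch; bias correction and weight decay are omitted."""
    G = grad_fn(W)                        # full gradient, n x d
    Gt = G @ P                            # projected gradient, n x r
    m = beta1 * m + (1 - beta1) * Gt      # low-rank first moment
    v = beta2 * v + (1 - beta2) * Gt**2   # low-rank second moment
    U = (m / np.sqrt(v + eps)) @ P.T      # preconditioned ambient update
    return W - lr * U, m, v
```

Iterating this step on a quadratic loss drives the residual component inside the projected subspace toward zero while leaving the orthogonal component untouched, which is the intended low-rank behavior.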

Adaptive projection confers improved alignment: endpoints after local training are connected by flatter loss barriers, enhancing aggregation robustness compared to fixed-LoRA regimes.

4. Server-Side Drift-Robust Synchronization via Spectral Shared-Signal Extraction

To address optimizer-state drift, FedGaLore synchronizes only the projected second-moment buffer. Each client uploads its final projected second moment $\tilde v^{i,k}_T$, which the server reconstructs into the full-dimensional second-moment view

$V^{i,k} = \tilde v^{i,k}_T P_k^\top \in \mathbb{R}^{n\times n}$

of rank $r$. These views are modeled as

$V^{i,k} = J^k + A^{i,k} + E^{i,k},$

where $J^k$ is the sought-after low-rank joint preconditioner, $A^{i,k}$ is client-specific drift, and $E^{i,k}$ represents noise.

AJIVE (Angle-based Joint and Individual Variation Explained) is applied to extract the joint subspace across all clients' views, filtering out idiosyncratic and noisy components. The server broadcasts the resulting low-rank joint second-moment buffer $\bar v_{k+1}$ as the initializer for the next round. This mechanism mitigates divergence due to asynchronous optimizer statistics and eliminates communication of full optimizer state.
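As a rough sketch of this consensus step, the snippet below uses a simplified SVD-based stand-in: it projects the averaged views onto their leading principal subspace. The paper's actual mechanism is AJIVE, which compares per-view principal subspaces via principal angles; the function name and rank parameter are hypothetical:

```python
import numpy as np

def joint_second_moment(views, joint_rank):
    """Extract a shared low-rank component from client second-moment views
    V^i = J + A^i + E^i. Simplified stand-in for AJIVE: averaging shrinks
    the idiosyncratic terms A^i and E^i, and the rank-restricted projection
    keeps only the dominant shared structure J."""
    avg = sum(views) / len(views)
    U, s, Vt = np.linalg.svd(avg, full_matrices=False)
    Ur = U[:, :joint_rank]
    joint = Ur @ Ur.T @ avg            # rank-restricted consensus buffer
    return np.maximum(joint, 0.0)      # second moments are nonnegative
```

Even this crude version recovers the shared component more accurately than any single client's noisy view, which is the property the server-side synchronization relies on.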

5. Theoretical Robustness Guarantees

FedGaLore's algorithmic design is supported by high-probability convergence guarantees. Under standard assumptions ($L$-smoothness, the PL inequality, bounded heterogeneity, gradient-norm clipping, and sub-Gaussian noise), the following local containment holds with probability at least $1-\delta$: $\|\theta^{i,k}_t-\theta^{\star,k}_t\|_2 \le R_{\rm drift}(\delta) + R_{\rm state}$, where, up to constants,

$R_{\rm drift}(\delta)\lesssim \frac{\eta T}{\sqrt\epsilon}\big(G+\epsilon_{\rm noise}(\delta)\big),\qquad R_{\rm state} \lesssim \eta \frac{B_m}{(1-\beta_1)\sqrt{\epsilon}} + \eta\frac{G\,B_v}{(1-\beta_2)\epsilon^{3/2}}.$

Update-space “tube failure” is shown to be endemic under high codimension, inevitably driving the aggregate outside any practical low-rank neighborhood unless subspace alignment corrections (as in GaLore) are used.

Coupled with an aggregation-stability argument, this analysis yields convergence, with probability $1-\delta$, to a bounded neighborhood (of order $O(R_{\rm loc})$) of the global PL solution, even under non-IID client distributions (Peng et al., 2 Feb 2026).

6. Algorithmic Details and Protocol Flow

FedGaLore operates in a classical federated round structure, outlined concisely as follows:

  1. The server broadcasts the global model $\bar\theta_k$ and a synchronization seed $s_k$ to clients.
  2. Each client constructs the local projector $P_0^i$ from $s_k$, initializes state $(\tilde m_0^i, \tilde v_0^i)$, and runs $T$ steps of GaLore-AdamW, periodically refreshing $P_t^i$.
  3. At round end, the client returns the model update $\Delta\theta_i$ and the final projected second moment $\tilde v_T^i$.
  4. The server applies weighted aggregation of the $\Delta\theta_i$ for the next model, reconstructs each client's second-moment view, and applies AJIVE to form the broadcast consensus second-moment buffer.
  5. The process repeats for subsequent rounds, with all randomness and projection bases synchronized via broadcast seeds.

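The round structure above can be sketched as a self-contained toy implementation. This is a hypothetical, simplified version: plain weighted averaging stands in for the AJIVE consensus step, the projector is fixed within a round, and all names and sizes are illustrative:

```python
import numpy as np

def fedgalore_round(theta, clients, seed, r=4, T=5, lr=1e-2,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified FedGaLore round. `clients` is a list of
    (grad_fn, weight) pairs; the AJIVE consensus is replaced here by a
    weighted average of the projected second moments."""
    n, d = theta.shape
    rng = np.random.default_rng(seed)                 # step 1: broadcast seed
    P, _ = np.linalg.qr(rng.standard_normal((d, r)))  # step 2: shared projector

    deltas, v_finals, weights = [], [], []
    for grad_fn, p_i in clients:                      # steps 2-3: local training
        W = theta.copy()
        m, v = np.zeros((n, r)), np.zeros((n, r))
        for _ in range(T):
            Gt = grad_fn(W) @ P                       # projected gradient
            m = beta1 * m + (1 - beta1) * Gt
            v = beta2 * v + (1 - beta2) * Gt**2
            W = W - lr * (m / np.sqrt(v + eps)) @ P.T
        deltas.append(W - theta)
        v_finals.append(v)
        weights.append(p_i)

    # step 4: weighted aggregation + second-moment consensus
    theta_next = theta + sum(w * d_ for w, d_ in zip(weights, deltas))
    v_bar = sum(w * v_ for w, v_ in zip(weights, v_finals))
    return theta_next, v_bar
```

Changing the seed each round rotates the shared subspace, so repeated rounds can make progress on the full parameter space despite the per-round rank-$r$ constraint.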

7. Empirical Evaluation and Benchmark Performance

FedGaLore is benchmarked on NLU (GLUE with RoBERTa-base), vision (DomainNet with ViT-base), and NLG (Llama-2-7B on MetaMathQA) tasks. Across settings ($M=50$ clients with Dirichlet non-IID partitions, $\alpha=0.5$, or $M=4$ for the LLM experiments), FedGaLore is compared to FedAvg-Full (full fine-tuning), FedIT, FFA-LoRA, FLoRA, FR-LoRA, and LoRA-Fair with task-matched hyperparameters.

The key metric, the non-IID vs. IID performance drop $\Delta$ (averaged across GLUE tasks, DomainNet domains, or NLG datasets), is tabulated as follows:

| Setting | FedAvg-Full | LoRA Baselines | FedGaLore$^-$ (no state sync) | FedGaLore (full) |
|---|---|---|---|---|
| GLUE | $\downarrow 1.5\%$ | $4\text{–}9\%$ | $\downarrow 4\%$ | $\downarrow \mathbf{1.2\%}$ |
| DomainNet | $\downarrow 4\%$ | $6\text{–}11\%$ | $\downarrow 9\%$ | $\downarrow \mathbf{4.2\%}$ |
| LLM (GSM8K/MATH) | $\downarrow 2.7/0.4$ | $1.8\text{–}5.1/0.0\text{–}0.9$ | $\downarrow 3.2/0.7$ | $\downarrow 2.7/0.6$ |

These results demonstrate that FedGaLore achieves near full-finetuning robustness with low-rank communication and modest server compute requirements; all observed performance improvements are attained without the need to transmit full-model or full-state information, even under severe client data heterogeneity (Peng et al., 2 Feb 2026).

8. Summary and Implications

FedGaLore advances federated fine-tuning by (i) replacing LoRA’s fixed parameter subspace with a locally adaptive, gradient-driven subspace (client-side GaLore), and (ii) synchronizing only the projected second-moment via AJIVE on the server, thereby resolving the principal challenges of update-space and optimizer-state mismatch. The protocol achieves robust, scalable cross-client aggregation in the non-IID regime, closely matching or exceeding the stability and accuracy of full fine-tuning while preserving the efficiency advantages of low-rank adaptation.

References (1)