HierFAVG: Hierarchical Federated Averaging

Updated 11 January 2026
  • HierFAVG is a hierarchical federated learning algorithm that leverages a three-tier client–edge–cloud structure to balance computation and communication trade-offs.
  • The algorithm employs local SGD updates along with intermediate edge and global cloud aggregations, ensuring theoretical convergence while mitigating model drift.
  • Empirical results on MNIST and CIFAR-10 show up to 3–4× faster training and a 30% reduction in device energy consumption compared to standard FedAvg.

HierFAVG (Hierarchical Federated Averaging) is a multi-tier federated learning algorithm designed to efficiently train machine learning models across decentralized datasets distributed over a client–edge–cloud architecture. Unlike classical FedAvg, which uses a single parameter server (typically in the cloud or at the edge), HierFAVG introduces an intermediate aggregation layer at the edge servers between clients and a central cloud server. This hierarchy enables partial aggregation at multiple levels, providing favorable computation–communication trade-offs and significant improvements in training speed and device energy efficiency while maintaining strong theoretical convergence guarantees (Liu et al., 2019, Yang et al., 2022).

1. Hierarchical Architecture and Optimization Problem

HierFAVG operates in a three-tier network comprising clients (also called workers), edge servers, and a cloud server. Each client $(i,\ell)$ in edge cluster $\ell$ holds a private dataset $\mathcal{D}_i^\ell$; edge $\ell$ aggregates over its assigned clients $\mathcal{C}^\ell$, whose combined data form $\mathcal{D}^\ell$; the cloud aggregates across all $L$ edge servers. The global learning objective is to minimize the loss

$$F(w) = \frac{1}{|\mathcal{D}|} \sum_{j \in \mathcal{D}} f_j(w),$$

where $\mathcal{D}$ denotes the union of all client datasets and $f_j$ the loss on sample $j$. Data may be statistically non-IID at both the worker and edge levels, quantified via the client–edge gradient divergence $\delta_i^\ell$ and the edge–cloud divergence $\Delta^\ell$.

The hierarchical structure allows adaptation to realistic network and data scenarios, where direct client–cloud communication is costly or infeasible, and edge resources can be leveraged for intermediate aggregation (Liu et al., 2019, Yang et al., 2022).
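To make the weighting in this objective concrete, the following minimal sketch (NumPy; the `client_losses` and `client_sizes` inputs are illustrative placeholders, not from the papers) evaluates $F(w)$ as a dataset-size-weighted average of per-client average losses, which is equivalent to the uniform average over all samples in $\mathcal{D}$.

```python
import numpy as np

def global_objective(w, client_losses, client_sizes):
    """Evaluate F(w) = (1/|D|) * sum_{j in D} f_j(w) as a size-weighted
    average of per-client average losses.

    client_losses : list of callables; client_losses[i](w) returns the
                    average loss over client i's local dataset at model w.
    client_sizes  : list of |D_i|, the sample count held by client i.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    losses = np.array([f(w) for f in client_losses])
    # Weighting each client's average loss by |D_i| / |D| recovers the
    # uniform average over the union of all samples in D.
    return float(np.dot(sizes, losses) / sizes.sum())

# Toy usage: two clients with quadratic losses and unequal data sizes.
print(global_objective(np.array([1.0]),
                       [lambda w: float((w[0] - 2) ** 2),
                        lambda w: float((w[0] + 1) ** 2)],
                       [30, 10]))
```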

2. Algorithmic Workflow and Model Update Rules

HierFAVG proceeds in rounds, alternating between local SGD updates at clients, periodic client–edge aggregations, and less frequent global (edge–cloud) aggregations. Core parameters include:

  • $\eta$: learning rate
  • $\kappa_1$ (or $\tau$): number of local SGD steps between edge aggregations
  • $\kappa_2$ (or $\pi$): number of edge aggregations between cloud synchronizations

Let $w_i^\ell(k)$ denote client $(i,\ell)$'s model at step $k$, $w^\ell(k)$ the edge model, and $w(k)$ the cloud/global model. The update rules are:

  • Local SGD at Clients:

$$w_i^\ell(k+1) = w_i^\ell(k) - \eta \nabla F_i^\ell(w_i^\ell(k))$$

  • Edge Aggregation (every $\kappa_1$ steps):

$$w^\ell(k) = \frac{1}{|\mathcal{D}^\ell|} \sum_{i \in \mathcal{C}^\ell} |\mathcal{D}_i^\ell|\, w_i^\ell(k)$$

  • Cloud Aggregation (every $\kappa_1 \kappa_2$ steps):

$$w(k) = \frac{1}{|\mathcal{D}|} \sum_{\ell=1}^{L} |\mathcal{D}^\ell|\, w^\ell(k)$$

After each aggregation, the resulting model is broadcast back down the hierarchy: edge servers send their aggregate to their clients, and the cloud sends the global model to all edges and clients.

The algorithm cycles through these steps for $K = B \cdot \kappa_1 \cdot \kappa_2$ total local SGD updates (i.e., $B$ cloud rounds), with suitable initialization and synchronization at each level (Liu et al., 2019, Yang et al., 2022).
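The nested structure of these updates can be summarized in the following minimal NumPy sketch. The gradient oracle `grad`, the cluster assignment `clusters`, and the per-client sample counts `sizes` are placeholders for illustration and are not taken from the papers.

```python
import numpy as np

def hierfavg(w0, grad, clusters, sizes, eta, kappa1, kappa2, B):
    """Sketch of HierFAVG with B cloud rounds.

    w0       : initial global model (1-D array)
    grad     : grad(i, w) -> stochastic gradient of client i's loss at w
    clusters : list of lists; clusters[l] holds the client ids of edge l
    sizes    : dict mapping client id -> |D_i|
    eta      : learning rate
    kappa1   : local SGD steps between edge aggregations
    kappa2   : edge aggregations between cloud aggregations
    """
    w_cloud = np.asarray(w0, dtype=float).copy()
    for _ in range(B):                               # cloud rounds
        edge_models, edge_sizes = [], []
        for clients in clusters:                     # each edge cluster
            # The cloud model is broadcast to every client in the cluster.
            w_clients = {i: w_cloud.copy() for i in clients}
            w_edge = w_cloud.copy()
            total = float(sum(sizes[i] for i in clients))
            for _ in range(kappa2):                  # edge rounds
                for i in clients:
                    for _ in range(kappa1):          # local SGD steps
                        w_clients[i] = w_clients[i] - eta * grad(i, w_clients[i])
                # Edge aggregation: size-weighted average of client models.
                w_edge = sum(sizes[i] * w_clients[i] for i in clients) / total
                # The edge broadcasts its aggregate back to its clients.
                for i in clients:
                    w_clients[i] = w_edge.copy()
            edge_models.append(w_edge)
            edge_sizes.append(total)
        # Cloud aggregation: size-weighted average of edge models.
        w_cloud = sum(s * m for s, m in zip(edge_sizes, edge_models)) / sum(edge_sizes)
    return w_cloud
```

In a real deployment the edge clusters run in parallel; the sequential loop above computes the same aggregates and is written this way only for readability.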

3. Mathematical Formulation and Model Drift

Between successive cloud aggregations, the local client update is simply SGD. Edge and cloud aggregation events replace all participating models with the respective weighted averages. This structure induces hierarchical model drift owing to updates on increasingly stale local parameters.

The full update at a cloud synchronization, i.e., whenever $k \bmod (\kappa_1 \kappa_2) = 0$, can be expressed as

$$w(k+1) = \frac{1}{|\mathcal{D}|} \sum_{\ell=1}^{L} |\mathcal{D}^\ell| \left( \frac{1}{|\mathcal{D}^\ell|} \sum_{i \in \mathcal{C}^\ell} |\mathcal{D}_i^\ell| \left( w_i^\ell(k) - \eta \nabla F_i^\ell(w_i^\ell(k)) \right) \right).$$

The hierarchical approach introduces two main sources of divergence:

  • Client–edge divergence $\delta$: measures statistical heterogeneity between a client and its edge-level aggregate.
  • Edge–cloud divergence $\Delta$: measures statistical heterogeneity between an edge and the global average.

These divergences, along with $\kappa_1$ and $\kappa_2$, determine how far models can drift from the global optimum during intermediate local and edge phases (Liu et al., 2019, Yang et al., 2022).
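As a rough illustration of these quantities, in the spirit of the bounded-divergence assumptions of Liu et al. (2019) but with hypothetical helper names, the sketch below estimates client–edge and edge–cloud gradient divergences at a common model as norms of gradient differences.

```python
import numpy as np

def divergences(grads, clusters, sizes):
    """Estimate client-edge divergences (delta_i^l) and edge-cloud
    divergences (Delta^l) from per-client gradients evaluated at a
    common model w.

    grads    : dict client id -> gradient of that client's local loss at w
    clusters : list of lists of client ids, one list per edge server
    sizes    : dict client id -> |D_i|
    """
    edge_grads, edge_sizes = [], []
    for clients in clusters:
        total = float(sum(sizes[i] for i in clients))
        edge_grads.append(sum(sizes[i] * grads[i] for i in clients) / total)
        edge_sizes.append(total)
    g_cloud = sum(s * g for s, g in zip(edge_sizes, edge_grads)) / sum(edge_sizes)

    # delta_i^l : gap between a client gradient and its edge-level aggregate.
    delta = {i: float(np.linalg.norm(grads[i] - edge_grads[l]))
             for l, clients in enumerate(clusters) for i in clients}
    # Delta^l : gap between an edge aggregate and the global (cloud) gradient.
    Delta = [float(np.linalg.norm(g - g_cloud)) for g in edge_grads]
    return delta, Delta
```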

4. Convergence Results and Theoretical Analysis

HierFAVG provides rigorous convergence guarantees under standard smoothness and bounded-divergence assumptions. For convex objectives, the deviation from the optimum is bounded by (Liu et al., 2019):

$$F(w(K)) - F(w^*) \leq \frac{1}{B\left[\eta\phi - \rho\, G_c / (\kappa_1 \kappa_2 \epsilon^2)\right]},$$

where $G_c$ quantifies hierarchical drift:

$$G_c = h(\kappa_1 \kappa_2, \Delta, \eta) + \frac{1}{2}\left(\kappa_2^2 + \kappa_2 - 1\right)(\kappa_1 + 1)\, h(\kappa_1, \delta, \eta),$$

with $h(n, \delta, \eta) = \frac{\delta}{\beta}\left[(1+\eta\beta)^n - 1\right] - \eta\delta n$, where $\beta$ is the smoothness constant.
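Read as a design tool, these expressions can be evaluated numerically to compare the drift term for candidate $(\kappa_1, \kappa_2)$ pairs. The short sketch below simply transcribes $h$ and $G_c$ as written above; the constants $\delta$, $\Delta$, $\beta$, and $\eta$ must be supplied or estimated, and the values used here are purely illustrative.

```python
def h(n, delta, eta, beta):
    """Drift function h(n, delta, eta) from the convergence bound
    (beta is the smoothness constant)."""
    return (delta / beta) * ((1 + eta * beta) ** n - 1) - eta * delta * n

def G_c(kappa1, kappa2, delta, Delta, eta, beta):
    """Hierarchical drift term G_c appearing in the HierFAVG bound."""
    return (h(kappa1 * kappa2, Delta, eta, beta)
            + 0.5 * (kappa2 ** 2 + kappa2 - 1) * (kappa1 + 1)
              * h(kappa1, delta, eta, beta))

# Example: drift grows with both aggregation periods (illustrative constants).
print(G_c(kappa1=5, kappa2=2, delta=0.1, Delta=0.05, eta=0.01, beta=1.0))
print(G_c(kappa1=20, kappa2=4, delta=0.1, Delta=0.05, eta=0.01, beta=1.0))
```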

In the non-convex setting, HierFAVG ensures that the time-averaged squared gradient norm converges to a neighborhood of zero, with the radius determined by cumulative drift.

A closely related independent analysis confirms a sublinear $O(1/T)$ rate (for total iterations $T$), plus a heterogeneity penalty $j(\tau, \pi)$ that increases with the length of local and edge intervals (Yang et al., 2022).

5. Communication–Computation Trade-Offs and Parameter Tuning

HierFAVG’s hierarchical structure allows explicit tuning of $\kappa_1$ and $\kappa_2$ (or $\tau, \pi$) to balance localized computation against the overhead of communication at each tier:

  • Smaller $\kappa_1$ (frequent edge aggregation): Reduces model drift and accelerates convergence but increases communication between clients and edges.
  • Larger $\kappa_2$ (infrequent cloud aggregation): Reduces global communication but can increase overall drift, especially if edge datasets are non-IID.

When edge-level data are IID ($\Delta \approx 0$), increasing $\kappa_2$ does not degrade convergence, enabling significant communication savings.

Design guidelines recommend small $\kappa_1$ where client–edge communication is cheap, large $\kappa_2$ when edge–cloud communication is expensive and edges see homogeneous data, and adaptive tuning of both in heterogeneous and resource-constrained environments. Diminishing step sizes $\eta$ are advised for asymptotic optimality in convex problems (Liu et al., 2019, Yang et al., 2022).
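One way to reason about this trade-off quantitatively is to count communication events per client and per edge for a fixed budget of $K$ local SGD steps. The helper below is a simple accounting sketch; the per-message cost parameters are illustrative assumptions, not values from the papers.

```python
def communication_counts(K, kappa1, kappa2):
    """Uplink aggregation events over K total local SGD steps: each client
    uploads to its edge every kappa1 steps, and each edge uploads to the
    cloud every kappa1 * kappa2 steps."""
    edge_uploads_per_client = K // kappa1
    cloud_uploads_per_edge = K // (kappa1 * kappa2)
    return edge_uploads_per_client, cloud_uploads_per_edge

def communication_cost(K, kappa1, kappa2, n_clients, n_edges,
                       cost_client_edge=1.0, cost_edge_cloud=10.0):
    """Total communication cost under illustrative per-message costs
    (edge-cloud links are assumed 10x more expensive here)."""
    e, c = communication_counts(K, kappa1, kappa2)
    return n_clients * e * cost_client_edge + n_edges * c * cost_edge_cloud

# Larger kappa2 cuts expensive cloud traffic; client-edge traffic is unchanged.
print(communication_cost(K=600, kappa1=6, kappa2=2, n_clients=50, n_edges=5))
print(communication_cost(K=600, kappa1=6, kappa2=10, n_clients=50, n_edges=5))
```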

6. Empirical Performance and Limitations

Empirical experiments using CNNs on MNIST and CIFAR-10, with various non-IID data configurations, confirm the theoretical findings:

  • Training speed: Wall-clock time to a fixed target accuracy is reduced by 3–4× compared to cloud-only FedAvg for both datasets (e.g., from $\approx 1.1\times 10^5$ s to $\approx 4.9\times 10^4$ s for CIFAR-10 at $70\%$ accuracy).
  • Device energy: End-device energy consumption is cut by up to $30\%$ for MNIST due to reduced uplink bandwidth and more efficient computation schedules.
  • Parameter sensitivity: Reducing $\kappa_1$ accelerates convergence; increasing $\kappa_2$ is safe only when edge-level data are homogeneous.
  • Heterogeneity sensitivity: Model drift grows and convergence slows dramatically if $\kappa_1$ or $\kappa_2$ are too large in highly non-IID cases.

HierFAVG’s main limitations are heightened sensitivity to large aggregation periods in the presence of data heterogeneity and delayed global alignment due to infrequent cloud aggregations. These limitations motivate newer variants, such as HierMo, which layer momentum on top of the HierFAVG baseline for provably tighter convergence bounds (Yang et al., 2022).

7. Extensions and Comparative Remarks

HierFAVG represents the archetype of multi-tier model-averaging in federated learning. Its simplicity enables easy analysis and practical deployment in heterogeneous networks, but leaves open the challenge of mitigating model drift in highly non-IID settings or under infrequent synchronization.

Recent work demonstrates that injecting momentum at either or both the worker and edge tiers (as in HierMo) yields strictly superior convergence rates, especially for deep or nonconvex models, by reducing oscillation and steady-state error due to drift. Optimization of aggregation periods, as in HierOPT, further refines the computation-communication trade-off.

A plausible implication is that the design and tuning of multi-tier FL algorithms in realistic networks should increasingly incorporate drift-mitigation techniques (such as momentum, adaptive periods, or personalized models) as network and data heterogeneity intensifies (Yang et al., 2022, Liu et al., 2019).
