HierFAVG: Hierarchical Federated Averaging

Updated 11 January 2026
  • HierFAVG is a hierarchical federated learning algorithm that leverages a three-tier client–edge–cloud structure to balance computation and communication trade-offs.
  • The algorithm employs local SGD updates along with intermediate edge and global cloud aggregations, ensuring theoretical convergence while mitigating model drift.
  • Empirical results on MNIST and CIFAR-10 show up to 3–4× faster training and a 30% reduction in device energy consumption compared to standard FedAvg.

HierFAVG (Hierarchical Federated Averaging) is a multi-tier federated learning algorithm designed to efficiently train machine learning models across decentralized datasets distributed over a client–edge–cloud architecture. Unlike classical FedAvg, which uses a single parameter server (typically in the cloud or at the edge), HierFAVG introduces an intermediate aggregation layer at the edge servers between clients and a central cloud server. This hierarchy enables partial aggregation at multiple levels, providing favorable computation–communication trade-offs and significant improvements in training speed and device energy efficiency while maintaining strong theoretical convergence guarantees (Liu et al., 2019, Yang et al., 2022).

1. Hierarchical Architecture and Optimization Problem

HierFAVG operates in a three-tier network comprising clients (also called workers), edge servers, and a cloud server. Each client $(i,\ell)$ in edge cluster $\ell$ holds a private dataset $\mathcal{D}_i^\ell$; edge $\ell$ aggregates over its assigned clients $\mathcal{C}^\ell$, whose combined data form $\mathcal{D}^\ell$; the cloud aggregates across all $L$ edge servers. The global learning objective is to minimize the loss

$$F(w) = \frac{1}{|\mathcal{D}|} \sum_{j \in \mathcal{D}} f_j(w),$$

where $\mathcal{D}$ denotes the union of all client datasets and $f_j$ the loss on sample $j$. Data may be statistically non-IID at both the worker and edge levels, quantified via the client–edge gradient divergence $\delta_i^\ell$ and the edge–cloud divergence $\Delta^\ell$.

The hierarchical structure allows adaptation to realistic network and data scenarios, where direct client–cloud communication is costly or infeasible, and edge resources can be leveraged for intermediate aggregation (Liu et al., 2019, Yang et al., 2022).
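To make the weighting in this objective concrete, the following minimal sketch (NumPy; the `client_losses` and `client_sizes` inputs are illustrative placeholders, not from the papers) evaluates $F(w)$ as a dataset-size-weighted average of per-client average losses, which is equivalent to the uniform average over all samples in $\mathcal{D}$.

```python
import numpy as np

def global_objective(w, client_losses, client_sizes):
    """Evaluate F(w) = (1/|D|) * sum_{j in D} f_j(w) as a size-weighted
    average of per-client average losses.

    client_losses : list of callables; client_losses[i](w) returns the
                    average loss over client i's local dataset at model w.
    client_sizes  : list of |D_i|, the sample count held by client i.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    losses = np.array([f(w) for f in client_losses])
    # Weighting each client's average loss by |D_i| / |D| recovers the
    # uniform average over the union of all samples in D.
    return float(np.dot(sizes, losses) / sizes.sum())

# Toy usage: two clients with quadratic losses and unequal data sizes.
print(global_objective(np.array([1.0]),
                       [lambda w: float((w[0] - 2) ** 2),
                        lambda w: float((w[0] + 1) ** 2)],
                       [30, 10]))
```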

2. Algorithmic Workflow and Model Update Rules

HierFAVG proceeds in rounds, alternating between local SGD updates at clients, periodic client–edge aggregations, and less frequent global (edge–cloud) aggregations. Core parameters include:

  • $\eta$: learning rate
  • $\kappa_1$ (or $\tau$): number of local SGD steps between edge aggregations
  • $\kappa_2$ (or $\pi$): number of edge aggregations between cloud synchronizations

Let $w_i^\ell(k)$ denote client $(i,\ell)$'s model at step $k$, $w^\ell(k)$ the edge model, and $w(k)$ the cloud/global model. The update rules are:

  • Local SGD at Clients:

$$w_i^\ell(k+1) = w_i^\ell(k) - \eta \nabla F_i^\ell(w_i^\ell(k))$$

  • Edge Aggregation (every $\kappa_1$ steps):

$$w^\ell(k) = \frac{1}{|\mathcal{D}^\ell|} \sum_{i \in \mathcal{C}^\ell} |\mathcal{D}_i^\ell|\, w_i^\ell(k)$$

  • Cloud Aggregation (every $\kappa_1 \kappa_2$ steps):

$$w(k) = \frac{1}{|\mathcal{D}|} \sum_{\ell=1}^{L} |\mathcal{D}^\ell|\, w^\ell(k)$$

After each aggregation, the resulting model is broadcast back down the hierarchy: edge servers send their aggregate to their clients, and the cloud sends the global model to all edges and clients.

The algorithm cycles through these steps for $K = B \cdot \kappa_1 \cdot \kappa_2$ total local SGD updates (i.e., $B$ cloud rounds), with suitable initialization and synchronization at each level (Liu et al., 2019, Yang et al., 2022).
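The nested structure of these updates can be summarized in the following minimal NumPy sketch. The gradient oracle `grad`, the cluster assignment `clusters`, and the per-client sample counts `sizes` are placeholders for illustration and are not taken from the papers.

```python
import numpy as np

def hierfavg(w0, grad, clusters, sizes, eta, kappa1, kappa2, B):
    """Sketch of HierFAVG with B cloud rounds.

    w0       : initial global model (1-D array)
    grad     : grad(i, w) -> stochastic gradient of client i's loss at w
    clusters : list of lists; clusters[l] holds the client ids of edge l
    sizes    : dict mapping client id -> |D_i|
    eta      : learning rate
    kappa1   : local SGD steps between edge aggregations
    kappa2   : edge aggregations between cloud aggregations
    """
    w_cloud = np.asarray(w0, dtype=float).copy()
    for _ in range(B):                               # cloud rounds
        edge_models, edge_sizes = [], []
        for clients in clusters:                     # each edge cluster
            # The cloud model is broadcast to every client in the cluster.
            w_clients = {i: w_cloud.copy() for i in clients}
            w_edge = w_cloud.copy()
            total = float(sum(sizes[i] for i in clients))
            for _ in range(kappa2):                  # edge rounds
                for i in clients:
                    for _ in range(kappa1):          # local SGD steps
                        w_clients[i] = w_clients[i] - eta * grad(i, w_clients[i])
                # Edge aggregation: size-weighted average of client models.
                w_edge = sum(sizes[i] * w_clients[i] for i in clients) / total
                # The edge broadcasts its aggregate back to its clients.
                for i in clients:
                    w_clients[i] = w_edge.copy()
            edge_models.append(w_edge)
            edge_sizes.append(total)
        # Cloud aggregation: size-weighted average of edge models.
        w_cloud = sum(s * m for s, m in zip(edge_sizes, edge_models)) / sum(edge_sizes)
    return w_cloud
```

In a real deployment the edge clusters run in parallel; the sequential loop above computes the same aggregates and is written this way only for readability.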

3. Mathematical Formulation and Model Drift

Between successive cloud aggregations, the local client update is simply SGD. Edge and cloud aggregation events replace all participating models with the respective weighted averages. This structure induces hierarchical model drift owing to updates on increasingly stale local parameters.

The full update at a cloud synchronization, i.e., whenever $k \bmod (\kappa_1 \kappa_2) = 0$, can be expressed as

$$w(k+1) = \frac{1}{|\mathcal{D}|} \sum_{\ell=1}^{L} |\mathcal{D}^\ell| \left( \frac{1}{|\mathcal{D}^\ell|} \sum_{i \in \mathcal{C}^\ell} |\mathcal{D}_i^\ell| \left( w_i^\ell(k) - \eta \nabla F_i^\ell(w_i^\ell(k)) \right) \right).$$

The hierarchical approach introduces two main sources of divergence:

  • Client–edge divergence $\delta$: measures statistical heterogeneity between a client and its edge-level aggregate.
  • Edge–cloud divergence $\Delta$: measures statistical heterogeneity between an edge and the global average.

These divergences, along with $\kappa_1$ and $\kappa_2$, determine how far models can drift from the global optimum during intermediate local and edge phases (Liu et al., 2019, Yang et al., 2022).
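As a rough illustration of these quantities, in the spirit of the bounded-divergence assumptions of Liu et al. (2019) but with hypothetical helper names, the sketch below estimates client–edge and edge–cloud gradient divergences at a common model as norms of gradient differences.

```python
import numpy as np

def divergences(grads, clusters, sizes):
    """Estimate client-edge divergences (delta_i^l) and edge-cloud
    divergences (Delta^l) from per-client gradients evaluated at a
    common model w.

    grads    : dict client id -> gradient of that client's local loss at w
    clusters : list of lists of client ids, one list per edge server
    sizes    : dict client id -> |D_i|
    """
    edge_grads, edge_sizes = [], []
    for clients in clusters:
        total = float(sum(sizes[i] for i in clients))
        edge_grads.append(sum(sizes[i] * grads[i] for i in clients) / total)
        edge_sizes.append(total)
    g_cloud = sum(s * g for s, g in zip(edge_sizes, edge_grads)) / sum(edge_sizes)

    # delta_i^l : gap between a client gradient and its edge-level aggregate.
    delta = {i: float(np.linalg.norm(grads[i] - edge_grads[l]))
             for l, clients in enumerate(clusters) for i in clients}
    # Delta^l : gap between an edge aggregate and the global (cloud) gradient.
    Delta = [float(np.linalg.norm(g - g_cloud)) for g in edge_grads]
    return delta, Delta
```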

4. Convergence Results and Theoretical Analysis

HierFAVG provides rigorous convergence guarantees under standard smoothness and bounded-divergence assumptions. For convex objectives, the deviation from the optimum is bounded by (Liu et al., 2019):

$$F(w(K)) - F(w^*) \leq \frac{1}{B\left[\eta\phi - \rho\, G_c / (\kappa_1 \kappa_2 \epsilon^2)\right]},$$

where $G_c$ quantifies hierarchical drift:

$$G_c = h(\kappa_1 \kappa_2, \Delta, \eta) + \frac{1}{2}\left(\kappa_2^2 + \kappa_2 - 1\right)(\kappa_1 + 1)\, h(\kappa_1, \delta, \eta),$$

with $h(n, \delta, \eta) = \frac{\delta}{\beta}\left[(1+\eta\beta)^n - 1\right] - \eta\delta n$, where $\beta$ is the smoothness constant.
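Read as a design tool, these expressions can be evaluated numerically to compare the drift term for candidate $(\kappa_1, \kappa_2)$ pairs. The short sketch below simply transcribes $h$ and $G_c$ as written above; the constants $\delta$, $\Delta$, $\beta$, and $\eta$ must be supplied or estimated, and the values used here are purely illustrative.

```python
def h(n, delta, eta, beta):
    """Drift function h(n, delta, eta) from the convergence bound
    (beta is the smoothness constant)."""
    return (delta / beta) * ((1 + eta * beta) ** n - 1) - eta * delta * n

def G_c(kappa1, kappa2, delta, Delta, eta, beta):
    """Hierarchical drift term G_c appearing in the HierFAVG bound."""
    return (h(kappa1 * kappa2, Delta, eta, beta)
            + 0.5 * (kappa2 ** 2 + kappa2 - 1) * (kappa1 + 1)
              * h(kappa1, delta, eta, beta))

# Example: drift grows with both aggregation periods (illustrative constants).
print(G_c(kappa1=5, kappa2=2, delta=0.1, Delta=0.05, eta=0.01, beta=1.0))
print(G_c(kappa1=20, kappa2=4, delta=0.1, Delta=0.05, eta=0.01, beta=1.0))
```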

In the non-convex setting, HierFAVG ensures that the time-averaged squared gradient norm converges to a neighborhood of zero, with the radius determined by cumulative drift.

A closely related independent analysis confirms a sublinear $O(1/T)$ rate (for total iterations $T$), plus a heterogeneity penalty $j(\tau, \pi)$ that increases with the length of local and edge intervals (Yang et al., 2022).

5. Communication–Computation Trade-Offs and Parameter Tuning

HierFAVG’s hierarchical structure allows explicit tuning of $\kappa_1$ and $\kappa_2$ (or $\tau, \pi$) to balance localized computation against the overhead of communication at each tier:

  • Smaller $\kappa_1$ (frequent edge aggregation): Reduces model drift and accelerates convergence but increases communication between clients and edges.
  • Larger $\kappa_2$ (infrequent cloud aggregation): Reduces global communication but can increase overall drift, especially if edge datasets are non-IID.

When edge-level data are IID ($\Delta \approx 0$), increasing $\kappa_2$ does not degrade convergence, enabling significant communication savings.

Design guidelines recommend small $\kappa_1$ where client–edge communication is cheap, large $\kappa_2$ when edge–cloud communication is expensive and edges see homogeneous data, and adaptive tuning of both in heterogeneous and resource-constrained environments. Diminishing step sizes $\eta$ are advised for asymptotic optimality in convex problems (Liu et al., 2019, Yang et al., 2022).
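One way to reason about this trade-off quantitatively is to count communication events per client and per edge for a fixed budget of $K$ local SGD steps. The helper below is a simple accounting sketch; the per-message cost parameters are illustrative assumptions, not values from the papers.

```python
def communication_counts(K, kappa1, kappa2):
    """Uplink aggregation events over K total local SGD steps: each client
    uploads to its edge every kappa1 steps, and each edge uploads to the
    cloud every kappa1 * kappa2 steps."""
    edge_uploads_per_client = K // kappa1
    cloud_uploads_per_edge = K // (kappa1 * kappa2)
    return edge_uploads_per_client, cloud_uploads_per_edge

def communication_cost(K, kappa1, kappa2, n_clients, n_edges,
                       cost_client_edge=1.0, cost_edge_cloud=10.0):
    """Total communication cost under illustrative per-message costs
    (edge-cloud links are assumed 10x more expensive here)."""
    e, c = communication_counts(K, kappa1, kappa2)
    return n_clients * e * cost_client_edge + n_edges * c * cost_edge_cloud

# Larger kappa2 cuts expensive cloud traffic; client-edge traffic is unchanged.
print(communication_cost(K=600, kappa1=6, kappa2=2, n_clients=50, n_edges=5))
print(communication_cost(K=600, kappa1=6, kappa2=10, n_clients=50, n_edges=5))
```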

6. Empirical Performance and Limitations

Empirical experiments using CNNs on MNIST and CIFAR-10, with various non-IID data configurations, confirm the theoretical findings:

  • Training speed: Wall-clock time to a fixed target accuracy is reduced by 3–4× compared to cloud-only FedAvg for both datasets (e.g., from $\approx 1.1\times 10^5$ s to $\approx 4.9\times 10^4$ s for CIFAR-10 at $70\%$ accuracy).
  • Device energy: End-device energy consumption is cut by up to $30\%$ for MNIST due to reduced uplink bandwidth and more efficient computation schedules.
  • Parameter sensitivity: Reducing $\kappa_1$ accelerates convergence; increasing $\kappa_2$ is safe only when edge-level data are homogeneous.
  • Heterogeneity sensitivity: Model drift grows and convergence slows dramatically if $\kappa_1$ or $\kappa_2$ are too large in highly non-IID cases.

HierFAVG’s main limitations are heightened sensitivity to large aggregation periods in the presence of data heterogeneity and delayed global alignment due to infrequent cloud aggregations. These limitations motivate newer variants, such as HierMo, which layer momentum on top of the HierFAVG baseline for provably tighter convergence bounds (Yang et al., 2022).

7. Extensions and Comparative Remarks

HierFAVG represents the archetype of multi-tier model-averaging in federated learning. Its simplicity enables easy analysis and practical deployment in heterogeneous networks, but leaves open the challenge of mitigating model drift in highly non-IID settings or under infrequent synchronization.

Recent work demonstrates that injecting momentum at either or both the worker and edge tiers (as in HierMo) yields strictly superior convergence rates, especially for deep or nonconvex models, by reducing oscillation and steady-state error due to drift. Optimization of aggregation periods, as in HierOPT, further refines the computation-communication trade-off.

A plausible implication is that the design and tuning of multi-tier FL algorithms in realistic networks should increasingly incorporate drift-mitigation techniques (such as momentum, adaptive periods, or personalized models) as network and data heterogeneity intensifies (Yang et al., 2022, Liu et al., 2019).
