Federated Attentive Message Passing (FedAMP)

Updated 25 April 2026
  • FedAMP is a federated learning strategy that uses pairwise adaptive attention to personalize client models on non-IID data.
  • It alternates between attentive message passing on the server and local proximal updates on client devices to optimize personalized parameters.
  • Empirical evaluations on FMNIST, EMNIST, and CIFAR100 show improved accuracy over traditional federated methods with robust convergence guarantees.

Federated Attentive Message Passing (FedAMP) is a federated learning methodology designed to enable client models to collaborate via adaptive, pairwise attention mechanisms, with the specific goal of improving performance on non-IID data distributions. By leveraging attention-inducing communication between models, FedAMP personalizes learned parameters for each client while maximizing the benefits of inter-client similarity. The approach was introduced as a solution to the persistent challenge of non-IID data in cross-silo federated learning, offering provable convergence, practical robustness, and demonstrably superior empirical results when compared to established methods (Huang et al., 2020).

1. Formal Problem Framework and Objective

Let $K$ denote the number of clients, each indexed by $i = 1, \ldots, K$. Each client $i$ holds:

  • A private dataset $D_i$ sampled from distribution $P_i$ (non-IID over $i$).
  • Local model parameters $w_i \in \mathbb{R}^d$.
  • A loss function $F_i(w_i) := |D_i|^{-1} \sum_{(x,y) \in D_i} \ell(w_i; x, y)$.

The global objective is to learn personalized parameters $\{w_i\}_{i=1}^K$ such that each $w_i$ is near-optimal for its own distribution $P_i$, while still exploiting cross-client similarities. This leads to the following aggregate optimization target:

$$\min_{w_1, \ldots, w_K} \; G(w_1, \ldots, w_K) := \sum_{i=1}^K F_i(w_i) + \lambda \sum_{i < j} A\big(\|w_i - w_j\|^2\big),$$

where $\|\cdot\|$ is the Euclidean norm, $\lambda > 0$ balances personalization/collaboration, and $A : [0, \infty) \to [0, \infty)$ is a concave, increasing penalty with $A(0) = 0$ that induces attention. An example is $A(t) = 1 - e^{-t/\sigma}$.
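As an illustrative sketch (not code from the original paper), the objective can be evaluated directly, assuming the exponential attention function $A(t) = 1 - e^{-t/\sigma}$ and caller-supplied local loss functions:

```python
import numpy as np

def attention_penalty(t, sigma=1.0):
    """Attention-inducing function A(t) = 1 - exp(-t / sigma):
    concave, increasing, and A(0) = 0."""
    return 1.0 - np.exp(-t / sigma)

def fedamp_objective(W, local_losses, lam=1.0, sigma=1.0):
    """G(W) = sum_i F_i(w_i) + lam * sum_{i<j} A(||w_i - w_j||^2).

    W            -- array of shape (K, d): one parameter vector per client.
    local_losses -- list of K callables; local_losses[i](w) returns F_i(w).
    """
    K = W.shape[0]
    data_term = sum(local_losses[i](W[i]) for i in range(K))
    coupling = sum(
        attention_penalty(float(np.sum((W[i] - W[j]) ** 2)), sigma)
        for i in range(K) for j in range(i + 1, K)
    )
    return data_term + lam * coupling
```

The coupling term only penalizes pairs of clients whose models are already close (the concave $A$ saturates for distant pairs), which is what keeps dissimilar clients from being forced together.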

2. FedAMP Algorithmic Structure

FedAMP implements an alternating incremental-proximal optimization on $G$. Each communication round $k = 1, 2, \ldots$ proceeds as follows:

  • Message-Passing / Attention Step (Server-Side):

For each client $i$, compute the attentive aggregate:

$$u_i^k = \sum_{j=1}^K \xi_{i,j} \, w_j^{k-1},$$

where $\xi_{i,j} = \alpha_k A'\big(\|w_i^{k-1} - w_j^{k-1}\|^2\big)$ for $j \neq i$, and $\xi_{i,i} = 1 - \sum_{j \neq i} \xi_{i,j}$. The update can also be regarded as a perturbed gradient step:

$$U^k = W^{k-1} - \frac{\alpha_k}{2} \nabla \mathcal{A}(W^{k-1}), \qquad \mathcal{A}(W) := \sum_{i < j} A\big(\|w_i - w_j\|^2\big).$$

  • Local Proximal Update (Client-Side):

Each client $i$ solves:

$$w_i^k = \arg\min_{w \in \mathbb{R}^d} \; F_i(w) + \frac{\lambda}{2 \alpha_k} \|w - u_i^k\|^2.$$

In practice this is implemented via a small number of local SGD or Adam steps.

The algorithm iterates these two steps for rounds $k = 1, 2, \ldots$ until the communication budget is exhausted. Pseudocode matching the above logic is presented in the original work.

3. Attention Mechanism and Similarity Adaptation

The attention kernel $A'\big(\|w_i - w_j\|^2\big)$ serves as a nonincreasing, nonnegative similarity function:

  • Small $\|w_i - w_j\|$ yields large $\xi_{i,j}$, encouraging strong pairwise collaboration.
  • Large $\|w_i - w_j\|$ yields small $\xi_{i,j}$, limiting influence across dissimilar clients.

A widely used instantiation is the RBF kernel: with $A(t) = 1 - e^{-t/\sigma}$, one has $A'(t) = \frac{1}{\sigma} e^{-t/\sigma}$. Consequently, $\xi_{i,j} \propto e^{-\|w_i - w_j\|^2 / \sigma}$ for $j \neq i$.

The attention coefficients $\xi_{i,j}$ thus implement a form of adaptive, pairwise, non-linear communication, automatically amplifying within-cluster collaboration on non-IID data.
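A minimal sketch of one attention row, again assuming the exponential kernel (the `alpha` and `sigma` values are illustrative):

```python
import numpy as np

def attention_weights(W, i, alpha=0.1, sigma=1.0):
    """Attention row for client i with A(t) = 1 - exp(-t / sigma):
    xi_{i,j} = alpha * A'(||w_i - w_j||^2) decays with parameter distance."""
    K = W.shape[0]
    xi = np.array([
        alpha * np.exp(-np.sum((W[i] - W[j]) ** 2) / sigma) / sigma
        if j != i else 0.0
        for j in range(K)
    ])
    xi[i] = 1.0 - xi.sum()   # remaining mass is self-attention
    return xi
```

A nearby peer receives a weight close to `alpha / sigma`, while a distant one is exponentially suppressed, which is exactly the clustering behavior described above.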

4. Theoretical Convergence Analysis

FedAMP offers convergence guarantees for both convex and nonconvex formulations of the objective $G$, under bounded-gradient assumptions:

  • Convex Case: If each $F_i$ and the coupling term $\mathcal{A}$ are convex, and gradients are uniformly bounded ($\|\nabla F_i\| \le B$), then

$$\min_{1 \le t \le k} G(W^t) - G(W^\star) \le \frac{\|W^0 - W^\star\|^2 + B^2 \sum_{t=1}^k \alpha_t^2}{2 \sum_{t=1}^k \alpha_t}.$$

Diminishing $\alpha_k$ ensuring $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$ yields $G(W^k) \to G(W^\star)$.

  • Smooth, Nonconvex Case: If each $F_i$ and $\mathcal{A}$ are $L$-smooth, and $\alpha_k \le 1/L$, then

$$\min_{1 \le t \le k} \|\nabla G(W^t)\|^2 \le \frac{2 \big(G(W^0) - \inf G\big)}{\sum_{t=1}^k \alpha_t}.$$

With diminishing $\alpha_k$ as above, any limit point of $\{W^k\}$ is stationary.

The two-step update is interpretable as a proximal-gradient procedure, and analysis leverages established incremental/proximal methods.
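The step-size conditions used in both cases can be checked numerically for the common diminishing schedule $\alpha_k = \alpha_0 / k$ (an illustrative choice, not mandated by the analysis):

```python
import numpy as np

# Diminishing schedule alpha_k = a0 / k: partial sums of alpha_k grow without
# bound (like a0 * ln K), while partial sums of alpha_k^2 stay bounded
# (approaching a0^2 * pi^2 / 6), so both conditions are satisfied.
a0 = 0.5
K = 10**6
alphas = a0 / np.arange(1, K + 1)
print(alphas.sum())         # diverges as K grows
print((alphas ** 2).sum())  # bounded
```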

5. Heuristic Extension for Deep Neural Models

For high-dimensional parameterizations ($d$ large, as in DNNs), Euclidean distances become less meaningful. The heuristic variant "HeurFedAMP" alters the computation of $\xi_{i,j}$:

  • Set the self-attention weight $\xi_{i,i}$ to a fixed hyperparameter in $(0, 1)$.
  • For $j \neq i$,

$$\xi_{i,j} = (1 - \xi_{i,i}) \cdot \frac{e^{\sigma \cos(w_i^{k-1}, w_j^{k-1})}}{\sum_{h \neq i} e^{\sigma \cos(w_i^{k-1}, w_h^{k-1})}},$$

where $\cos(\cdot, \cdot)$ is cosine similarity and $\sigma$ a temperature parameter.

This maintains $\sum_j \xi_{i,j} = 1$ while biasing attention based on angular rather than Euclidean closeness, empirically improving performance on DNNs.
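A sketch of this cosine-softmax weighting; the `self_weight` and `sigma` values are illustrative assumptions, not values from the original work:

```python
import numpy as np

def heur_attention_weights(W, i, self_weight=0.5, sigma=10.0):
    """HeurFedAMP-style row (sketch): fix xi_{i,i} = self_weight and split
    the remaining mass over j != i by a softmax of scaled cosine
    similarities, so the row still sums to 1."""
    cos = np.array([
        float(W[i] @ W[j]) / (np.linalg.norm(W[i]) * np.linalg.norm(W[j]))
        for j in range(len(W))
    ])
    scores = np.exp(sigma * cos)
    scores[i] = 0.0                      # exclude self from the softmax
    xi = (1.0 - self_weight) * scores / scores.sum()
    xi[i] = self_weight
    return xi
```

Because cosine similarity ignores parameter magnitude, two high-dimensional models pointing in similar directions collaborate strongly even when their Euclidean distance is large.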

6. Empirical Evaluation and Results

FedAMP and its heuristic extension are evaluated on MNIST, FMNIST, EMNIST, and CIFAR100 datasets with client partitions covering IID, pathological non-IID (each client only 2 labels), and practical non-IID (clients in 3 clusters with unbalanced samples).

Best mean testing accuracy (BMTA, averaged over clients) under the practical non-IID scenario:

Dataset     FedAvg   FedProx   APFL    FedAMP   HeurFedAMP
FMNIST      79.5%    78.7%     84.1%   91.0%    91.4%
EMNIST      N/A      N/A       N/A     81.2%    81.5%
CIFAR100    35.2%    37.3%     N/A     N/A      53.3%

Pairwise attention heatmaps (EMNIST, clients 0–61) reveal that attention coefficients form clear blocks, aligning with ground-truth clusters—FedAMP automatically learns and exploits such latent structure.

7. Practical Guidelines and Implications

Key operational insights include:

  • Data regime sensitivity: On IID data, FedAMP reduces to global averaging (like FedAvg); on clustered non-IID data, it amplifies within-cluster collaboration.
  • Hyperparameters: $\lambda$ balances personalization/collaboration; the step size $\alpha_k$ should start moderate and then decay; the attention-kernel scale $\sigma$ must be tuned; the self-attention weight $\xi_{i,i}$ in HeurFedAMP is set to a fixed constant.
  • Robustness: The proximal step requires only the client models available in a given round, so client drops are naturally handled; attention down-weights corrupted or noisy clients, conferring resilience to label noise.
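The drop-handling observation can be sketched by restricting message passing to clients that reported in the current round. This masking scheme is an assumption for illustration, not the paper's pseudocode:

```python
import numpy as np

def masked_message_passing(W, available, alpha=0.1, sigma=1.0):
    """Sketch (illustrative assumption): compute the attentive aggregate
    using only clients that reported this round; a dropped client
    contributes no message and keeps its last model."""
    K = W.shape[0]
    U = W.copy()
    for i in range(K):
        if not available[i]:
            continue                     # dropped client: no update this round
        xi = np.zeros(K)
        for j in range(K):
            if j != i and available[j]:
                d2 = np.sum((W[i] - W[j]) ** 2)
                xi[j] = alpha * np.exp(-d2 / sigma) / sigma
        xi[i] = 1.0 - xi.sum()
        U[i] = xi @ W
    return U
```

Since each row of attention weights is renormalized through the self-weight, excluding absent clients leaves the update well-defined without any special-case handling.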

FedAMP constitutes a principled, provably convergent, and empirically validated framework for federated learning with adaptive, pairwise, non-linear collaboration, with particular effectiveness on non-IID problems and high-dimensional models (Huang et al., 2020).

References (1)

  1. Huang, Y., Chu, L., Zhou, Z., Wang, L., Liu, J., Pei, J., & Zhang, Y. (2020). Personalized Cross-Silo Federated Learning on Non-IID Data.
