
All2All Training: Personalized Collaboration

Updated 29 January 2026
  • The All2All training strategy trains personalized models collaboratively: each agent exchanges gradients selectively so as to minimize its own individual loss in heterogeneous environments.
  • It employs a gradient-filtering update rule built from a row-stochastic mixing matrix, so that each agent aggregates contributions only from sufficiently similar agents, controlling bias while reducing variance.
  • Empirical evaluation on Bernoulli mean estimation shows stronger variance reduction and lower asymptotic error than both purely local SGD and centralized aggregation.

The All2All training strategy, also referred to as the "all-for-all" paradigm, is a collaborative protocol for personalized federated and distributed learning, where each agent in a network maintains its own local model and seeks to minimize its individual loss via information and gradient exchanges with all other agents. The method is grounded in stochastic optimization, incorporates information-theoretic lower bounds on sample efficiency, and is characterized by its gradient-filtering update rule that enables rigorous control of bias–variance trade-offs in the presence of inter-agent data and task heterogeneity (Even et al., 2022).

1. Formal Setup and Objectives

Consider $N$ agents, indexed $i = 1, \dots, N$. Each agent $i$ has access to local data $D_i$ over a sample space $\Xi$ and aims to minimize its local objective

$$f_i(x) = \mathbb{E}_{\xi \sim D_i}[\ell(x, \xi)], \qquad x \in \mathbb{R}^d,$$

with $\ell : \mathbb{R}^d \times \Xi \to \mathbb{R}$ not necessarily smooth. In the All2All scenario, each agent $i$ maintains its own parameter vector $x_i \in \mathbb{R}^d$. The collective optimization goal is to drive the average personalized loss

$$F(x) = \frac{1}{N}\sum_{i=1}^N f_i(x_i)$$

to be small in parallel, using only local stochastic gradient oracles and peer-to-peer communication. Two query models are supported:

  • Synchronous oracle: at each round $k$, every agent $j$ simultaneously samples data, computes a stochastic gradient $g_j^k(x_j^k)$, and broadcasts it.
  • Asynchronous oracle: at each iteration, a single randomly selected agent updates and broadcasts its gradient.
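The two query models can be sketched as schedules specifying which agents broadcast at each round (a minimal illustration; `N` and the round counts are placeholder values, not from the paper):

```python
import random

N = 4  # number of agents (placeholder)

def synchronous_rounds(num_rounds):
    """Synchronous oracle: every agent samples and broadcasts each round."""
    for _ in range(num_rounds):
        yield list(range(N))  # S^k contains all agents

def asynchronous_rounds(num_rounds, rng=random.Random(0)):
    """Asynchronous oracle: one uniformly random agent updates per iteration."""
    for _ in range(num_rounds):
        yield [rng.randrange(N)]  # S^k is a single random agent

sync = list(synchronous_rounds(2))
asyn = list(asynchronous_rounds(3))
```

Either schedule plugs into the same update rule; only the broadcasting set $S^k$ changes.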

2. Information-Theoretic Lower Bounds

The sample complexity $T$ is the total number of stochastic gradients queried (over all agents). For fixed accuracy $\varepsilon > 0$, the following lower bounds apply under standard assumptions:

  • Convex, possibly non-smooth case: $T \gtrsim \frac{r^2 B^2}{\varepsilon^2} \sum_{i=1}^N \frac{1}{N_i^{(2b)}}$.
  • Strongly convex, $L$-smooth case with variance $\sigma^2$: $T \gtrsim \frac{r^2 \sigma^2}{\varepsilon} \sum_{i=1}^N \frac{1}{N_i^{(2b)}}$, where $r$ bounds $\|x^0\|$ and the $\|x_i^*\|$, $B^2$ (resp. $\sigma^2$) bounds the squared gradient norms (resp. the gradient noise), and

$$N_i^{(c)} = \sum_{j=1}^N \mathbf{1}\{b_{ij} \le c\}$$

counts the number of agents that are "$c$-close" to $i$ in loss bias $b_{ij} := f_i(x_j^*) - f_i(x_i^*)$. These bounds reveal that the benefit of collaboration is inherently limited by task similarity and network topology.
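The neighbor counts $N_i^{(c)}$ follow directly from a bias matrix (the 3-agent `bias` values below are hypothetical, chosen only for illustration):

```python
def neighbor_counts(bias, c):
    """N_i^{(c)}: number of agents j with b_ij <= c (includes j = i, since b_ii = 0)."""
    return [sum(1 for b_ij in row if b_ij <= c) for row in bias]

# Hypothetical bias matrix b_ij = f_i(x_j*) - f_i(x_i*); the diagonal is zero.
bias = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
    [0.7, 0.9, 0.0],
]
counts = neighbor_counts(bias, c=0.2)  # -> [2, 2, 1]
```

Agents 0 and 1 count each other as close; agent 2 is close only to itself, so under this tolerance collaboration cannot help it.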

3. The Gradient-Filtering All2All Algorithm

All2All utilizes gradient mixing with a symmetric, positive semi-definite weight matrix $W = \Lambda\Lambda^\top$, where $\Lambda = (\lambda_{ij})$ is row-stochastic ($\sum_j \lambda_{ij} = 1$ for all $i$). The core update rule in the synchronous setting is

$$x_i^{k+1} = x_i^k - \eta \sum_{j \in S^k} W_{ij}\, g_j^k(x_j^k),$$

where $S^k$ indexes the set of agents broadcasting at iteration $k$. Equivalently, in mixed coordinates,

$$y^{k+1} = y^k - \eta\, \Lambda^\top G^k(\Lambda y^k), \qquad x^k = \Lambda y^k,$$

so that $x^{k+1} = x^k - \eta\, \Lambda\Lambda^\top G^k(x^k) = x^k - \eta\, W G^k(x^k)$. The filtering operator at agent $i$ is

$$F_i(g^k) = \sum_{j=1}^N W_{ij}\, g_j^k = \sum_{j=1}^N \sum_{\ell=1}^N \lambda_{i\ell}\, \lambda_{j\ell}\, g_j^k.$$

The design of $\Lambda$ leverages estimates of inter-agent bias: for target precision $\varepsilon$, the entries are set as

$$\lambda_{ij} = \frac{\mathbf{1}\{b_{ij} \le 2\varepsilon\}}{N_i^{(2\varepsilon)}}, \qquad N_i^{(2\varepsilon)} = \sum_j \mathbf{1}\{b_{ij} \le 2\varepsilon\},$$

ensuring each agent only aggregates gradients from peers within bias tolerance $2\varepsilon$. The following pseudocode summarizes the protocol:

Input: stepsize η > 0, mixing matrix W = ΛΛᵀ
Initialize x_i⁰ = x⁰ for i = 1…N
for k = 0, 1, 2, … do
    Agents j ∈ Sᵏ compute g_jᵏ = stochastic gradient at x_jᵏ
    Broadcast g_jᵏ to all agents i with Wᵢⱼ > 0
    For each agent i:
        x_iᵏ⁺¹ = x_iᵏ − η · Σ_{j∈Sᵏ} Wᵢⱼ · g_jᵏ
end for
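A minimal runnable sketch of the synchronous protocol on a toy problem (the quadratic losses, step size, and bias values are illustrative choices, not from the paper; exact gradients stand in for the stochastic oracle, and $|m_i - m_j|$ stands in for the bias estimates $b_{ij}$):

```python
def make_mixing(bias, eps):
    """Row-stochastic Lambda: uniform weights over agents within bias 2*eps."""
    N = len(bias)
    lam = [[1.0 if bias[i][j] <= 2 * eps else 0.0 for j in range(N)] for i in range(N)]
    return [[v / sum(row) for v in row] for row in lam]  # b_ii = 0, so sum(row) >= 1

def gram(lam):
    """W = Lambda @ Lambda^T (symmetric PSD by construction)."""
    N = len(lam)
    return [[sum(lam[i][l] * lam[j][l] for l in range(N)) for j in range(N)]
            for i in range(N)]

def all2all_step(x, grads, W, eta):
    """Synchronous update: x_i <- x_i - eta * sum_j W_ij g_j."""
    return [x[i] - eta * sum(W[i][j] * grads[j] for j in range(len(x)))
            for i in range(len(x))]

# Toy objectives f_i(x) = (x - m_i)^2 / 2 with exact gradients g_i = x_i - m_i.
m = [0.0, 0.1, 1.0]                          # agents 0 and 1 have similar targets
bias = [[abs(a - b) for b in m] for a in m]  # stand-in for b_ij
W = gram(make_mixing(bias, eps=0.1))
x = [0.5, 0.5, 0.5]
for _ in range(200):
    x = all2all_step(x, [x[i] - m[i] for i in range(3)], W, eta=0.3)
```

Agents 0 and 1, mutually within the $2\varepsilon$ tolerance, converge to the average of their optima (0.05), while agent 2 filters everyone else out and tracks its own optimum (1.0).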

4. Convergence Analysis and Bias–Variance Trade-Off

For the local excess loss $F_i^k = f_i(x_i^k) - f_i(x_i^*)$ and its average $F^k = \frac{1}{N}\sum_{i=1}^N F_i^k$, the All2All strategy achieves:

  • Convex, bounded-variance regime (Assumptions N.2, B.2): for step size $\eta = \sqrt{2ND^2 / (KB^2 \sum_{i,j} \lambda_{ij}^2)}$,

$$\mathbb{E}\!\left[\frac{1}{K}\sum_{k=0}^{K-1} F^k\right] \le \sqrt{\frac{2B^2 \sum_{i} \|x_i^0 - x_i^\Lambda\|^2}{NK} \sum_{i,j} \lambda_{ij}^2} + \frac{1}{N}\sum_{i,j} \lambda_{ij}\, b_{ij}$$

The first term decays as $O(1/\sqrt{K})$ ("statistical variance"); the second is the residual bias from heterogeneity.

  • Strongly convex, $L$-smooth regime (Assumption N.1): for $\eta \approx 1/L$, linear convergence holds:

$$\mathbb{E}[F^K] \le F^0 e^{-K/(2\kappa)} + \frac{L \sigma^2}{\mu^2 N}\sum_{i,j} \lambda_{ij}^2 + \frac{1}{N}\sum_{i,j} \lambda_{ij}\, b_{ij}, \qquad \kappa = L/\mu.$$

    Choosing $\lambda_{ij}$ uniform over close agents yields bias–variance terms that match the lower bounds up to constants.
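For a given bias matrix and the uniform-over-close-agents choice of $\lambda$, the two trade-off terms, $\sum_{i,j}\lambda_{ij}^2$ (variance) and $\frac{1}{N}\sum_{i,j}\lambda_{ij} b_{ij}$ (bias), can be evaluated directly (a sketch with made-up numbers, not from the paper):

```python
def tradeoff_terms(bias, eps):
    """Variance term sum_ij lambda_ij^2 and bias term (1/N) sum_ij lambda_ij b_ij
    for uniform row-stochastic weights over agents within bias 2*eps."""
    N = len(bias)
    var_term, bias_term = 0.0, 0.0
    for i in range(N):
        close = [j for j in range(N) if bias[i][j] <= 2 * eps]
        lam = 1.0 / len(close)                   # uniform weights on the close set
        var_term += len(close) * lam ** 2        # contributes 1 / N_i^{(2 eps)}
        bias_term += sum(lam * bias[i][j] for j in close) / N
    return var_term, bias_term

bias = [[0.0, 0.1], [0.1, 0.0]]
loose = tradeoff_terms(bias, eps=0.1)    # both agents collaborate
strict = tradeoff_terms(bias, eps=0.01)  # each agent is alone
```

Here collaborating halves the variance term (1.0 vs 2.0) at the cost of a small bias term (0.05 vs 0.0), mirroring the trade-off in the bounds above.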

5. Key Assumptions and Weight Matrix Design

The All2All paradigm critically relies on several technical assumptions:

  • Bias/Similarity between tasks:
    • (B.1) $f_i(x_j^*) - f_i(x_i^*) \le b_{ij}$
    • (B.2) $\|\nabla f_i(x) - \nabla f_j(x)\|^2 \le \tilde{b}_{ij}$; for strongly convex/Lipschitz functions, $\tilde{b}_{ij} = (b_{ij}/r)^2$.
  • Regularity/noise:
    • (N.1) $\mu$-strongly convex, $L$-smooth, with $\mathbb{E}\|g_i^k(x) - \nabla f_i(x)\|^2 \le \sigma^2$.
    • (N.2) Convex, with a subgradient oracle satisfying $\mathbb{E}\|g_i^k(x)\|^2 \le B^2$.
  • Weight matrix properties: $W = \Lambda\Lambda^\top$, with $\Lambda$ row-stochastic, possibly time-varying/adaptive.
  • Domain boundedness: $\|x^k - x_i^*\| \le D$ (or appropriate relaxations for unbounded cases).

The design of the mixing matrix $\Lambda$ and the resulting $W$ directly implements gradient filtering, limiting contributions to agents with sufficiently similar data/tasks.
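The weight-matrix properties can be checked mechanically: since $W = \Lambda\Lambda^\top$, symmetry is immediate, and $x^\top W x = \|\Lambda^\top x\|^2 \ge 0$ gives positive semi-definiteness. A quick sketch with a toy $\Lambda$ (values illustrative):

```python
def gram(lam):
    """W = Lambda @ Lambda^T, symmetric PSD for any real Lambda."""
    N = len(lam)
    return [[sum(lam[i][l] * lam[j][l] for l in range(len(lam[0]))) for j in range(N)]
            for i in range(N)]

def quad_form(W, x):
    """x^T W x; equals ||Lambda^T x||^2 when W = Lambda Lambda^T."""
    return sum(x[i] * W[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))

lam = [[0.5, 0.5, 0.0],   # row-stochastic: each row sums to 1
       [0.5, 0.5, 0.0],
       [0.0, 0.0, 1.0]]
W = gram(lam)
symmetric = all(W[i][j] == W[j][i] for i in range(3) for j in range(3))
psd_sample = quad_form(W, [1.0, -2.0, 0.5])  # nonnegative, as the identity predicts
```

Note that row-stochasticity of $\Lambda$ is a design constraint on the filter weights; symmetry and positive semi-definiteness of $W$ hold for any real $\Lambda$.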

6. Empirical Evaluation

All2All has been evaluated on collaborative mean estimation for Bernoulli parameters. The experimental configuration uses $N = 100$ agents, each with $p_i \sim \mathrm{Unif}[0,1]$, locally drawing $10^3$ samples from $\mathrm{Bernoulli}(p_i)$. The method is compared to:

  • No-collaboration: Each agent runs local SGD.
  • Single global model: Centralized SGD/FedAvg across agents.
  • All2All: Filtering method with optimal $\lambda$.

Performance is measured by the average local mean squared error, $\frac{1}{N}\sum_i (x_i^t - p_i)^2 / 2$. All2All demonstrates superior non-asymptotic decay, achieving variance reduction through collaboration and a lower asymptotic MSE via effective bias–variance management. The protocol remains robust to moderate noise in the bias estimates $b_{ij}$.
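A scaled-down variant of this experiment fits in a few lines (smaller $N$ and sample count than the paper's setup; clustered parameters and the similarity threshold are illustrative simplifications, and the true parameter gap stands in for the estimated bias $b_{ij}$):

```python
import random

rng = random.Random(0)
# Four clusters of five agents sharing a Bernoulli parameter: a simplification of
# the paper's p_i ~ Unif[0,1] setup, chosen so intra-cluster bias is zero.
p = [0.2] * 5 + [0.4] * 5 + [0.6] * 5 + [0.8] * 5
N, n = len(p), 200
local = [sum(rng.random() < p[i] for _ in range(n)) / n for i in range(N)]

def mse(est):
    """Average local mean squared error, (1/N) sum_i (est_i - p_i)^2 / 2."""
    return sum((est[i] - p[i]) ** 2 for i in range(N)) / (2 * N)

# All2All-style filtering: average the estimates of agents with |p_i - p_j| <= 0.1.
filtered = []
for i in range(N):
    close = [j for j in range(N) if abs(p[i] - p[j]) <= 0.1]
    filtered.append(sum(local[j] for j in close) / len(close))

# Centralized baseline: a single global mean shared by all agents.
centralized = [sum(local) / N] * N
```

With the clusters separated by more than the threshold, filtering averages exactly within each cluster, cutting estimator variance roughly fivefold with no added bias, while the single global model pays a large bias for every off-center cluster.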

7. Synthesis and Context

All2All is a general, sample-optimal collaborative training protocol for non-identically distributed environments, premised on personalized aggregation of stochastically filtered gradients. By constructing the mixing matrix to filter updates from “close” agents, All2All provably attains the optimal bias–variance trade-off, with empirical results confirming practical performance and robustness. This provides a unifying perspective connecting classical decentralized optimization, collaborative learning, and personalized federated learning, situating All2All as an effective strategy under standard regularity and similarity assumptions (Even et al., 2022).

References (1)

Even, M., Massoulié, L., and Scaman, K. (2022). On Sample Optimality in Personalized Collaborative and Federated Learning. Advances in Neural Information Processing Systems (NeurIPS).
