
All2All Training: Personalized Collaboration

Updated 29 January 2026
  • The All2All training strategy trains personalized models collaboratively: each agent exchanges gradients selectively so as to minimize its own individual loss in heterogeneous environments.
  • It employs a gradient-filtering update rule built from a row-stochastic mixing matrix, so that each agent aggregates contributions only from sufficiently similar agents, controlling bias while reducing variance.
  • Empirical evaluation on Bernoulli mean estimation shows stronger variance reduction and lower asymptotic error than both purely local SGD and centralized aggregation.

The All2All training strategy, also referred to as the "all-for-all" paradigm, is a collaborative protocol for personalized federated and distributed learning, where each agent in a network maintains its own local model and seeks to minimize its individual loss via information and gradient exchanges with all other agents. The method is grounded in stochastic optimization, incorporates information-theoretic lower bounds on sample efficiency, and is characterized by its gradient-filtering update rule that enables rigorous control of bias–variance trade-offs in the presence of inter-agent data and task heterogeneity (Even et al., 2022).

1. Formal Setup and Objectives

Consider $N$ agents, indexed $i = 1, \dots, N$. Each agent $i$ has access to local data $D_i$ over a sample space $\Xi$ and aims to minimize its local objective

$$f_i(x) = \mathbb{E}_{\xi \sim D_i}[\ell(x, \xi)], \qquad x \in \mathbb{R}^d,$$

with $\ell : \mathbb{R}^d \times \Xi \to \mathbb{R}$ not necessarily smooth. In the All2All scenario, each agent $i$ maintains its own parameter vector $x_i \in \mathbb{R}^d$. The collective optimization goal is to drive the average personalized loss

$$F(x) = \frac{1}{N}\sum_{i=1}^N f_i(x_i)$$

to be small in parallel, using only local stochastic gradient oracles and peer-to-peer communication. Two query models are supported:

  • Synchronous oracle: at each round $k$, every agent $j$ simultaneously samples data, computes a stochastic gradient $g_j^k(x_j^k)$, and broadcasts it.
  • Asynchronous oracle: at each iteration, a single randomly selected agent updates and broadcasts its gradient.
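The two query models can be sketched as schedules specifying which agents broadcast at each round (a minimal illustration; `N` and the round counts are placeholder values, not from the paper):

```python
import random

N = 4  # number of agents (placeholder)

def synchronous_rounds(num_rounds):
    """Synchronous oracle: every agent samples and broadcasts each round."""
    for _ in range(num_rounds):
        yield list(range(N))  # S^k contains all agents

def asynchronous_rounds(num_rounds, rng=random.Random(0)):
    """Asynchronous oracle: one uniformly random agent updates per iteration."""
    for _ in range(num_rounds):
        yield [rng.randrange(N)]  # S^k is a single random agent

sync = list(synchronous_rounds(2))
asyn = list(asynchronous_rounds(3))
```

Either schedule plugs into the same update rule; only the broadcasting set $S^k$ changes.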

2. Information-Theoretic Lower Bounds

The sample complexity $T$ is the total number of stochastic gradients queried (over all agents). For fixed accuracy $\varepsilon > 0$, the following lower bounds apply under standard assumptions:

  • Convex, possibly non-smooth case: $T \gtrsim \frac{r^2 B^2}{\varepsilon^2} \sum_{i=1}^N \frac{1}{N_i^{(2b)}}$.
  • Strongly convex, $L$-smooth case with variance $\sigma^2$: $T \gtrsim \frac{r^2 \sigma^2}{\varepsilon} \sum_{i=1}^N \frac{1}{N_i^{(2b)}}$, where $r$ bounds $\|x^0\|$ and the $\|x_i^*\|$, $B^2$ (resp. $\sigma^2$) bounds the squared gradient norms (resp. the gradient noise), and

$$N_i^{(c)} = \sum_{j=1}^N \mathbf{1}\{b_{ij} \le c\}$$

counts the number of agents that are "$c$-close" to $i$ in loss bias $b_{ij} := f_i(x_j^*) - f_i(x_i^*)$. These bounds reveal that the benefit of collaboration is inherently limited by task similarity and network topology.
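The neighbor counts $N_i^{(c)}$ follow directly from a bias matrix (the 3-agent `bias` values below are hypothetical, chosen only for illustration):

```python
def neighbor_counts(bias, c):
    """N_i^{(c)}: number of agents j with b_ij <= c (includes j = i, since b_ii = 0)."""
    return [sum(1 for b_ij in row if b_ij <= c) for row in bias]

# Hypothetical bias matrix b_ij = f_i(x_j*) - f_i(x_i*); the diagonal is zero.
bias = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
    [0.7, 0.9, 0.0],
]
counts = neighbor_counts(bias, c=0.2)  # -> [2, 2, 1]
```

Agents 0 and 1 count each other as close; agent 2 is close only to itself, so under this tolerance collaboration cannot help it.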

3. The Gradient-Filtering All2All Algorithm

All2All utilizes gradient mixing with a symmetric, positive semi-definite weight matrix $W = \Lambda\Lambda^\top$, where $\Lambda = (\lambda_{ij})$ is row-stochastic ($\sum_j \lambda_{ij} = 1$ for all $i$). The core update rule in the synchronous setting is

$$x_i^{k+1} = x_i^k - \eta \sum_{j \in S^k} W_{ij}\, g_j^k(x_j^k),$$

where $S^k$ indexes the set of agents broadcasting at iteration $k$. Equivalently, in mixed coordinates,

$$y^{k+1} = y^k - \eta\, \Lambda^\top G^k(\Lambda y^k), \qquad x^k = \Lambda y^k,$$

so that $x^{k+1} = x^k - \eta\, \Lambda\Lambda^\top G^k(x^k) = x^k - \eta\, W G^k(x^k)$. The filtering operator at agent $i$ is

$$F_i(g^k) = \sum_{j=1}^N W_{ij}\, g_j^k = \sum_{j=1}^N \sum_{\ell=1}^N \lambda_{i\ell}\, \lambda_{j\ell}\, g_j^k.$$

The design of $\Lambda$ leverages estimates of inter-agent bias: for target precision $\varepsilon$, the entries are set as

$$\lambda_{ij} = \frac{\mathbf{1}\{b_{ij} \le 2\varepsilon\}}{N_i^{(2\varepsilon)}}, \qquad N_i^{(2\varepsilon)} = \sum_j \mathbf{1}\{b_{ij} \le 2\varepsilon\},$$

ensuring each agent only aggregates gradients from peers within bias tolerance $2\varepsilon$. The following pseudocode summarizes the protocol:

Input: stepsize η > 0, mixing matrix W = ΛΛᵀ
Initialize x_i⁰ = x⁰ for i = 1…N
for k = 0, 1, 2, … do
    Agents j ∈ Sᵏ compute g_jᵏ = stochastic gradient at x_jᵏ
    Broadcast g_jᵏ to all agents i with Wᵢⱼ > 0
    For each agent i:
        x_iᵏ⁺¹ = x_iᵏ − η · Σ_{j∈Sᵏ} Wᵢⱼ · g_jᵏ
end for
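A minimal runnable sketch of the synchronous protocol on a toy problem (the quadratic losses, step size, and bias values are illustrative choices, not from the paper; exact gradients stand in for the stochastic oracle, and $|m_i - m_j|$ stands in for the bias estimates $b_{ij}$):

```python
def make_mixing(bias, eps):
    """Row-stochastic Lambda: uniform weights over agents within bias 2*eps."""
    N = len(bias)
    lam = [[1.0 if bias[i][j] <= 2 * eps else 0.0 for j in range(N)] for i in range(N)]
    return [[v / sum(row) for v in row] for row in lam]  # b_ii = 0, so sum(row) >= 1

def gram(lam):
    """W = Lambda @ Lambda^T (symmetric PSD by construction)."""
    N = len(lam)
    return [[sum(lam[i][l] * lam[j][l] for l in range(N)) for j in range(N)]
            for i in range(N)]

def all2all_step(x, grads, W, eta):
    """Synchronous update: x_i <- x_i - eta * sum_j W_ij g_j."""
    return [x[i] - eta * sum(W[i][j] * grads[j] for j in range(len(x)))
            for i in range(len(x))]

# Toy objectives f_i(x) = (x - m_i)^2 / 2 with exact gradients g_i = x_i - m_i.
m = [0.0, 0.1, 1.0]                          # agents 0 and 1 have similar targets
bias = [[abs(a - b) for b in m] for a in m]  # stand-in for b_ij
W = gram(make_mixing(bias, eps=0.1))
x = [0.5, 0.5, 0.5]
for _ in range(200):
    x = all2all_step(x, [x[i] - m[i] for i in range(3)], W, eta=0.3)
```

Agents 0 and 1, mutually within the $2\varepsilon$ tolerance, converge to the average of their optima (0.05), while agent 2 filters everyone else out and tracks its own optimum (1.0).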

4. Convergence Analysis and Bias–Variance Trade-Off

For the local excess loss $F_i^k = f_i(x_i^k) - f_i(x_i^*)$ and its average $F^k = \frac{1}{N}\sum_{i=1}^N F_i^k$, the All2All strategy achieves:

  • Convex, bounded-variance regime (Assumptions N.2, B.2): for step size $\eta = \sqrt{2ND^2 / (KB^2 \sum_{i,j} \lambda_{ij}^2)}$,

$$\mathbb{E}\!\left[\frac{1}{K}\sum_{k=0}^{K-1} F^k\right] \le \sqrt{\frac{2B^2 \sum_{i} \|x_i^0 - x_i^\Lambda\|^2}{NK} \sum_{i,j} \lambda_{ij}^2} + \frac{1}{N}\sum_{i,j} \lambda_{ij}\, b_{ij}$$

The first term decays as $O(1/\sqrt{K})$ ("statistical variance"); the second is the residual bias from heterogeneity.

  • Strongly convex, $L$-smooth regime (Assumption N.1): for $\eta \approx 1/L$, linear convergence holds:

$$\mathbb{E}[F^K] \le F^0 e^{-K/(2\kappa)} + \frac{L \sigma^2}{\mu^2 N}\sum_{i,j} \lambda_{ij}^2 + \frac{1}{N}\sum_{i,j} \lambda_{ij}\, b_{ij}, \qquad \kappa = L/\mu.$$

    Choosing $\lambda_{ij}$ uniform over close agents yields bias–variance terms that match the lower bounds up to constants.
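For a given bias matrix and the uniform-over-close-agents choice of $\lambda$, the two trade-off terms, $\sum_{i,j}\lambda_{ij}^2$ (variance) and $\frac{1}{N}\sum_{i,j}\lambda_{ij} b_{ij}$ (bias), can be evaluated directly (a sketch with made-up numbers, not from the paper):

```python
def tradeoff_terms(bias, eps):
    """Variance term sum_ij lambda_ij^2 and bias term (1/N) sum_ij lambda_ij b_ij
    for uniform row-stochastic weights over agents within bias 2*eps."""
    N = len(bias)
    var_term, bias_term = 0.0, 0.0
    for i in range(N):
        close = [j for j in range(N) if bias[i][j] <= 2 * eps]
        lam = 1.0 / len(close)                   # uniform weights on the close set
        var_term += len(close) * lam ** 2        # contributes 1 / N_i^{(2 eps)}
        bias_term += sum(lam * bias[i][j] for j in close) / N
    return var_term, bias_term

bias = [[0.0, 0.1], [0.1, 0.0]]
loose = tradeoff_terms(bias, eps=0.1)    # both agents collaborate
strict = tradeoff_terms(bias, eps=0.01)  # each agent is alone
```

Here collaborating halves the variance term (1.0 vs 2.0) at the cost of a small bias term (0.05 vs 0.0), mirroring the trade-off in the bounds above.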

5. Key Assumptions and Weight Matrix Design

The All2All paradigm critically relies on several technical assumptions:

  • Bias/Similarity between tasks:
    • (B.1) $f_i(x_j^*) - f_i(x_i^*) \le b_{ij}$
    • (B.2) $\|\nabla f_i(x) - \nabla f_j(x)\|^2 \le \tilde{b}_{ij}$; for strongly convex/Lipschitz functions, $\tilde{b}_{ij} = (b_{ij}/r)^2$.
  • Regularity/noise:
    • (N.1) $\mu$-strongly convex, $L$-smooth, with $\mathbb{E}\|g_i^k(x) - \nabla f_i(x)\|^2 \le \sigma^2$.
    • (N.2) Convex, with a subgradient oracle satisfying $\mathbb{E}\|g_i^k(x)\|^2 \le B^2$.
  • Weight matrix properties: $W = \Lambda\Lambda^\top$, with $\Lambda$ row-stochastic, possibly time-varying/adaptive.
  • Domain boundedness: $\|x^k - x_i^*\| \le D$ (or appropriate relaxations for unbounded cases).

The design of the mixing matrix $\Lambda$ and the resulting $W$ directly implements gradient filtering, limiting contributions to agents with sufficiently similar data/tasks.
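The weight-matrix properties can be checked mechanically: since $W = \Lambda\Lambda^\top$, symmetry is immediate, and $x^\top W x = \|\Lambda^\top x\|^2 \ge 0$ gives positive semi-definiteness. A quick sketch with a toy $\Lambda$ (values illustrative):

```python
def gram(lam):
    """W = Lambda @ Lambda^T, symmetric PSD for any real Lambda."""
    N = len(lam)
    return [[sum(lam[i][l] * lam[j][l] for l in range(len(lam[0]))) for j in range(N)]
            for i in range(N)]

def quad_form(W, x):
    """x^T W x; equals ||Lambda^T x||^2 when W = Lambda Lambda^T."""
    return sum(x[i] * W[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))

lam = [[0.5, 0.5, 0.0],   # row-stochastic: each row sums to 1
       [0.5, 0.5, 0.0],
       [0.0, 0.0, 1.0]]
W = gram(lam)
symmetric = all(W[i][j] == W[j][i] for i in range(3) for j in range(3))
psd_sample = quad_form(W, [1.0, -2.0, 0.5])  # nonnegative, as the identity predicts
```

Note that row-stochasticity of $\Lambda$ is a design constraint on the filter weights; symmetry and positive semi-definiteness of $W$ hold for any real $\Lambda$.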

6. Empirical Evaluation

All2All has been evaluated on collaborative mean estimation for Bernoulli parameters. The experimental configuration uses $N = 100$ agents, each with $p_i \sim \mathrm{Unif}[0,1]$, locally drawing $10^3$ samples from $\mathrm{Bernoulli}(p_i)$. The method is compared to:

  • No-collaboration: Each agent runs local SGD.
  • Single global model: Centralized SGD/FedAvg across agents.
  • All2All: Filtering method with optimal $\lambda$.

Performance is measured by the average local mean squared error, $\frac{1}{N}\sum_i (x_i^t - p_i)^2 / 2$. All2All demonstrates superior non-asymptotic decay, achieving variance reduction through collaboration and a lower asymptotic MSE via effective bias–variance management. The protocol remains robust to moderate noise in the bias estimates $b_{ij}$.
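A scaled-down variant of this experiment fits in a few lines (smaller $N$ and sample count than the paper's setup; clustered parameters and the similarity threshold are illustrative simplifications, and the true parameter gap stands in for the estimated bias $b_{ij}$):

```python
import random

rng = random.Random(0)
# Four clusters of five agents sharing a Bernoulli parameter: a simplification of
# the paper's p_i ~ Unif[0,1] setup, chosen so intra-cluster bias is zero.
p = [0.2] * 5 + [0.4] * 5 + [0.6] * 5 + [0.8] * 5
N, n = len(p), 200
local = [sum(rng.random() < p[i] for _ in range(n)) / n for i in range(N)]

def mse(est):
    """Average local mean squared error, (1/N) sum_i (est_i - p_i)^2 / 2."""
    return sum((est[i] - p[i]) ** 2 for i in range(N)) / (2 * N)

# All2All-style filtering: average the estimates of agents with |p_i - p_j| <= 0.1.
filtered = []
for i in range(N):
    close = [j for j in range(N) if abs(p[i] - p[j]) <= 0.1]
    filtered.append(sum(local[j] for j in close) / len(close))

# Centralized baseline: a single global mean shared by all agents.
centralized = [sum(local) / N] * N
```

With the clusters separated by more than the threshold, filtering averages exactly within each cluster, cutting estimator variance roughly fivefold with no added bias, while the single global model pays a large bias for every off-center cluster.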

7. Synthesis and Context

All2All is a general, sample-optimal collaborative training protocol for non-identically distributed environments, premised on personalized aggregation of stochastically filtered gradients. By constructing the mixing matrix to filter updates from “close” agents, All2All provably attains the optimal bias–variance trade-off, with empirical results confirming practical performance and robustness. This provides a unifying perspective connecting classical decentralized optimization, collaborative learning, and personalized federated learning, situating All2All as an effective strategy under standard regularity and similarity assumptions (Even et al., 2022).

References (1)

Even, M., Massoulié, L., and Scaman, K. (2022). On Sample Optimality in Personalized Collaborative and Federated Learning. Advances in Neural Information Processing Systems (NeurIPS).
