
Stochastic First-Order Method with Client Sampling

Updated 3 February 2026
  • The paper demonstrates that adaptive client sampling reduces estimator variance and accelerates convergence by using probabilistic selection of clients.
  • The paper leverages online mirror descent and importance sampling strategies to adaptively tune client weights for improved scalability in heterogeneous environments.
  • The paper provides rigorous convergence analysis and empirical evidence showing reduced communication and computation costs in federated and decentralized settings.

A stochastic first-order method with client sampling refers to a class of optimization algorithms where, at each iteration, a subset of clients (e.g., data holders in federated or decentralized environments) is selected according to some probabilistic rule, and stochastic updates such as gradients or local model differences are collected only from these clients. This paradigm is foundational in federated learning (FL), distributed optimization, and large-scale decentralized machine learning, offering substantial reductions in communication and computation costs compared to full participation. Methodological advances in client sampling have demonstrated both theoretical and empirical acceleration of convergence, improved scalability, and enhanced robustness to data heterogeneity.

1. Mathematical Formulation of Stochastic First-Order Methods with Client Sampling

Consider a distributed optimization problem across $M$ clients, each with a local objective $F_m(w) = \varphi(w; \mathcal{D}_m)$ and nonnegative weight $\lambda_m$ (commonly $\lambda_m = n_m / \sum_i n_i$ for local dataset sizes $n_m$). The global objective is:

$$F(w) = \sum_{m=1}^M \lambda_m F_m(w).$$

In the standard setting, at communication round $t$:

  • The server broadcasts the model $w^t$ to a randomly selected subset $S^t$ of $K$ clients, with sampling probabilities $p^t \in \Delta_{M-1}$ (the probability simplex).
  • Each chosen client $m \in S^t$ returns a local stochastic first-order update $g_m^t$ (e.g., a gradient, local difference, or local solver output).
  • The server constructs an unbiased estimate:

$$g^t = \frac{1}{K} \sum_{m \in S^t} \frac{\lambda_m}{p_m^t} g_m^t,$$

  • and performs the update $w^{t+1} = w^t - \mu^t g^t$.

Distinct variants include local training ($R > 1$ local steps), decentralized updates, and double-level sampling (clients and data points) (Zhao et al., 2021, Tyurin et al., 2022, Grudzień et al., 2022, Chen et al., 27 Jan 2026, Chen et al., 2020).
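A minimal sketch of one such server round, using toy quadratic client objectives and uniform weights; all names and numerical values here are illustrative assumptions, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

M, d, K = 10, 5, 3                      # clients, model dimension, cohort size
lam = np.full(M, 1.0 / M)               # client weights lambda_m (uniform here)
p = np.full(M, 1.0 / M)                 # sampling probabilities p^t on the simplex

def server_round(w, local_grad, lr=0.1):
    """One communication round: sample K clients i.i.d. with replacement
    from p (one common scheme), build the unbiased estimator g^t, and
    take a gradient step w <- w - lr * g^t."""
    S = rng.choice(M, size=K, replace=True, p=p)   # sampled cohort S^t
    g = np.zeros_like(w)
    for m in S:
        g += (lam[m] / p[m]) * local_grad(m, w)    # importance-weighted update
    g /= K                                          # E[g] = sum_m lam_m grad F_m(w)
    return w - lr * g

# toy local objectives F_m(w) = 0.5 * ||w - c_m||^2 (hypothetical example)
centers = rng.normal(size=(M, d))
local_grad = lambda m, w: w - centers[m]

w = np.zeros(d)
for _ in range(200):
    w = server_round(w, local_grad)
# w approaches the minimizer of F, the average of the centers
```

Each of the $K$ draws satisfies $\mathbb{E}[(\lambda_m / p_m) g_m^t] = \sum_m \lambda_m g_m^t$, so $g^t$ is unbiased for the full gradient regardless of the choice of $p$.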

2. Client Sampling Strategies and Their Variance Properties

Client sampling directly controls the variance of the global update estimator and thus the convergence rate. The variance under independent sampling with probabilities $p_m^t$ is:

$$\operatorname{Var}[g^t] = \frac{1}{K} \left( \sum_m \frac{\lambda_m^2 \|g_m^t\|^2}{p_m^t} - \Big\|\sum_m \lambda_m g_m^t\Big\|^2 \right).$$

Minimizing $\sum_m a_m^t / p_m^t$ with $a_m^t = \lambda_m^2 \|g_m^t\|^2$ over $p^t \in \Delta_{M-1}$ reduces the estimator variance, which is especially important in heterogeneous and non-IID regimes.

Adaptive and optimal sampling strategies, including importance sampling ($p_m \propto \lambda_m \|g_m\|$) and online learning-based methods, achieve provable reductions over uniform schemes (Chen et al., 2020, Zhao et al., 2021). In decentralized settings, client sampling can be uniform to ensure only $O(1)$ first-order oracle or communication cost per round (Chen et al., 27 Jan 2026).
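The importance-sampling rule can be checked numerically: for the $p$-dependent part of the variance above, $p_m \propto \lambda_m \|g_m\|$ never does worse than uniform sampling (the gradient norms below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 8, 1
lam = np.full(M, 1.0 / M)
g_norms = rng.uniform(0.1, 10.0, size=M)   # hypothetical per-client gradient norms

a = lam**2 * g_norms**2                    # a_m = lambda_m^2 ||g_m||^2

def variance_term(p):
    # the p-dependent part of Var[g^t]: (1/K) * sum_m a_m / p_m
    return (a / p).sum() / K

p_unif = np.full(M, 1.0 / M)
p_opt = lam * g_norms / (lam * g_norms).sum()   # p_m proportional to lambda_m ||g_m||

assert variance_term(p_opt) <= variance_term(p_unif)
```

By Cauchy–Schwarz, $\sum_m a_m / p_m$ over the simplex is minimized exactly at $p_m \propto \sqrt{a_m} = \lambda_m \|g_m\|$, so the assertion holds for any choice of norms.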

3. Adaptive Algorithms: Online Mirror Descent and Bandit Feedback

The adaptive client sampling framework is naturally cast as an online convex optimization problem with bandit (partial-information) feedback. At round $t$:

  1. The server chooses $p^t$.
  2. An adversarial loss $\ell_t(p) = (1/K) \sum_{m=1}^M a_m^t / p_m$ is realized, but only $g_m^t$ for $m \in S^t$ is observed.
  3. The server uses unbiased estimators (e.g., via the membership counts $N\{m \in S^t\}$) to update $p^t$ by online stochastic mirror descent (OSMD) with negative entropy or other mirror maps.

The step:

$$\tilde{p}_m^{t+1} = \hat{p}_m^t \cdot \exp\left\{ \eta_t \, \frac{a_m^t}{K^2 (\hat{p}_m^t)^3} \, N\{m \in S^t\} \right\},$$

followed by Bregman projection onto the simplex with lower bounds, yields adaptivity to time-varying client importance and data heterogeneity (Zhao et al., 2021). Adaptive OSMD-based sampling consistently outperforms uniform and earlier bandit-based methods both theoretically and empirically.
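The multiplicative step and projection above can be sketched as follows; the bisection-based KL projection and all numerical choices are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def kl_project(p_tilde, lb):
    """Bregman (KL / negative-entropy) projection of p_tilde onto
    {p : p_m >= lb, sum(p) = 1}. The solution has the closed form
    p_m = max(lb, c * p_tilde_m); we find the scalar c by bisection.
    Feasibility requires lb * len(p_tilde) <= 1."""
    lo, hi = 0.0, 1.0 / p_tilde.sum()   # at hi the capped sum is already >= 1
    for _ in range(100):
        c = 0.5 * (lo + hi)
        if np.maximum(lb, c * p_tilde).sum() > 1.0:
            hi = c
        else:
            lo = c
    return np.maximum(lb, 0.5 * (lo + hi) * p_tilde)

def osmd_step(p_hat, a, sampled, K, eta, lb):
    """One OSMD update of the sampling distribution: the multiplicative
    step from the display above, then projection onto the lower-bounded
    simplex."""
    ind = np.zeros_like(p_hat)
    ind[list(sampled)] = 1.0            # N{m in S^t}: sampled-client indicators
    p_tilde = p_hat * np.exp(eta * a * ind / (K**2 * p_hat**3))
    return kl_project(p_tilde, lb)
```

The lower bound `lb` keeps every client's probability bounded away from zero, which controls the variance of the bandit loss estimates.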

4. Convergence Analysis and Complexity Results

The impact of client sampling appears explicitly in convergence rates through both the variance parameters and induced smoothness measures:

  • SGD / FedAvg: For strongly convex or nonconvex objectives, adaptive sampling achieves faster convergence by reducing the dynamic regret (the gap between the cumulative per-round losses under the adaptive sampler and under a comparator sequence of samplers):

$$\text{D-Regret}_T(q^{1:T}) = \mathbb{E} \left[\sum_{t=1}^T \ell_t(\hat{p}^t) - \sum_{t=1}^T \ell_t(q^t)\right],$$

with rigorous bounds incorporating total variation and per-round gradients (Zhao et al., 2021).

  • PAGE-type methods: The total gradient complexity for reaching $\mathbb{E}[\|\nabla f(\hat{x})\|^2] \leq \epsilon$ with arbitrary sampling is

$$N = \Theta\left(n + \frac{\Delta_0}{\epsilon} \|S\| \left[L_- + \sqrt{\frac{n}{\|S\|} \left((A-B) L_{+,w}^2 + B L_{\pm,w}^2 \right)} \right]\right),$$

where $(A, B, w)$ capture the sampling variance structure (Tyurin et al., 2022).

  • Decentralized nonconvex: In settings with Lipschitz nonsmooth objectives, single-client-per-round sampling achieves optimal $O(\delta^{-1} \epsilon^{-3})$ sample and computation complexity, and nearly optimal $O(\gamma^{-1/2} \delta^{-1} \epsilon^{-3})$ communication complexity, where $\gamma$ is the spectral gap of the network (Chen et al., 27 Jan 2026).

These results demonstrate that nonuniform, variance-minimizing sampling yields strictly improved rates over uniform random sampling, especially as heterogeneity grows.

5. Algorithmic Realizations and Practical Guidance

A variety of stochastic first-order algorithms can be directly equipped with nonuniform, adaptive client sampling:

| Method | Key Sampling Strategy | Complexity Impact / Empirical Gain |
| --- | --- | --- |
| OSMD-based (FL) | Online bandit mirror descent on client weights | Reduces variance, accelerates FL convergence |
| PAGE (flexible) | Arbitrary unbiased client/data sampling (AB-inequality) | Sharper complexity bounds, optimality in $\epsilon$ |
| 5GCS (primal-dual) | Uniform cohort sampling (partial participation) | Enables acceleration, preserves optimal rates |
| Importance sampling | $p_m \propto w_m \|\Delta_m\|$ (closed form) | Matches full participation with minimal rounds |
| DOCS (decentralized) | Single sampled client per step, fast gossip | $O(1)$ SFO per step, communication-optimal |

In FL, importance sampling and adaptive bandit schemes can be implemented efficiently via secure aggregation for privacy and statelessness. Aggregated client update norms suffice for decentralized calculation of the probabilities; individual $\|\Delta_i\|$ values are not exposed (Chen et al., 2020). Adaptive methods are robust to the lower-bound parameter $\alpha$ used in the simplex projection (Zhao et al., 2021).

In decentralized settings, client sampling and multi-consensus (e.g., fast Chebyshev-accelerated gossip) ensure that consensus error remains controlled even if only one client updates per round (Chen et al., 27 Jan 2026).
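As a toy illustration of multi-consensus, the sketch below runs repeated rounds of plain gossip on a ring graph; Chebyshev acceleration is omitted, and the mixing matrix and sizes are illustrative assumptions. Repeated mixing preserves the network average while shrinking the consensus error geometrically:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
# symmetric doubly stochastic mixing matrix for a ring graph (hypothetical)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = 0.25
    W[i, (i - 1) % n] = 0.25

x = rng.normal(size=(n, 3))              # per-client iterates (rows)
mean_before = x.mean(axis=0)

def consensus_error(x):
    """Distance of the per-client iterates from their common average."""
    return np.linalg.norm(x - x.mean(axis=0))

e0 = consensus_error(x)
for _ in range(20):                      # multi-consensus: repeated mixing rounds
    x = W @ x                            # doubly stochastic: average preserved
e1 = consensus_error(x)
```

Because $W$ is doubly stochastic, `x.mean(axis=0)` is unchanged by each round, while the disagreement contracts by the second-largest eigenvalue modulus of $W$ per round; accelerated (Chebyshev) gossip improves that contraction factor.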

6. Advanced Extensions: Local Training, Double Sampling, and Heterogeneous Systems

Recent advances demonstrate that accelerated local training (multiple local steps per round) can be made compatible with partial client participation. The 5GCS method (5th-generation client sampling) introduces a primal–dual scheme supporting both local training and uniform client sampling, achieving the accelerated communication complexity $\mathcal{O}\left(\left(M/\tau + \sqrt{(M/\tau)(L/\mu)}\right) \log(1/\varepsilon)\right)$ for strongly convex and smooth FL objectives (Grudzień et al., 2022). Theoretical analysis leverages a bias-corrected estimator via the client-sampling operator, with carefully tuned Lyapunov functions to control error propagation from partial participation.

Double-level sampling, as in PAGE-type or AB-inequality frameworks, allows arbitrary compositions of client and within-client data subsampling, with the complexity analysis factoring each layer explicitly. Practical optimization involves balancing the inter-client variance (sample more clients if high) against intra-client variance (sample more data per client if high), with importance weights reflecting the dominant source (Tyurin et al., 2022).
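The guidance above on balancing the two variance sources can be illustrated with a small Monte-Carlo sketch. The two-level Gaussian model, the budget split, and all parameters are toy assumptions, not the paper's setup; the point is only that, when inter-client variance dominates, spending a fixed sample budget on more clients beats spending it on more data per client:

```python
import numpy as np

rng = np.random.default_rng(3)

def estimator_var(K, b, sigma_between, sigma_within, trials=2000):
    """Monte-Carlo variance of a two-level mean estimator: sample K clients
    (with replacement), then b data points per client, and average all K*b
    samples (toy model of double-level sampling)."""
    M = 50
    client_means = rng.normal(0.0, sigma_between, size=M)
    ests = []
    for _ in range(trials):
        S = rng.integers(0, M, size=K)
        samples = client_means[S][:, None] + rng.normal(0, sigma_within, size=(K, b))
        ests.append(samples.mean())
    return np.var(ests)

budget = 32  # fixed per-round budget K * b
# high inter-client variance: more clients, fewer points each, wins
v_many_clients = estimator_var(K=16, b=2, sigma_between=3.0, sigma_within=0.3)
v_few_clients = estimator_var(K=2, b=16, sigma_between=3.0, sigma_within=0.3)
```

In this model the estimator variance is roughly $\sigma_{\text{between}}^2 / K + \sigma_{\text{within}}^2 / (Kb)$, so the inter-client term is cut only by sampling more clients.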

Heterogeneous solvers (e.g., SVRG, SGD, or custom local solvers) are supported as long as their error contracts meet certain prox-skip bounds (Grudzień et al., 2022).

7. Empirical Performance and Practical Considerations

Empirical results in FL (MNIST, KMNIST, Fashion-MNIST; CIFAR-100; FEMNIST; Shakespeare) and decentralized settings (nonconvex SVM, MLP, and ResNet-18) consistently confirm that adaptive or optimal client sampling achieves:

  • Faster reduction in training loss and dynamic regret per round compared to uniform and older bandit samplers
  • Stability with respect to tuning parameters (e.g., OSMD's $\alpha$), except at extremes (Zhao et al., 2021)
  • Strictly lower communication for a given accuracy relative to full participation, with transmitted bits often cut by $8\times$ (Chen et al., 2020)
  • Preservation of privacy and support for stateless operation through aggregate-only protocols

A plausible implication is that, as client heterogeneity increases, adaptive sampling becomes increasingly critical in mitigating communication/computation bottlenecks without sacrificing convergence speed or privacy.

