
Stochastic First-Order Method with Client Sampling

Updated 3 February 2026
  • The paper demonstrates that adaptive client sampling reduces estimator variance and accelerates convergence by using probabilistic selection of clients.
  • The paper leverages online mirror descent and importance sampling strategies to adaptively tune client weights for improved scalability in heterogeneous environments.
  • The paper provides rigorous convergence analysis and empirical evidence showing reduced communication and computation costs in federated and decentralized settings.

A stochastic first-order method with client sampling refers to a class of optimization algorithms where, at each iteration, a subset of clients (e.g., data holders in federated or decentralized environments) is selected according to some probabilistic rule, and stochastic updates such as gradients or local model differences are collected only from these clients. This paradigm is foundational in federated learning (FL), distributed optimization, and large-scale decentralized machine learning, offering substantial reductions in communication and computation costs compared to full participation. Methodological advances in client sampling have demonstrated both theoretical and empirical acceleration of convergence, improved scalability, and enhanced robustness to data heterogeneity.

1. Mathematical Formulation of Stochastic First-Order Methods with Client Sampling

Consider a distributed optimization problem across $M$ clients, each with a local objective $F_m(w) = \varphi(w; \mathcal{D}_m)$ and nonnegative weight $\lambda_m$ (commonly $\lambda_m = n_m / \sum_i n_i$ for local dataset sizes $n_m$). The global objective is:

$$F(w) = \sum_{m=1}^M \lambda_m F_m(w).$$

In the standard setting, at communication round $t$:

  • The server broadcasts the model $w^t$ to a randomly selected subset $S^t$ of $K$ clients, with sampling probabilities $p^t \in \Delta_{M-1}$ (the probability simplex).
  • Each chosen client $m \in S^t$ returns a local stochastic first-order update $g_m^t$ (e.g., a gradient, local difference, or local solver output).
  • The server constructs an unbiased estimate:

$$g^t = \frac{1}{K} \sum_{m \in S^t} \frac{\lambda_m}{p_m^t} g_m^t,$$

  • and performs the update $w^{t+1} = w^t - \mu^t g^t$.

Distinct variants include local training ($R > 1$ local steps), decentralized updates, and double-level sampling (clients and data points) (Zhao et al., 2021, Tyurin et al., 2022, Grudzień et al., 2022, Chen et al., 27 Jan 2026, Chen et al., 2020).
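A minimal sketch of one such server round, using toy quadratic client objectives and uniform weights; all names and numerical values here are illustrative assumptions, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

M, d, K = 10, 5, 3                      # clients, model dimension, cohort size
lam = np.full(M, 1.0 / M)               # client weights lambda_m (uniform here)
p = np.full(M, 1.0 / M)                 # sampling probabilities p^t on the simplex

def server_round(w, local_grad, lr=0.1):
    """One communication round: sample K clients i.i.d. with replacement
    from p (one common scheme), build the unbiased estimator g^t, and
    take a gradient step w <- w - lr * g^t."""
    S = rng.choice(M, size=K, replace=True, p=p)   # sampled cohort S^t
    g = np.zeros_like(w)
    for m in S:
        g += (lam[m] / p[m]) * local_grad(m, w)    # importance-weighted update
    g /= K                                          # E[g] = sum_m lam_m grad F_m(w)
    return w - lr * g

# toy local objectives F_m(w) = 0.5 * ||w - c_m||^2 (hypothetical example)
centers = rng.normal(size=(M, d))
local_grad = lambda m, w: w - centers[m]

w = np.zeros(d)
for _ in range(200):
    w = server_round(w, local_grad)
# w approaches the minimizer of F, the average of the centers
```

Each of the $K$ draws satisfies $\mathbb{E}[(\lambda_m / p_m) g_m^t] = \sum_m \lambda_m g_m^t$, so $g^t$ is unbiased for the full gradient regardless of the choice of $p$.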

2. Client Sampling Strategies and Their Variance Properties

Client sampling directly controls the variance of the global update estimator and thus the convergence rate. The variance under independent sampling with probabilities $p_m^t$ is:

$$\operatorname{Var}[g^t] = \frac{1}{K} \left( \sum_m \frac{\lambda_m^2 \|g_m^t\|^2}{p_m^t} - \Big\|\sum_m \lambda_m g_m^t\Big\|^2 \right).$$

Minimizing $\sum_m a_m^t / p_m^t$ with $a_m^t = \lambda_m^2 \|g_m^t\|^2$ over $p^t \in \Delta_{M-1}$ reduces the estimator variance, which is especially important in heterogeneous and non-IID regimes.

Adaptive and optimal sampling strategies, including importance sampling ($p_m \propto \lambda_m \|g_m\|$) and online learning-based methods, achieve provable reductions over uniform schemes (Chen et al., 2020, Zhao et al., 2021). In decentralized settings, client sampling can be uniform to ensure only $O(1)$ first-order oracle or communication cost per round (Chen et al., 27 Jan 2026).
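The importance-sampling rule can be checked numerically: for the $p$-dependent part of the variance above, $p_m \propto \lambda_m \|g_m\|$ never does worse than uniform sampling (the gradient norms below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 8, 1
lam = np.full(M, 1.0 / M)
g_norms = rng.uniform(0.1, 10.0, size=M)   # hypothetical per-client gradient norms

a = lam**2 * g_norms**2                    # a_m = lambda_m^2 ||g_m||^2

def variance_term(p):
    # the p-dependent part of Var[g^t]: (1/K) * sum_m a_m / p_m
    return (a / p).sum() / K

p_unif = np.full(M, 1.0 / M)
p_opt = lam * g_norms / (lam * g_norms).sum()   # p_m proportional to lambda_m ||g_m||

assert variance_term(p_opt) <= variance_term(p_unif)
```

By Cauchy–Schwarz, $\sum_m a_m / p_m$ over the simplex is minimized exactly at $p_m \propto \sqrt{a_m} = \lambda_m \|g_m\|$, so the assertion holds for any choice of norms.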

3. Adaptive Algorithms: Online Mirror Descent and Bandit Feedback

The adaptive client sampling framework is naturally cast as an online convex optimization problem with bandit (partial-information) feedback. At round $t$:

  1. The server chooses $p^t$.
  2. An adversarial loss $\ell_t(p) = (1/K) \sum_{m=1}^M a_m^t / p_m$ is realized, but only $g_m^t$ for $m \in S^t$ is observed.
  3. The server uses unbiased estimators (e.g., via the membership counts $N\{m \in S^t\}$) to update $p^t$ by online stochastic mirror descent (OSMD) with negative entropy or other mirror maps.

The step:

$$\tilde{p}_m^{t+1} = \hat{p}_m^t \cdot \exp\left\{ \eta_t \, \frac{a_m^t}{K^2 (\hat{p}_m^t)^3} \, N\{m \in S^t\} \right\},$$

followed by Bregman projection onto the simplex with lower bounds, yields adaptivity to time-varying client importance and data heterogeneity (Zhao et al., 2021). Adaptive OSMD-based sampling consistently outperforms uniform and earlier bandit-based methods both theoretically and empirically.
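The multiplicative step and projection above can be sketched as follows; the bisection-based KL projection and all numerical choices are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def kl_project(p_tilde, lb):
    """Bregman (KL / negative-entropy) projection of p_tilde onto
    {p : p_m >= lb, sum(p) = 1}. The solution has the closed form
    p_m = max(lb, c * p_tilde_m); we find the scalar c by bisection.
    Feasibility requires lb * len(p_tilde) <= 1."""
    lo, hi = 0.0, 1.0 / p_tilde.sum()   # at hi the capped sum is already >= 1
    for _ in range(100):
        c = 0.5 * (lo + hi)
        if np.maximum(lb, c * p_tilde).sum() > 1.0:
            hi = c
        else:
            lo = c
    return np.maximum(lb, 0.5 * (lo + hi) * p_tilde)

def osmd_step(p_hat, a, sampled, K, eta, lb):
    """One OSMD update of the sampling distribution: the multiplicative
    step from the display above, then projection onto the lower-bounded
    simplex."""
    ind = np.zeros_like(p_hat)
    ind[list(sampled)] = 1.0            # N{m in S^t}: sampled-client indicators
    p_tilde = p_hat * np.exp(eta * a * ind / (K**2 * p_hat**3))
    return kl_project(p_tilde, lb)
```

The lower bound `lb` keeps every client's probability bounded away from zero, which controls the variance of the bandit loss estimates.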

4. Convergence Analysis and Complexity Results

The impact of client sampling appears explicitly in convergence rates through both the variance parameters and induced smoothness measures:

  • SGD / FedAvg: For strongly convex or nonconvex objectives, adaptive sampling achieves faster convergence by reducing the dynamic regret (the gap between the cumulative per-round losses under the adaptive sampler and under a comparator sequence of samplers):

$$\text{D-Regret}_T(q^{1:T}) = \mathbb{E} \left[\sum_{t=1}^T \ell_t(\hat{p}^t) - \sum_{t=1}^T \ell_t(q^t)\right],$$

with rigorous bounds incorporating total variation and per-round gradients (Zhao et al., 2021).

  • PAGE-type methods: The total gradient complexity for reaching $\mathbb{E}[\|\nabla f(\hat{x})\|^2] \leq \epsilon$ with arbitrary sampling is

$$N = \Theta\left(n + \frac{\Delta_0}{\epsilon} \|S\| \left[L_- + \sqrt{\frac{n}{\|S\|} \left((A-B) L_{+,w}^2 + B L_{\pm,w}^2 \right)} \right]\right),$$

where $(A, B, w)$ capture the sampling variance structure (Tyurin et al., 2022).

  • Decentralized nonconvex: In settings with Lipschitz nonsmooth objectives, single-client-per-round sampling achieves optimal $O(\delta^{-1} \epsilon^{-3})$ sample and computation complexity, and nearly optimal $O(\gamma^{-1/2} \delta^{-1} \epsilon^{-3})$ communication complexity, where $\gamma$ is the spectral gap of the network (Chen et al., 27 Jan 2026).

These results demonstrate that nonuniform, variance-minimizing sampling yields strictly improved rates over uniform random sampling, especially as heterogeneity grows.

5. Algorithmic Realizations and Practical Guidance

A variety of stochastic first-order algorithms can be directly equipped with nonuniform, adaptive client sampling:

| Method | Key Sampling Strategy | Complexity Impact / Empirical Gain |
| --- | --- | --- |
| OSMD-based (FL) | Online bandit mirror descent on client weights | Reduces variance, accelerates FL convergence |
| PAGE (flexible) | Arbitrary unbiased client/data sampling (AB-inequality) | Sharper complexity bounds, optimality in $\epsilon$ |
| 5GCS (primal-dual) | Uniform cohort sampling (partial participation) | Enables acceleration, preserves optimal rates |
| Importance sampling | $p_m \propto w_m \|\Delta_m\|$ (closed form) | Matches full participation with minimal rounds |
| DOCS (decentralized) | Single sampled client per step, fast gossip | $O(1)$ SFO per step, communication-optimal |

In FL, importance sampling and adaptive bandit schemes can be implemented efficiently via secure aggregation for privacy and statelessness. Aggregated client update norms suffice for decentralized calculation of the probabilities; individual $\|\Delta_i\|$ values are not exposed (Chen et al., 2020). Adaptive methods are robust to the lower-bound parameter $\alpha$ used in the simplex projection (Zhao et al., 2021).

In decentralized settings, client sampling and multi-consensus (e.g., fast Chebyshev-accelerated gossip) ensure that consensus error remains controlled even if only one client updates per round (Chen et al., 27 Jan 2026).
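As a toy illustration of multi-consensus, the sketch below runs repeated rounds of plain gossip on a ring graph; Chebyshev acceleration is omitted, and the mixing matrix and sizes are illustrative assumptions. Repeated mixing preserves the network average while shrinking the consensus error geometrically:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
# symmetric doubly stochastic mixing matrix for a ring graph (hypothetical)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = 0.25
    W[i, (i - 1) % n] = 0.25

x = rng.normal(size=(n, 3))              # per-client iterates (rows)
mean_before = x.mean(axis=0)

def consensus_error(x):
    """Distance of the per-client iterates from their common average."""
    return np.linalg.norm(x - x.mean(axis=0))

e0 = consensus_error(x)
for _ in range(20):                      # multi-consensus: repeated mixing rounds
    x = W @ x                            # doubly stochastic: average preserved
e1 = consensus_error(x)
```

Because $W$ is doubly stochastic, `x.mean(axis=0)` is unchanged by each round, while the disagreement contracts by the second-largest eigenvalue modulus of $W$ per round; accelerated (Chebyshev) gossip improves that contraction factor.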

6. Advanced Extensions: Local Training, Double Sampling, and Heterogeneous Systems

Recent advances demonstrate that accelerated local training (multiple local steps per round) can be made compatible with partial client participation. The 5GCS method (5th-generation client sampling) introduces a primal–dual scheme supporting both local training and uniform client sampling, achieving the accelerated communication complexity $\mathcal{O}\left(\left(M/\tau + \sqrt{(M/\tau)(L/\mu)}\right) \log(1/\varepsilon)\right)$ for strongly convex and smooth FL objectives (Grudzień et al., 2022). Theoretical analysis leverages a bias-corrected estimator via the client-sampling operator, with carefully tuned Lyapunov functions to control error propagation from partial participation.

Double-level sampling, as in PAGE-type or AB-inequality frameworks, allows arbitrary compositions of client and within-client data subsampling, with the complexity analysis factoring each layer explicitly. Practical optimization involves balancing the inter-client variance (sample more clients if high) against intra-client variance (sample more data per client if high), with importance weights reflecting the dominant source (Tyurin et al., 2022).
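The guidance above on balancing the two variance sources can be illustrated with a small Monte-Carlo sketch. The two-level Gaussian model, the budget split, and all parameters are toy assumptions, not the paper's setup; the point is only that, when inter-client variance dominates, spending a fixed sample budget on more clients beats spending it on more data per client:

```python
import numpy as np

rng = np.random.default_rng(3)

def estimator_var(K, b, sigma_between, sigma_within, trials=2000):
    """Monte-Carlo variance of a two-level mean estimator: sample K clients
    (with replacement), then b data points per client, and average all K*b
    samples (toy model of double-level sampling)."""
    M = 50
    client_means = rng.normal(0.0, sigma_between, size=M)
    ests = []
    for _ in range(trials):
        S = rng.integers(0, M, size=K)
        samples = client_means[S][:, None] + rng.normal(0, sigma_within, size=(K, b))
        ests.append(samples.mean())
    return np.var(ests)

budget = 32  # fixed per-round budget K * b
# high inter-client variance: more clients, fewer points each, wins
v_many_clients = estimator_var(K=16, b=2, sigma_between=3.0, sigma_within=0.3)
v_few_clients = estimator_var(K=2, b=16, sigma_between=3.0, sigma_within=0.3)
```

In this model the estimator variance is roughly $\sigma_{\text{between}}^2 / K + \sigma_{\text{within}}^2 / (Kb)$, so the inter-client term is cut only by sampling more clients.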

Heterogeneous solvers (e.g., SVRG, SGD, or custom local solvers) are supported as long as their error contracts meet certain prox-skip bounds (Grudzień et al., 2022).

7. Empirical Performance and Practical Considerations

Empirical results in FL (MNIST, KMNIST, Fashion-MNIST; CIFAR-100; FEMNIST; Shakespeare) and decentralized settings (nonconvex SVM, MLP, and ResNet-18) consistently confirm that adaptive or optimal client sampling achieves:

  • Faster reduction in training loss and dynamic regret per round compared to uniform and older bandit samplers
  • Stability with respect to tuning parameters (e.g., OSMD's $\alpha$), except at extremes (Zhao et al., 2021)
  • Strictly lower communication for a given accuracy relative to full participation, with transmitted bits often cut by $8\times$ (Chen et al., 2020)
  • Preservation of privacy and support for stateless operation through aggregate-only protocols

A plausible implication is that, as client heterogeneity increases, adaptive sampling becomes increasingly critical in mitigating communication/computation bottlenecks without sacrificing convergence speed or privacy.

