
PALSGD: Pseudo-Asynchronous Local SGD

Updated 6 January 2026
  • PALSGD is a distributed optimization algorithm that uses probabilistic pseudo-synchronization to reduce communication delays while maintaining model consistency.
  • It interleaves multiple local SGD steps with randomly triggered exponential-moving-average pulls toward a stale global model, enabling larger local update intervals with minimal accuracy loss.
  • Empirical evaluations on vision and language benchmarks demonstrate significant training speedups over traditional DDP, highlighting its scalability and efficiency.

Pseudo-Asynchronous Local SGD (PALSGD) is an algorithm for distributed stochastic optimization designed to address the communication bottlenecks encountered in large-scale data-parallel deep learning. PALSGD extends Local SGD and Distributed Low-Communication (DiLoCo) schemes by introducing a probabilistic, pseudo-synchronization strategy, allowing for substantial reductions in network communication while maintaining strong model consistency and convergence guarantees. On both vision and language benchmarks, PALSGD achieves substantial training speedups over traditional distributed data parallelism (DDP) with negligible accuracy degradation (Naganuma et al., 25 Apr 2025).

1. Motivation and Background

Distributed Data Parallel (DDP) frameworks speed up the training of deep models by distributing data and computation across multiple workers, each computing gradients on a local mini-batch, then synchronizing model updates via an All-Reduce operation at each step. As scale increases, All-Reduce can consume 20–30% of total training time, acting as a severe scalability bottleneck. Attempts to reduce synchronization frequency by increasing the global batch size can harm generalization capability.

Local SGD partly mitigates this by permitting $H > 1$ local update steps before global model averaging, reducing communication by a factor of $1/H$. However, increasing $H$ too much leads to model divergence and slower convergence. DiLoCo schemes attempt to further balance communication and stability using decoupled inner/outer optimizers, but still rely on synchronous global synchronization and do not eliminate the attendant straggler effects and costs.

PALSGD introduces pseudo-asynchronous synchronization: within each block of $H$ local steps, workers perform random, local, exponentially weighted averaging (pseudo-sync) toward a stale global model. This mechanism reduces the divergence between local and global models, enabling much larger $H$ values and drastically cutting synchronization overhead without sacrificing convergence.

2. Algorithmic Structure and Updates

PALSGD is parameterized by the number of workers $K$, the synchronization interval $H$, the pseudo-sync probability $p \in (0,1)$, the inner optimizer step size $\alpha_t$, and the mixing rate $\eta_t$. Each worker $k$ maintains a local parameter vector $x_k^{(t)}$ and a local copy of the most recent globally synchronized model $x^{(t)}$.

At each iteration $t$:

  • With probability $p$, the worker performs a pseudo-synchronization update:

$$x_k^{(t)} \leftarrow x_k^{(t)} - \frac{\alpha_t \eta_t}{p}\,\bigl(x_k^{(t)} - x^{(t)}\bigr)$$

This step performs exponential moving average mixing toward the stale global model, using no network communication.

  • Otherwise, a standard gradient step is taken on local data using the inner optimizer (InnerOPT, e.g., SGD or AdamW) with step size $\alpha_t/(1-p)$.

Every $H$ steps, a full All-Reduce operation aggregates the worker states, and the globally averaged model is updated using an outer optimizer (e.g., Nesterov SGD):

$$\Delta^{(t)} = \frac{1}{K}\sum_{k} \bigl(x^{(t-1)} - x_k^{(t)}\bigr), \qquad x^{(t+1)} \leftarrow \text{OuterOPT}\bigl(x^{(t)}, \Delta^{(t)}\bigr)$$

Compared to fixed-interval Local SGD, which simply alternates $H$ local steps with an All-Reduce, PALSGD's probabilistic, intra-block pseudo-sync allows for much larger $H$ values by actively reducing local–global model drift between synchronizations.
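
Below is a minimal single-process sketch of these update rules, assuming plain SGD as InnerOPT, simple averaging in place of the paper's Nesterov OuterOPT, a post-sync reset of local models to the synced model, and a synthetic least-squares objective. These are illustrative choices, not details taken from the paper.

```python
# Toy simulation of the PALSGD update rules (not the authors' implementation).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic problem: K workers, each with its own i.i.d. least-squares data split.
K, d, n_per_worker = 4, 10, 256
A = [rng.normal(size=(n_per_worker, d)) for _ in range(K)]
w_true = rng.normal(size=d)
b = [Ak @ w_true + 0.1 * rng.normal(size=n_per_worker) for Ak in A]

H = 32                      # local steps between true All-Reduce syncs
p = 0.1                     # probability of a pseudo-sync step
alpha = 0.01                # inner step size
eta = p / (2 * H * alpha)   # mixing rate so that alpha * eta / p = 1 / (2H)
T = 2000                    # total iterations

x_global = np.zeros(d)                        # last globally synchronized model
x_local = [x_global.copy() for _ in range(K)]

def stochastic_grad(k, x, batch=32):
    """Mini-batch gradient of worker k's least-squares loss at x."""
    idx = rng.integers(0, n_per_worker, size=batch)
    Ak, bk = A[k][idx], b[k][idx]
    return Ak.T @ (Ak @ x - bk) / batch

for t in range(1, T + 1):
    for k in range(K):
        if rng.random() < p:
            # Pseudo-sync: EMA pull toward the *stale* global copy; no communication.
            x_local[k] -= (alpha * eta / p) * (x_local[k] - x_global)
        else:
            # Inner optimizer step on local data, rescaled by 1/(1 - p).
            x_local[k] -= (alpha / (1 - p)) * stochastic_grad(k, x_local[k])

    if t % H == 0:
        # True synchronization: aggregate worker states (a real system would use
        # All-Reduce) and apply the outer update; plain averaging here.
        delta = np.mean([x_global - xk for xk in x_local], axis=0)
        x_global = x_global - delta          # equals the mean of the local models
        x_local = [x_global.copy() for _ in range(K)]

print("distance to w_true:", np.linalg.norm(x_global - w_true))
```

In a real multi-node setup the averaging step would be an actual All-Reduce (e.g., torch.distributed.all_reduce), while the pseudo-sync branch touches only worker-local memory, which is what removes it from the communication budget.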

3. Theoretical Convergence Analysis

PALSGD's convergence guarantees are established under the following assumptions: $L$-smoothness of $f(x,\xi)$, $\mu$-strong convexity, i.i.d. local data, and bounded gradient variance at the optimum. For $0 < p \le 1/2$ and suitable learning and mixing rates, the weighted average iterate

$$\hat{x}_T = \frac{1}{Z_T}\sum_{t=0}^{T-1} w_t\, x^{(t)}$$

attains the guarantee:

$$\mathbb{E}\bigl[F(\hat{x}_T)\bigr] - F(x^*) = O\!\left(\frac{\sigma^2}{\mu K T} + \frac{(\kappa + p^{-1}H)\,\sigma^2}{\mu T^2}\right)$$

where $\kappa = L/\mu$ is the condition number. The first term recovers linear speedup in $K$, matching the rate of fully synchronous parallel SGD, while the second term captures the additional consensus and pseudo-asynchrony error, which grows with both the local drift interval $H$ and the inverse sync probability $p^{-1}$.
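
To see when the linear-speedup term dominates, a direct rearrangement of the two terms in the bound (not an additional result from the paper) gives

$$\frac{(\kappa + p^{-1}H)\,\sigma^2}{\mu T^2} \le \frac{\sigma^2}{\mu K T} \quad\Longleftrightarrow\quad T \ge K\bigl(\kappa + p^{-1}H\bigr).$$

For illustrative values $K=8$, $H=16$, $p=0.1$, and $\kappa=100$, the consensus/pseudo-asynchrony term is dominated once $T \gtrsim 8\,(100 + 160) \approx 2{,}100$ iterations, after which the rate matches that of parallel SGD with a $K$-fold variance reduction.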

The proof combines recursions for the averaged iterate, bounds on the variance contributed by pseudo-sync steps, and consensus-drift recursions, culminating in a weighted telescoping argument.

4. Empirical Evaluation and Results

PALSGD was benchmarked on:

  • ImageNet-1K with ResNet-50 (≈25.6M parameters): image classification
  • TinyStories with GPT-Neo (8M and 125M parameters): language modeling
  • CIFAR-10 with VGG-16: image classification

Experimental setups involved multi-node clusters with varied GPU interconnects and homogeneous hardware. Hyperparameters were carefully tuned per method and task. Representative results include:

| Task | Model | Workers | $H$ | DDP time (relative) | PALSGD time (relative) | Speedup vs DDP | Speedup vs DiLoCo |
|---|---|---|---|---|---|---|---|
| ImageNet-1K | ResNet-50 | 4 | 64 | 100 | 81.6 | +18.4% | +6% |
| TinyStories | GPT-Neo-125M | 8 | 16 | 100 | 75.6 | +24.4% | — |
| TinyStories | GPT-Neo-8M | 8 | 16 | 100 | 78.9 | +21.1% | — |

Training times are normalized so that DDP = 100 for each task.

PALSGD consistently maintained or improved final model accuracy relative to DDP and DiLoCo, with up to 93.75% fewer global synchronizations (for $H=16$). Sensitivity to the local update interval $H$ was mild: in the VGG-16/CIFAR-10 case, increasing $H$ from $32$ to $1024$ caused only 0.058% accuracy degradation.

5. Trade-offs, Practical Recommendations, and Applicability

The essential trade-off in PALSGD is between communication cost (favoring large $H$) and model divergence (favoring small $H$). The pseudo-sync mechanism, controlled by $p$ and $\eta$, enables larger $H$ values with negligible impact on convergence. Recommended hyperparameter choices, based on the experiments:

  • $H$: select the largest value that the network and compute budget allow without harming validation loss (e.g., $H \approx 64$ for vision, $H = 16$–$64$ for language).
  • $p$: small values (0.05–0.1) retain most of the benefit; larger $p$ may slow learning progress.
  • $\eta$: set such that $\alpha\eta/p = O(1/H)$, e.g., $\eta_t = p/(2H\alpha_t)$; a small helper sketch follows this list.
  • Warmup: running DDP for $5$–$10\%$ of training before switching to PALSGD stabilizes optimization.
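
As a small illustration of the $\eta$ rule above (the function name, the guard on $p$, and the numeric example are ours, not from the paper):

```python
def palsgd_mixing_rate(p: float, H: int, alpha_t: float) -> float:
    """Heuristic mixing rate eta_t = p / (2 * H * alpha_t), chosen so that the
    effective pseudo-sync pull alpha_t * eta_t / p equals 1 / (2H), i.e. O(1/H)."""
    if not (0.0 < p <= 0.5):
        raise ValueError("pseudo-sync probability p should lie in (0, 0.5]")
    return p / (2 * H * alpha_t)

# Example: p = 0.1, H = 64, alpha_t = 3e-4  ->  eta_t ≈ 2.6
print(palsgd_mixing_rate(p=0.1, H=64, alpha_t=3e-4))
```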

PALSGD is most effective on high-latency or bandwidth-limited interconnects, for large models with expensive All-Reduce steps, and in homogeneous hardware environments.

6. Limitations and Directions for Extension

Current theoretical guarantees for PALSGD are limited to strongly convex objectives and SGD/averaging optimizers; analysis of nonconvex problems and fully adaptive optimizers (e.g., Adam) remains open. Homogeneous cluster assumptions underlie the present analysis; adaptive mechanisms for heterogeneous or dynamic environments are a candidate direction for future work. Integration of pseudo-sync with gradient compression or decentralized averaging could yield further communication savings.

7. Comparison with Overlap-Local-SGD

Overlap-Local-SGD (Wang et al., 2020) similarly seeks to reduce communication delays by introducing an anchor model and non-blocking All-Reduce operations, with local “pullback” steps toward the anchor model. Both methods aim to maintain model consensus with reduced synchronization, overlap local computation and communication, and recover linear parallel speedup.

Key distinctions:

  • PALSGD uses probabilistic, local exponential-moving-average synchronization events independent of communication (pseudo-sync), while Overlap-Local-SGD employs deterministic, anchor-based averaging.
  • Theoretical analyses differ: Overlap-LS leverages the spectral decay of mixing matrices, while PALSGD's bounds rely on consensus-drift and pseudo-async variance control.
  • Empirical behaviors diverge in non-IID data or high-straggler environments; for Overlap-LS, non-blocking threads can further mitigate latency, while PALSGD's focus is on reducing required global synchronizations.

A plausible implication is that PALSGD and Overlap-Local-SGD are complementary, and hybrid schemes (e.g., pseudo-sync plus decentralized anchors) may offer additive efficiencies in future distributed optimization frameworks.

