Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
(2512.17131v1)
Published 18 Dec 2025 in cs.LG, cs.AI, and stat.ML
Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo's periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing its memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW's) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW's validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
The paper introduces GPA, a generalized primal averaging framework that decouples interpolation parameters to enhance training speed and improve convergence stability.
Empirical results show GPA achieves up to a 24.22% speedup on Llama-160M and significant improvements on vision tasks compared to AdamW and DiLoCo.
GPA reduces memory overhead and hyperparameter tuning complexity, making it a practical optimizer for both single-worker and distributed learning scenarios.
Introduction and Motivation
The efficiency and scalability of pre-training LLMs have become increasingly critical due to resource costs and model size. Optimizer innovations have been central to accelerating training, particularly in distributed or non-distributed contexts for dense models. The DiLoCo algorithm—originally designed for distributed cross-datacenter training—demonstrates surprising efficacy even in single-worker regimes, outperforming strong baselines such as AdamW. However, DiLoCo introduces algorithmic overheads in memory and hyperparameter complexity due to its hierarchical (two-loop) structure and coupling between optimization and communication frequency.
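To make the two-loop structure concrete, the following is a minimal single-worker sketch in PyTorch, assuming an AdamW inner optimizer and an SGD-with-Nesterov-momentum outer step applied to the pseudo-gradient; the names and hyperparameter values (e.g. `inner_steps`, `outer_lr`) are illustrative and not taken from the paper.

```python
import torch

def diloco_single_worker(model, loss_fn, batches, inner_steps=32,
                         inner_lr=3e-4, outer_lr=0.7, outer_momentum=0.9):
    """Illustrative single-worker DiLoCo: an AdamW inner loop plus a
    Nesterov-SGD outer step on the pseudo-gradient (assumed configuration)."""
    inner_opt = torch.optim.AdamW(model.parameters(), lr=inner_lr)
    # Extra parameter buffer: the "outer" (global) copy of the weights.
    outer_params = [p.detach().clone() for p in model.parameters()]
    outer_opt = torch.optim.SGD(outer_params, lr=outer_lr,
                                momentum=outer_momentum, nesterov=True)
    for step, (x, y) in enumerate(batches, start=1):
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()                               # inner loop
        if step % inner_steps == 0:                    # periodic outer step
            with torch.no_grad():
                # Pseudo-gradient: displacement of the inner trajectory.
                for p_out, p_in in zip(outer_params, model.parameters()):
                    p_out.grad = p_out - p_in
            outer_opt.step()
            with torch.no_grad():
                # Restart the inner loop from the updated outer parameters.
                for p_out, p_in in zip(outer_params, model.parameters()):
                    p_in.copy_(p_out)
    return model
```

The separate outer parameter copy, the outer optimizer state, and the additional outer hyperparameters illustrate the memory and tuning overhead that GPA aims to reduce.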
Recent work on Schedule-Free optimization provides improved theoretical guarantees by leveraging uniform iterate averaging, but its reliance on a strict uniform average limits practical tuning and performance. This paper introduces Generalized Primal Averaging (GPA), a flexible primal averaging framework that decouples the averaging coefficients, enabling smooth and efficient model updates at every step. GPA subsumes and generalizes both DiLoCo and Schedule-Free, offering practical and theoretical advantages for LLM training.
Algorithmic Formulation: GPA as a Generalization
GPA stems from the primal averaging formulation of Nesterov's momentum, in which two sequences are maintained: one for gradient computation and another for model evaluation. Its key innovation is to introduce separate interpolation parameters, $\mu_x$ for the evaluation sequence and $\mu_y$ for the gradient sequence.
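The precise update rule is not reproduced here; the following is a plausible sketch in the style of the Schedule-Free/primal-averaging literature, with the base optimizer's step written abstractly as $\mathrm{BaseOpt}$ (the exact placement of $\mu_x$ and $\mu_y$, and whether they vary over time, are assumptions of this sketch):

$$
\begin{aligned}
y_t &= (1-\mu_y)\, z_t + \mu_y\, x_t, \\
z_{t+1} &= z_t + \mathrm{BaseOpt}\!\big(\nabla f(y_t)\big), \\
x_{t+1} &= (1-\mu_x)\, z_{t+1} + \mu_x\, x_t,
\end{aligned}
$$

where $z_t$ is the base optimizer's iterate, $y_t$ is the point at which gradients are taken, and $x_t$ is the smoothed sequence used for evaluation.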
Decoupling $\mu_x$ and $\mu_y$ allows GPA to finely control smoothing and recency in the averaging, unifying and extending the capabilities of DiLoCo (which corresponds to periodic, coupled averaging) and Schedule-Free (which corresponds to uniform averaging without a learning-rate schedule). This generalization removes the two-loop structure, reduces the memory requirement to a single additional parameter buffer, simplifies hyperparameter tuning, and yields more stable training dynamics.
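A minimal PyTorch sketch of this single-buffer structure, assuming the update form sketched above; the class name `GPAWrapper`, the call pattern, and the default values of `mu_x` and `mu_y` are illustrative rather than the paper's reference implementation.

```python
import torch

class GPAWrapper:
    """Illustrative GPA wrapper around any torch.optim optimizer.

    Keeps a single extra buffer x (the smoothed/evaluated parameters)
    alongside the base optimizer's live iterate z (model.parameters()).
    Gradients are taken at y = (1 - mu_y) * z + mu_y * x, and x is
    updated every step as a convex combination of x and the new z.
    """

    def __init__(self, params, base_opt, mu_x=0.95, mu_y=0.9):
        self.params = list(params)            # the live iterate z
        self.base_opt = base_opt
        self.mu_x, self.mu_y = mu_x, mu_y
        # Single additional buffer: the averaged (evaluation) sequence x.
        self.x = [p.detach().clone() for p in self.params]

    @torch.no_grad()
    def interpolate_for_grad(self):
        """Temporarily move the live parameters to y before the backward pass."""
        self._z_backup = [p.detach().clone() for p in self.params]
        for p, x in zip(self.params, self.x):
            p.lerp_(x, self.mu_y)             # p <- (1 - mu_y) * p + mu_y * x

    @torch.no_grad()
    def step(self):
        # Restore z, then apply the base step using gradients taken at y.
        for p, z in zip(self.params, self._z_backup):
            p.copy_(z)
        self.base_opt.step()
        # Smooth the evaluation sequence: x <- mu_x * x + (1 - mu_x) * z_new.
        for p, x in zip(self.params, self.x):
            x.mul_(self.mu_x).add_(p, alpha=1.0 - self.mu_x)
```

In a training loop one would call `interpolate_for_grad()` before the forward/backward pass, `step()` after it (with `base_opt.zero_grad()` as usual), and evaluate or checkpoint the model from the buffer `x` rather than the live parameters.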
Comparative Empirical Results
The paper evaluates GPA extensively on the Llama-160M and Llama-1B LLM pre-training tasks, as well as ViT-S/16 training on ImageNet in both small and large batch scenarios.
LLM Training: GPA provides a 24.22% speedup in steps to reach AdamW's validation loss on Llama-160M and outperforms both AdamW and DiLoCo across a range of inner step configurations. For Llama-1B, GPA yields improved final validation losses consistently, especially as the number of inner steps increases.
Figure 1: Both GPA and single-worker DiLoCo with an AdamW base optimizer outperform the AdamW baseline on a 160M-parameter Llama model, with GPA's interpolation constants set heuristically to match DiLoCo's inner steps.
Figure 2: Validation loss comparison for AdamW, DiLoCo, and GPA with a fixed number of inner steps (H=32) highlights GPA's faster convergence for LLM workloads.
Vision Tasks: On ImageNet training with ViT-S/16, GPA achieves 12% and 27% speedup for small (4,096) and large (16,384) batch sizes, respectively, to attain AdamW baseline validation accuracy.
Figure 3: GPA versus AdamW on ImageNet ViT-S/16 with augmentations at batch size 4,096—GPA yields higher accuracy throughout training.
Figure 4: GPA versus AdamW on ImageNet ViT-S/16 with augmentations at batch size 16,384—GPA demonstrates superior accuracy and stability.
Across all experiments, GPA exhibits smoother and more stable training curves, with the ability to tolerate higher learning rates than DiLoCo or AdamW. Sensitivity analyses confirm the need for decoupled interpolation constants; coupling them as in standard Nesterov formulations is demonstrably suboptimal.
Theoretical Guarantees and Convergence Analysis
A central strength of the paper is its theoretical justification for GPA. In convex and stochastic settings, if the base optimizer achieves a regret bound of $O(\sqrt{T})$, GPA matches or exceeds the original optimizer's convergence rate for the average iterate, depending on the choice of $\mu_x$ and $\mu_y$. Negative Bregman divergence terms in the derived bounds provide the potential for accelerated last-iterate convergence, especially when objective function variations are nonlinear between consecutive iterates.
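For intuition, the generic online-to-batch conversion shows how an $O(\sqrt{T})$ regret bound already yields a guarantee for an averaged iterate; the paper's actual bound, including its dependence on $\mu_x$, $\mu_y$, and the negative Bregman-divergence terms, is not reproduced here:

$$
f\!\Big(\frac{1}{T}\sum_{t=1}^{T} y_t\Big) - f(x^\star)
\;\le\; \frac{1}{T}\sum_{t=1}^{T}\big(f(y_t) - f(x^\star)\big)
\;\le\; \frac{R_T}{T} \;=\; O\!\Big(\frac{1}{\sqrt{T}}\Big),
$$

by convexity and Jensen's inequality, where $y_t$ are the points at which gradients are taken and $R_T$ denotes the base optimizer's regret against the comparator $x^\star$.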
This decoupled structure also clarifies the interplay between smoothing (via $\mu_x$) and information flow (via $\mu_y$). The analysis further explains why increasing the number of inner steps in DiLoCo can yield counterintuitive improvements, a behavior that GPA lets one tune continuously through its interpolation constants rather than discretely.
Algorithmic, Practical, and Distributed Implications
From an implementation perspective, GPA requires only a single additional buffer and significantly fewer hyperparameters than DiLoCo, reducing both tuning and memory overhead. Its theoretical guarantees and empirical stability suggest wider compatibility with advanced base optimizers, including Shampoo, SOAP, and Muon, and facilitate the use of modular norm theory for hyperparameter transfer across architectures.
GPA’s continuous smoothing parameter also decouples local SGD inner steps and momentum, enabling redesign of distributed algorithm frameworks such as DiLoCo for cross-region training. This flexibility is expected to be beneficial in federated, asynchronous, or high-latency distributed regimes.
Figures on Optimization Dynamics
Further experimental results illuminate the superior learning dynamics of GPA compared to AdamW and DiLoCo for a range of hyperparameters and inner step counts.
Figure 5: Validation loss trajectory for GPA, DiLoCo, and AdamW for effective inner steps H=8 (left) and H=16 (right); GPA maintains lower loss and smoother convergence.
Figure 6: AdamW hyperparameter sweep ($\beta_1$ vs. $\beta_2$).
Figure 7: GPA-AdamW hyperparameter landscape ($\mu_y$ vs. $\mu_x$).
Figure 8: DiLoCo-AdamW sweep ($\gamma$ vs. $\tilde{\gamma}$, $\mu$) showing their interaction and tuning behavior.
Conclusion
GPA provides a principled and practical generalization of primal averaging, outperforming both AdamW and single-worker DiLoCo on dense LLM and vision workloads, while reducing memory and tuning complexity. Its decoupled parameters afford continuous and theoretically grounded control over model update recency and smoothing, explaining and extending previous optimizer performance gains. The stability and efficiency of GPA position it as a highly competitive approach for pre-training LLMs, with promising implications for large-scale distributed optimization and compatibility with emerging optimizers.
Further study should probe GPA's convergence and generalization dynamics in non-convex regimes, extend empirical validation across architectures and modalities, and investigate hyperparameter transfer, distributed design, and optimizer stacking scenarios. The framework’s monotonicity, stability, and accelerated learning mark an advance in large-scale model optimization (2512.17131).