
Generalized Primal Averaging (GPA)

Updated 22 December 2025
  • Generalized Primal Averaging (GPA) is an optimization technique that uses weighted and exponential averaging of primal iterates to improve convergence in first-order and primal-dual algorithms.
  • GPA integrates concepts such as dual averaging, Nesterov momentum, and stochastic optimization, thereby accelerating convergence and reducing communication rounds in distributed settings.
  • Practical implementations of GPA demonstrate enhanced scalability and robustness in large-scale machine learning, achieving significant speedups over traditional methods.

Generalized Primal Averaging (GPA) is a family of optimization techniques that implement weighted or exponential averaging of primal iterates within first-order or primal-dual algorithms. GPA arises in both distributed and non-distributed settings, seeking to combine strong theoretical guarantees with simplicity, acceleration, and communication-efficiency. GPA interfaces tightly with dual averaging, Nesterov-type momentum, and modern stochastic optimization, and underpins the convergence theory of several widely-used optimizers in machine learning.

1. Mathematical Formulation and Core Principles

GPA algorithms maintain one or more sequences of primal variables, updating them via recursive weighted averaging of model iterates or search directions. A basic GPA update for iterates x^{(t)} is

x^{(t+1)} = \mu_x \, x^{(t)} + (1-\mu_x) \, z^{(t+1)}

where \mu_x \in [0,1) denotes the exponential moving average (EMA) coefficient and z^{(t+1)} is the update from the base optimizer. Typical implementations also feature a "gradient computation" point,

y^{(t)} = \mu_y \, x^{(t)} + (1-\mu_y) \, z^{(t)},

decoupling the point at which gradients are computed (y^{(t)}) from the point at which the model is evaluated (x^{(t)}). This decoupling generalizes classical averaging schemes such as Polyak–Ruppert averaging, Nesterov acceleration, and momentum (Defazio et al., 18 Dec 2025, Defazio, 2020).
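
This recursion can be sketched in a few lines of NumPy, here wrapping plain gradient descent as the base optimizer; the quadratic objective, step size, and coefficients are illustrative assumptions, not values from the cited papers:

```python
import numpy as np

def gpa_step(x, z, grad_fn, lr=0.1, mu_x=0.9, mu_y=0.9):
    """One GPA step: gradient at y, base update on z, EMA into x."""
    y = mu_y * x + (1 - mu_y) * z            # gradient-computation point
    z_new = z - lr * grad_fn(y)              # base optimizer (plain SGD)
    x_new = mu_x * x + (1 - mu_x) * z_new    # EMA of primal iterates
    return x_new, z_new

# Toy check: minimize f(w) = ||w||^2 / 2, whose gradient is w.
x = z = np.ones(3)
for _ in range(200):
    x, z = gpa_step(x, z, grad_fn=lambda w: w)
# x converges toward the minimizer at the origin.
```

Setting mu_y = 0 recovers gradient evaluation at the raw iterate z, while mu_y = mu_x evaluates gradients near the model average.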

In distributed convex optimization, GPA manifests in frameworks such as CoCoA⁺ (Ma et al., 2015), where local subproblems yield primal and dual updates that are aggregated globally with tunable weight. In constrained convex programs, primal recovery is effected by weighted Fenchel-type averaging over primal approximations (Tran-Dinh, 2015).

2. Distributed Primal-Dual Algorithms and Communication Efficiency

In the distributed setting, GPA is exemplified by the CoCoA⁺ framework for empirical risk minimization (Ma et al., 2015). The primal problem is

\min_{w\in\mathbb R^d} \quad \frac{1}{n} \sum_{i=1}^n \ell_i(w^T x_i) + \frac{\lambda}{2}\|w\|^2,

with dual

\max_{\alpha\in\mathbb R^n} \quad -\frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda n} A \alpha \right\|^2.

GPA-type global update rules interpolate between conservative averaging,

\alpha^{(t+1)} = \alpha^{(t)} + \frac{1}{K} \sum_{k=1}^K \Delta\alpha_{[k]}^{(t)}

and aggressive (additive) aggregation,

\alpha^{(t+1)} = \alpha^{(t)} + \sum_{k=1}^K \Delta\alpha_{[k]}^{(t)},

where K is the number of workers. Setting the aggregation parameter \nu = 1 yields the GPA/CoCoA⁺ variant, which provably achieves iteration complexity independent of K under suitable quadratic-separability conditions.
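
The two aggregation rules differ only in how the summed local dual updates are scaled; a minimal sketch (the helper name and toy values are illustrative, not from the CoCoA⁺ paper):

```python
import numpy as np

def aggregate(alpha, local_deltas, nu):
    """Global dual update: nu = 1/K recovers conservative averaging,
    nu = 1 the aggressive additive (CoCoA+/GPA-type) rule."""
    return alpha + nu * np.sum(local_deltas, axis=0)

K = 8
alpha = np.zeros(4)
deltas = [0.5 * np.ones(4) for _ in range(K)]      # identical worker updates
conservative = aggregate(alpha, deltas, nu=1 / K)  # effective step 0.5
additive = aggregate(alpha, deltas, nu=1.0)        # effective step 4.0
```

With identical local updates, the additive rule takes a K-times larger global step, which is why it must be paired with a suitable separability parameter in the local subproblems.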

Empirically, CoCoA⁺ converges in roughly half the communication rounds of the original CoCoA. Strong scaling is observed in large clusters; increasing K can even decrease time-to-accuracy, whereas naive averaging degrades linearly with K. Safe settings (\nu = 1, \sigma' = K) require no tuning and are robust across data regimes (Ma et al., 2015).

3. Weighted Primal Averaging and Accelerated Primal-Dual Algorithms

GPA is foundational in alternating minimization algorithms (AMA) for linearly constrained convex programs (Tran-Dinh, 2015). Consider

\min_{u\in U, v\in V} \quad g(u) + h(v) \quad \text{s.t.} \quad Au + Bv = c,

with no strong convexity required. The primal recovery step computes weighted averages over primal candidates,

\bar{u}^k = \frac{1}{S_k}\sum_{i=0}^k w_i \tilde{u}^i, \quad \bar{v}^k = \frac{1}{S_k}\sum_{i=0}^k w_i \tilde{v}^i,

where w_i are positive weights (such as dual step-sizes or accelerated weights) and S_k = \sum_{i=0}^k w_i is their running sum.

Non-accelerated primal-dual AMA with GPA yields an \varepsilon-solution in O(L D_U / \varepsilon^2) iterations, while accelerated variants achieve O(\sqrt{L D_U} / \varepsilon), both without requiring strong convexity. These rates are optimal for black-box first-order models (Nemirovskii–Yudin lower bounds) (Tran-Dinh, 2015).
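
The weighted primal recovery step above is straightforward to implement; a sketch with illustrative candidates and weights:

```python
import numpy as np

def weighted_primal_average(candidates, weights):
    """bar_u^k = (1/S_k) * sum_i w_i * u_i, with S_k = sum_i w_i."""
    w = np.asarray(weights, dtype=float)
    U = np.asarray(candidates, dtype=float)
    return (w[:, None] * U).sum(axis=0) / w.sum()

# Later candidates weighted more heavily, as with accelerated weights.
u_bar = weighted_primal_average([[0, 0], [2, 2], [4, 4]], [1, 2, 3])
```

Increasing weights bias the average toward recent candidates, which is what distinguishes accelerated weighting from uniform Polyak–Ruppert averaging.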

4. Momentum, Non-Convex Optimization, and Lyapunov Analysis

GPA directly specializes to stochastic primal averaging (SPA) forms, which are equivalent to SGD with momentum ("heavy-ball") under a specific reparameterization (Defazio, 2020). The core update involves two sequences,

\begin{aligned} z_{k+1} &= z_k - \eta_k \nabla f(x_k, \xi_k), \\ x_{k+1} &= (1 - c_{k+1}) x_k + c_{k+1} z_{k+1}, \end{aligned}

with the heavy-ball step size and momentum coefficient given by \alpha_k = \eta_k c_{k+1}, \beta_k = \frac{\eta_{k-1}(1-c_k)}{\eta_k}.
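
For constant η and c this equivalence can be checked numerically: SPA and heavy-ball SGD with α = ηc, β = 1 − c produce identical iterates. A sketch on a deterministic quadratic (objective and constants are illustrative):

```python
import numpy as np

eta, c = 0.1, 0.2
grad = lambda x: x                        # f(x) = x^2 / 2

# Stochastic primal averaging (deterministic gradient for the check)
z = x_spa = np.array([1.0])
xs_spa = [x_spa]
for _ in range(50):
    z = z - eta * grad(x_spa)
    x_spa = (1 - c) * x_spa + c * z
    xs_spa.append(x_spa)

# Heavy-ball reparameterization: alpha = eta * c, beta = 1 - c
alpha, beta = eta * c, 1 - c
x_prev = x_hb = np.array([1.0])           # convention x_{-1} = x_0
xs_hb = [x_hb]
for _ in range(50):
    x_prev, x_hb = x_hb, x_hb - alpha * grad(x_hb) + beta * (x_hb - x_prev)
    xs_hb.append(x_hb)
```

The match follows from substituting c z_k = x_k − (1 − c) x_{k−1} into the SPA averaging step.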

Lyapunov-based analysis yields tight O(1/\sqrt{T}) bounds in the non-convex stochastic setting, matching that of SGD but with reduced variance and improved constants. The negative term proportional to \|x_k - x_{k-1}\|^2 cancels a substantial portion of the noise, explaining the empirical advantage of momentum, especially in early epochs.

Practical guidance includes simultaneous scheduling of learning rates and averaging weights: for example, upon decaying \eta by a factor \phi, increase c by \phi (i.e., decrease the momentum coefficient). The marginal benefit of momentum vanishes after a few training stages.

5. GPA in Large-Scale Machine Learning and LLM Training

Recent work extends GPA to wrap modern optimizers (AdamW, etc.) via EMA, delivering strong empirical performance in non-distributed, single-worker settings (Defazio et al., 18 Dec 2025). A canonical algorithm maintains three points: y^{(t)} (gradient computation), z^{(t)} (update), and x^{(t)} (model average),

y(t)   = μ_y * x(t) + (1 - μ_y) * z(t)
g(t)   = ∇f(y(t); ξ(t))
d(t)   = BaseOpt(g(t))
z(t+1) = (1 - γ(t)*λ) * z(t) + γ(t) * d(t)
x(t+1) = μ_x * x(t) + (1 - μ_x) * z(t+1)
where \mu_x and \mu_y are independent EMA coefficients. By unifying the averaging schemes of Schedule-Free (SF) and single-worker DiLoCo, GPA removes the need for two-loop structures and additional hyperparameters.
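
A minimal runnable version of this recursion, under the assumption that the base optimizer is plain SGD (d = −g) and the objective is a toy quadratic (all constants are illustrative, not tuned values from the paper):

```python
import numpy as np

def gpa_wrap(grad_fn, base_opt, z0, steps, gamma=0.1, lam=0.0,
             mu_x=0.99, mu_y=0.9):
    """Sketch of the GPA wrapper: y (gradient point), z (update), x (average)."""
    x = z = z0.copy()
    for _ in range(steps):
        y = mu_y * x + (1 - mu_y) * z          # gradient-computation point
        d = base_opt(grad_fn(y))               # base-optimizer direction
        z = (1 - gamma * lam) * z + gamma * d  # decoupled weight decay on z
        x = mu_x * x + (1 - mu_x) * z          # model average (EMA)
    return x

# Minimize f(w) = ||w||^2 / 2 with SGD as the base optimizer (d = -g).
x = gpa_wrap(grad_fn=lambda w: w, base_opt=lambda g: -g,
             z0=np.ones(3), steps=2000)
```

Swapping `base_opt` for an AdamW-style direction would recover the GPA-AdamW configuration described above; only `d(t)` changes.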

On Llama-160M and Llama-1B models, GPA-AdamW reduces the number of steps needed to reach the baseline loss by up to 24% compared to AdamW, and outperforms single-worker DiLoCo even at the latter's optimal inner-step count. On ImageNet ViT workloads, similar speedups and top-1 accuracy gains are reported across batch sizes.

Theoretical guarantees show that GPA inherits the O(\sqrt{T}) online regret bound of its base optimizer and can further accelerate convergence, depending on the values of \mu_x and \mu_y, due to non-negative Bregman-divergence terms.

6. Hyperparameter Selection, Implementation, and Practical Guidelines

Implementing GPA requires storing at most one additional buffer beyond the base optimizer state, which can often be reduced further by implicit reconstruction (x = y/\mu_y + (1 - 1/\mu_y) z). Standard hyperparameter schedules (AdamW betas, learning rates) can be retained. Tuning \mu_x and \mu_y translates directly from DiLoCo: given DiLoCo parameters (H, \mu), set \mu_x \approx \mu^{1/H}, \mu_y \approx \mu.
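
The implicit-reconstruction identity follows by solving the definition of y for x, and can be verified directly (values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_y = 0.95
x = rng.standard_normal(5)   # model average (not stored explicitly)
z = rng.standard_normal(5)   # base-optimizer iterate

y = mu_y * x + (1 - mu_y) * z          # stored gradient-computation point
x_rec = y / mu_y + (1 - 1 / mu_y) * z  # reconstruct x from y and z
```

Storing y and z alone therefore suffices; x can be rebuilt on demand whenever the averaged model is needed for evaluation.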

In distributed settings, choose aggregation parameter \nu = 1, separability parameter \sigma' = K, and local-solver steps H to balance compute and communication. For unconstrained convex problems, weighted averaging with step-weights w_k yields optimal convergence rates when choosing \gamma = \varepsilon / D_U.

The online duality gap is recommended as a universal stopping criterion. No strong convexity is required for rate guarantees in constrained programs or distributed dual methods.

7. Limitations, Extensions, and Open Problems

GPA’s convergence guarantees in the non-distributed setting are presently limited to the averaged-iterate regime under convexity assumptions; extension to nonconvex last-iterate guarantees remains open (Defazio et al., 18 Dec 2025). Full compatibility with advanced base optimizers and cross-region distributed regimes demands further theoretical analysis. While GPA provides optimal rates and flexibility, the introduction of learning rate schedules may re-introduce complexity absent in pure uniform-averaging schemes.

In summary, GPA unifies and extends weighted averaging, momentum, and primal-dual update schemes, underpinning scalable, communication-efficient, and empirically superior optimization algorithms in modern machine learning, convex and nonconvex alike.
