
Gradient Tracking Diffusion Strategy

Updated 28 December 2025
  • Gradient tracking diffusion strategy is a method that integrates historical gradient information to stabilize and guide sampling in diffusion models.
  • The approach employs progressive likelihood warm-up and adaptive directional momentum smoothing to balance prior and likelihood gradients effectively.
  • Empirical results with SPGD show superior restoration performance in tasks like inpainting and deblurring, improving metrics such as PSNR, SSIM, and LPIPS.

Gradient Tracking Diffusion Strategy refers to a class of techniques (prominently in modern generative modeling and decentralized optimization) that incorporate, track, and manage gradient information within diffusion-based frameworks. These approaches aim to stabilize, accelerate, or bias the generative or optimization trajectory by exploiting historical and/or structured gradients, rather than relying solely on pointwise or myopic updates. They are critical both in high-dimensional inverse problems (e.g., image restoration, hyperspectral covariance estimation) and in distributed learning settings.

1. Mathematical Formulation: Gradient Tracking in Diffusion Schemes

In the context of image restoration via diffusion models, the canonical framework involves Bayesian inference with a pre-trained unconditional diffusion prior and an explicit data (likelihood) constraint. Mathematically, the conditional score is decomposed as

\nabla_{x_t} \log p(x_t|y) = \nabla_{x_t} \log p_\theta(x_t) + \nabla_{x_t} \log p(y|x_t)

where:

  • \nabla_{x_t} \log p_\theta(x_t) \approx -\epsilon_\theta(x_t,t)/\sqrt{1-\bar\alpha_t} is the learned prior score,
  • \nabla_{x_t} \log p(y|x_t) \approx -\zeta \nabla_{x_t} \| y - A(\hat x_0(x_t)) \|_2^2 is a DPS-style likelihood gradient, with \hat x_0(x_t) = (x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(x_t,t))/\sqrt{\bar\alpha_t} (Wu et al., 9 Jul 2025).

A single-step DDIM-type update with explicit gradient guidance is

x_{t-1} = (1/\sqrt{\alpha_t})\, x_t - g_d(x_t) - \zeta\, g_l(x_t)

with g_d(x_t) and g_l(x_t) representing scaled prior and likelihood gradients, respectively. This structure enables precise tracking and management of the contributions of each gradient component throughout the reverse diffusion process.
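A minimal PyTorch-style sketch of one such guided reverse step is given below. The noise-prediction network eps_model, the forward operator A, and the schedules alpha / alpha_bar are placeholder names for components supplied by the surrounding pipeline, and zeta is the likelihood strength; this is a sketch under those assumptions, not a reference implementation.

import torch

def guided_ddim_step(x_t, y, t, eps_model, A, alpha, alpha_bar, zeta):
    """One DDIM-type reverse step with explicit DPS-style likelihood guidance.
    eps_model, A, alpha, and alpha_bar are assumed (hypothetical) components of
    the surrounding diffusion pipeline."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)                                   # learned noise prediction ε_θ(x_t, t)

    # Posterior-mean estimate x̂_0(x_t)
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha_bar[t])

    # Likelihood (data-fidelity) gradient g_l = ∇_{x_t} ||y - A(x̂_0)||^2
    residual = (y - A(x0_hat)).pow(2).sum()
    g_l = torch.autograd.grad(residual, x_t)[0]

    # Scaled prior gradient g_d, using the DDIM coefficient from the sampling scheme in Section 4
    coeff = torch.sqrt(1 - alpha_bar[t]) / torch.sqrt(alpha[t]) - torch.sqrt(1 - alpha_bar[t - 1])
    g_d = coeff * eps

    # Combined update: x_{t-1} = (1/√α_t) x_t - g_d - ζ g_l
    return (x_t / torch.sqrt(alpha[t]) - g_d - zeta * g_l).detach()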

2. Instabilities in Naïve Gradient Guidance: Empirical Gradient Dynamics

Direct combination of prior and likelihood gradients in diffusion guidance leads to two key instabilities:

  • Direction conflict: Empirically, the angle between g_d(x_t) and g_l(x_t) can deviate significantly from orthogonality, especially in early steps, leading to update directions that oppose each other. This degrades the effectiveness of both the prior and the likelihood guidance (Wu et al., 9 Jul 2025).
  • Temporal fluctuation: The likelihood gradient direction g_l(x_t) can vary erratically between consecutive timesteps (large angle between g_l(x_t) and g_l(x_{t+1})), injecting high-frequency noise into the sampling trajectory. This non-smoothness often manifests as restoration artifacts or stalling.

Both phenomena are quantitatively observed in angular statistics along the reverse trajectory and are directly linked to suboptimal recovery quality.
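As an illustration of how such angular diagnostics can be computed, the sketch below measures the angle between the prior and likelihood gradients at one step and between likelihood gradients at consecutive steps; the gradient tensors here are random stand-ins for quantities that would come from the sampling loop.

import math
import torch

def angle_deg(u, v, eps=1e-12):
    """Angle in degrees between two gradient tensors (flattened)."""
    u, v = u.flatten(), v.flatten()
    cos = torch.dot(u, v) / (u.norm() * v.norm() + eps)
    return math.degrees(torch.acos(cos.clamp(-1.0, 1.0)).item())

# Random stand-ins for gradients collected along the reverse trajectory.
g_d, g_l_t, g_l_prev = (torch.randn(3, 64, 64) for _ in range(3))

# Angles well above 90° between g_d and g_l indicate direction conflict;
# large angles between consecutive g_l indicate temporal fluctuation.
print("prior/likelihood angle:", angle_deg(g_d, g_l_t))
print("consecutive likelihood angle:", angle_deg(g_l_t, g_l_prev))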

3. Stabilized Progressive Gradient Diffusion (SPGD): Algorithmic Strategy

To overcome these instabilities, Stabilized Progressive Gradient Diffusion (SPGD) introduces two intertwined mechanisms:

  • Progressive Likelihood Warm-Up: Instead of applying the full likelihood gradient in a single step, the update is split into N smaller sub-steps per diffusion iteration, x_t^{(j+1)} = x_t^{(j)} - (\zeta/N)\, \tilde{g}_l(x_t^{(j)}). This ensures that the likelihood term is adaptively introduced, mitigating abrupt conflicts with the prior and gradually aligning directions (Wu et al., 9 Jul 2025).
  • Adaptive Directional Momentum (ADM) Smoothing: Within the inner loop, the raw likelihood gradient g_l is smoothed by a momentum accumulator that is adaptively weighted based on the directional cosine similarity between successive gradients:

\tilde{g}_l^{(j)} = \alpha_j \beta\, \tilde{g}_l^{(j-1)} + (1 - \alpha_j \beta)\, g_l(x_t^{(j)})

where \alpha_j = (\text{sim}(\tilde{g}_l^{(j-1)}, g_l(x_t^{(j)})) + 1)/2 with \text{sim}(\cdot,\cdot) the cosine similarity, and \beta is the base momentum. Perfect alignment results in standard momentum, while disagreement damps the accumulation, reducing erratic propagation.

These operations together yield a robust, smooth, and directionally stable update path toward the data-consistent solution.
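A minimal PyTorch sketch of the ADM update follows; g_smooth is the running accumulator \tilde{g}_l^{(j-1)}, g_l is the fresh likelihood gradient, and beta is the base momentum (the function name and default values are illustrative, not taken from the paper).

import torch
import torch.nn.functional as F

def adm_update(g_smooth, g_l, beta=0.9):
    """Adaptive Directional Momentum: blend the running likelihood gradient with
    the new one, weighting the momentum by how well their directions agree."""
    sim = F.cosine_similarity(g_smooth.flatten(), g_l.flatten(), dim=0)
    alpha_j = (sim + 1.0) / 2.0          # cosine similarity mapped from [-1, 1] to [0, 1]
    return alpha_j * beta * g_smooth + (1.0 - alpha_j * beta) * g_l

When the two gradients agree perfectly (sim = 1), this reduces to plain momentum with coefficient β; when they oppose (sim = -1), α_j vanishes and the accumulator is effectively replaced by the fresh gradient.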

4. Implementation: SPGD Sampling Procedure

Below is an outline of the SPGD diffusion sampling scheme (Wu et al., 9 Jul 2025):

Input: y, A, ε_θ, T (outer steps), N (inner steps), ζ (likelihood strength), β (momentum)
Initialize x_T ~ N(0,I)
For t = T downto 1:
    Set x_t^(0) = x_t
    For j = 0 to N-1:
        Compute g_l = ∇_x || y - A( x̂_0(x_t^(j)) ) ||^2
        If j > 0:
            α_j = (sim( g̃_l, g_l ) + 1)/2
            g̃_l = α_j β g̃_l + (1 - α_j β) g_l
        Else:
            g̃_l = g_l
        x_t^(j+1) = x_t^(j) - (ζ/N) g̃_l
    End
    Compute g_d = ( √(1-ᾱ_t)/√α_t - √(1-ᾱ_{t-1}) ) ε_θ(x_t^(N), t)
    x_{t-1} = (1/√α_t) x_t^(N) - g_d
End
Return x_0

This procedure inserts an N-step likelihood warm-up with ADM smoothing before the standard prior-based update at each diffusion step, resulting in a stable reverse trajectory.
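For concreteness, a rough PyTorch realization of this procedure might look as follows; eps_model, A, y, and the schedules alphas / alpha_bars are assumed inputs with hypothetical names, so this is a sketch under those assumptions rather than the reference implementation.

import torch

@torch.no_grad()
def spgd_sample(y, A, eps_model, alphas, alpha_bars, T, N=4, zeta=1.0, beta=0.9,
                shape=(1, 3, 256, 256)):
    """Sketch of SPGD sampling: an N-step likelihood warm-up with ADM smoothing,
    followed by the standard DDIM-style prior update at each diffusion step."""
    x_t = torch.randn(shape)
    for t in range(T, 0, -1):
        g_smooth = None
        for j in range(N):
            # The likelihood gradient needs autograd, so re-enable it locally.
            with torch.enable_grad():
                x_in = x_t.detach().requires_grad_(True)
                eps = eps_model(x_in, t)
                x0_hat = (x_in - torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alpha_bars[t])
                loss = (y - A(x0_hat)).pow(2).sum()
                g_l = torch.autograd.grad(loss, x_in)[0]
            if g_smooth is None:
                g_smooth = g_l                                 # first sub-step: no history yet
            else:
                sim = torch.nn.functional.cosine_similarity(g_smooth.flatten(), g_l.flatten(), dim=0)
                alpha_j = (sim + 1.0) / 2.0
                g_smooth = alpha_j * beta * g_smooth + (1.0 - alpha_j * beta) * g_l
            x_t = x_t - (zeta / N) * g_smooth                  # progressive warm-up sub-step
        # Prior-based DDIM-style update after the warm-up loop.
        eps = eps_model(x_t, t)
        g_d = (torch.sqrt(1 - alpha_bars[t]) / torch.sqrt(alphas[t])
               - torch.sqrt(1 - alpha_bars[t - 1])) * eps
        x_t = x_t / torch.sqrt(alphas[t]) - g_d
    return x_t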

5. Comparative Performance and Empirical Results

SPGD achieves consistently superior results across diverse image restoration settings, as shown below (each cell reports PSNR↑ / SSIM↑ / LPIPS↓) (Wu et al., 9 Jul 2025):

Task              | PnP-ADMM          | DPS               | RED-Diff          | SPGD (Ours)
Inpainting (FFHQ) | 27.99/0.729/0.306 | 26.11/0.802/0.180 | 27.17/0.799/0.159 | 30.87/0.889/0.120
Gauss. deblur     | 26.07/0.758/0.260 | 26.51/0.782/0.181 | 24.69/0.672/0.288 | 27.83/0.775/0.172
Motion deblur     | 25.86/0.772/0.278 | 25.58/0.752/0.212 | 26.24/0.706/0.255 | 29.41/0.834/0.158
SR ×4             | 27.75/0.835/0.246 | 27.06/0.803/0.187 | 29.06/0.800/0.243 | 29.35/0.831/0.137

On ImageNet, similar gains are maintained:

  • Inpainting: SPGD 26.28/0.798/0.165, best among compared methods.
  • Gaussian deblur: SPGD 24.80/0.651/0.229, best among compared methods.

SPGD also produces substantially more visually coherent and artifact-free restorations in inpainting, deblurring, and super-resolution tasks. For example, in inpainting, it reconstructs fine facial features with natural shading, and in deblurring, it recovers crisp edges and high-quality textures while suppressing ringing and ghosting.

6. Broader Context: Connections and Variants

Gradient-tracking diffusion strategies unify several sampling and optimization approaches:

  • SPGD is one realization, explicitly targeting prior–likelihood conflict and temporal instability in diffusion sampling (Wu et al., 9 Jul 2025).
  • Related approaches include History Gradient Update (HGU), which tracks and aggregates historical data-fidelity gradients in diffusion-based inverse solvers via momentum or Adam-style updates, yielding accelerated convergence and better robustness (He et al., 2023).
  • Outside generative restoration, similar gradient-tracking formulations underpin decentralized optimization, such as Flexible Gradient Tracking Algorithms and Local-Update Gradient Tracking, which coordinate stochastic and communication-minimized iterations while maintaining consensus and convergence guarantees (Berahas et al., 2023, Nguyen et al., 2022).
  • Multiple works in image translation (e.g., Asymmetric Gradient Guidance) also rely on gradient-tracking to balance fidelity and style/content constraints in the sampling loop (Kwon et al., 2023).

These methods share the core principle of leveraging rich local or recent gradient history—often via smoothing, momentum, or progressive scheduling—in the presence of stochasticity or competing objectives to enhance the efficacy, efficiency, and stability of diffusion-based updates.


In summary, gradient-tracking diffusion strategies, and SPGD in particular, represent a mathematically rigorous and empirically validated solution to core instabilities in guided diffusion, providing systematic control over gradient interactions and noise. They are broadly extensible across inverse imaging, multitask restoration, optimization, and distributed learning contexts (Wu et al., 9 Jul 2025, He et al., 2023, Berahas et al., 2023, Nguyen et al., 2022).
