RigL-DM: Dynamic Sparse Diffusion Training

Updated 23 November 2025
  • RigL-DM is a dynamic sparse training algorithm that adapts nonzero parameters during training to optimize reverse denoising networks in diffusion models.
  • It employs periodic pruning and gradient-based regrowth to maintain a fixed sparsity level, achieving substantial reductions in trainable parameters and computational cost.
  • RigL-DM enables training sparse diffusion models from scratch without dense pre-training, delivering competitive or superior image quality compared to dense models.

RigL-DM is a dynamic sparse training algorithm specifically adapted for the reverse denoising networks of unconditional diffusion models (DMs). Unlike dense training or static sparse masking, RigL-DM exploits the dynamic adjustment of nonzero parameters during the entire training process, maintaining a fixed sparsity level while seeking efficiency in both memory and computation. RigL-DM enables training sparse DMs from scratch, requiring no dense pre-training, and targets substantial reduction in trainable parameters and floating-point operations (FLOPs) with matching or superior generation quality compared to dense or statically pruned baselines (Oliveira et al., 30 Apr 2025).

1. Principles of Dynamic Sparse Training with RigL

RigL (“Rigging the Lottery”) is a Dynamic Sparse Training (DST) algorithm that maintains a fixed global sparsity $S$ throughout training. The method operates by periodically pruning and regrowing connections in the weight tensor. Active connections are those with nonzero weights, controlled by a mask $\mathbf{M}_t \in \{0,1\}^N$ applied to the weights $\mathbf{W}_t \in \mathbb{R}^N$ at training iteration $t$.

  • Pruning: At scheduled intervals, the smallest-magnitude active weights are removed. Specifically, a prune threshold $\tau_p$ is computed to eliminate the lowest $p$-fraction of active weights.
  • Regrowth: For all inactive (zeroed) positions, the absolute gradient magnitude $s_{t,i} = \bigl|\partial \mathcal{L} / \partial W_{t,i}\bigr|$ is calculated, and the positions with the highest scores are re-activated so that the total number of active weights remains $(1-S)N$.
  • Mask Update: The mask is updated according to the sets of pruned and regrown positions, maintaining the overall sparsity.

The pruning-regrowth rate $p$ and the update interval $\Delta$ are hyperparameters crucial to training performance and stability, especially at high sparsity.
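The prune-and-regrow cycle can be illustrated with a short PyTorch sketch. The function below is a minimal single-tensor illustration under assumed interfaces (the function name and arguments are not from the paper): it removes the $p$-fraction of smallest-magnitude active weights and reactivates the same number of inactive positions with the largest gradient magnitude, so the overall sparsity is preserved.

import torch

def rigl_mask_update(weight: torch.Tensor, grad: torch.Tensor,
                     mask: torch.Tensor, p: float) -> torch.Tensor:
    """One RigL-style prune/regrow step for a single weight tensor (sketch)."""
    active = mask.bool()
    n_swap = int(p * active.sum().item())      # number of connections to replace
    if n_swap == 0:
        return mask

    # Prune: smallest-magnitude active weights (inactive entries get +inf so they are never selected).
    prune_scores = torch.where(active, weight.abs(),
                               torch.full_like(weight, float("inf")))
    prune_idx = torch.topk(prune_scores.flatten(), n_swap, largest=False).indices

    # Regrow: largest-|gradient| inactive positions (active entries get -inf so they are never selected).
    grow_scores = torch.where(active, torch.full_like(grad, float("-inf")), grad.abs())
    grow_idx = torch.topk(grow_scores.flatten(), n_swap, largest=True).indices

    new_mask = mask.clone().flatten()
    new_mask[prune_idx] = 0
    new_mask[grow_idx] = 1
    return new_mask.view_as(mask)

In the full algorithm this update is applied across all masked layers under a shared global sparsity budget rather than per tensor in isolation.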

2. Adaptation of RigL for Diffusion Models

RigL-DM targets the denoising network $\epsilon_\theta(x_t, t)$ within various DM architectures. In Latent Diffusion, RigL-DM applies a shared global mask to all U-Net weights, keeping the autoencoder $(\mathcal{E}, \mathcal{D})$ frozen. For ChiroDiff, a mask is applied to every linear and GRU layer. RigL-DM's training objective remains the canonical DM forward loss:

$\mathcal{L} = \mathbb{E}_{x_0, t, \epsilon}\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2, \qquad x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon.$

Between mask updates, standard masked gradient descent using AdamW is performed. Mask updates (pruning and regrowth) are executed every $\Delta$ gradient steps ($\Delta_{\rm LD} = 1100$ for Latent Diffusion, $\Delta_{\rm CD} = 800$ for ChiroDiff).
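For concreteness, one masked training step between mask updates might look like the following sketch; the helper name, the per-parameter mask dictionary, and the alphas schedule tensor are assumptions for illustration, not the paper's implementation.

import torch

def masked_diffusion_step(model, masks, optimizer, x0, alphas):
    """One masked AdamW step on the canonical DM loss (illustrative sketch).

    model   : denoiser eps_theta(x_t, t)
    masks   : dict mapping parameter name -> binary mask tensor (assumed layout)
    alphas  : 1-D tensor holding the noise schedule, indexed by timestep
    """
    t = torch.randint(0, alphas.numel(), (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a_t = alphas[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over non-batch dims
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * eps    # forward diffusion sample

    loss = ((eps - model(x_t, t)) ** 2).mean()          # ||eps - eps_theta(x_t, t)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Re-apply the masks so pruned weights remain exactly zero between mask updates.
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
    return loss.item()

Re-applying the masks after each optimizer step keeps pruned weights at exactly zero, which is what holds the parameter and FLOP counts at the target sparsity.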

3. Algorithmic Workflow and Mathematical Formulation

RigL-DM relies on component-wise masking and regular mask updates to maintain model sparsity while allowing the mask structure to adapt. Key operations include:

  • Masked weights: $\widetilde{\mathbf{W}}_t = \mathbf{M}_t \circ \mathbf{W}_t$, where $\circ$ denotes element-wise multiplication.
  • Pruning criterion: Compute $\tau_p$ as the $p$-th quantile of the active $|W_{t,i}|$, and set $M_{t+1,i} \leftarrow 0$ if $|W_{t,i}| < \tau_p$.
  • Gradient-based regrowth: For all positions $i$ with $M_{t,i} = 0$, scores $s_{t,i}$ are computed, and the top $p\,|\mathcal{A}_t|$ positions (ranked by $s_{t,i}$, where $\mathcal{A}_t$ denotes the set of active positions) are activated through the mask update.
  • Sparsity schedule: The overall sparsity $S$ is typically kept constant; ramping $S$ up through a warm-up phase is optional.

A summarized pseudocode workflow is as follows:

Initialize θ and mask M_0 with ERK at sparsity S
for each training iteration t:
    Sample batch x ∼ 𝒟, timestep t ∼ Uniform[1, T_d], noise ε ∼ 𝒩(0, I)
    x_t ← √(α_t)·x + √(1 − α_t)·ε
    Compute loss ℓ = ‖ε − ε_θ(M_{t−1} ∘ θ; x_t, t)‖²
    Update θ with AdamW
    if t mod Δ == 0:
        Prune active weights by magnitude (remove p-fraction)
        Regrow the top p-fraction of zero positions by |∂ℓ/∂θ_i|
In practice, $p = 0.5$ is used for moderate sparsity, while $p = 0.05$ stabilizes training at very high sparsity ($S \ge 0.75$).
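The initialization step in the pseudocode uses the ERK (Erdős–Rényi–Kernel) heuristic from the original RigL work to decide how many weights each layer keeps. The sketch below is a simplified allocation under assumed interfaces: the function names are illustrative, and the iterative re-allocation applied when a layer's density would exceed 1.0 is omitted.

import numpy as np

def erk_densities(shapes, sparsity):
    """Simplified ERK allocation of per-layer densities at global sparsity S."""
    scores = np.array([sum(s) / np.prod(s) for s in shapes])     # ERK score: sum of dims / product of dims
    sizes = np.array([np.prod(s) for s in shapes], dtype=float)  # number of weights per layer
    budget = (1.0 - sparsity) * sizes.sum()                      # total active weights allowed
    scale = budget / (scores * sizes).sum()                      # single factor matching the budget
    return np.clip(scale * scores, 0.0, 1.0)                     # per-layer densities

def random_mask(shape, density, rng=np.random.default_rng(0)):
    """Sample a binary mask with approximately the requested density."""
    return (rng.random(shape) < density).astype(np.float32)

Qualitatively, layers with a higher ratio of summed to multiplied dimensions (i.e., smaller layers) are kept denser, which is the intended behaviour of ERK.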

4. Comparative Empirical Results

RigL-DM was evaluated on six benchmarks encompassing both Latent Diffusion (CelebA-HQ, LSUN-Bedrooms, Imagenette) and ChiroDiff (QuickDraw, KanjiVG, VMNIST). Baselines include Dense, Static-DM (fixed ERK mask), and MagRan-DM (random regrowth plus magnitude pruning).

Summary Table: Empirical Outcomes (FID↓, Params×, Train-FLOPs×, averaged over 3 runs)

Dataset         Method      FID     Params×   FLOPs×
CelebA-HQ       Dense       32.74   1.00      1.00
CelebA-HQ       Static-DM   33.19   0.50      0.68
CelebA-HQ       MagRan-DM   32.83   0.50      0.67
CelebA-HQ       RigL-DM     32.12   0.75      0.91
LSUN-Bedrooms   Dense       31.09   1.00      1.00
LSUN-Bedrooms   Static-DM   28.79   0.75      0.91
LSUN-Bedrooms   MagRan-DM   28.20   0.75      0.91
LSUN-Bedrooms   RigL-DM     37.80   0.90      0.97

RigL-DM typically matches or outperforms dense and static baselines in low- to medium-sparsity regimes ($S = 0.25$–$0.5$), e.g. achieving lower FID than dense on CelebA-HQ at $S = 0.25$. Static-DM degrades rapidly for $S > 0.75$, while MagRan-DM can be competitive at extreme sparsity (notably on QuickDraw at 90%). For all six benchmarks, a safe and effective sparsity range is $S \in [0.25, 0.5]$.

5. Hyperparameter Tuning and Stability Considerations

The principal hyperparameters governing RigL-DM performance are the sparsity $S$, the update interval $\Delta$, and the prune/regrowth fraction $p$.

  • Sparsity $S \in [0.25, 0.5]$ is reliably effective across all benchmarks.
  • Update intervals around $\Delta \approx 1000$ iterations yield robust adaptation.
  • For moderate $S$, $p = 0.5$ strikes a balance between discovery of new structure and stable training. For $S \ge 0.75$, reducing $p$ to $0.05$ or increasing $\Delta$ prevents destabilization and may improve generalization.
  • No dense pre-training is required; sparse-to-sparse training is executed from model initialization.

A conservative mask update at high sparsity enhances stability, avoiding sharp performance decay seen in static masking at similar parameter fractions.
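These recommendations can be collapsed into a small helper for choosing defaults; the function below is purely illustrative (not part of any released RigL-DM code) and simply encodes the reported settings.

def rigl_dm_defaults(sparsity: float, backbone: str = "latent_diffusion") -> dict:
    """Heuristic RigL-DM hyperparameter defaults distilled from the guidance above."""
    delta = 1100 if backbone == "latent_diffusion" else 800   # reported update intervals (LD / ChiroDiff)
    p = 0.05 if sparsity >= 0.75 else 0.5                     # conservative updates at high sparsity
    return {"sparsity": sparsity, "update_interval": delta, "prune_fraction": p}

# Example: rigl_dm_defaults(0.5) -> {'sparsity': 0.5, 'update_interval': 1100, 'prune_fraction': 0.5}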

6. Implications, Applications, and Limitations

RigL-DM demonstrates that dynamic sparse training is feasible for both convolutional and recurrent DM backbones, enabling reduction of trainable parameter count and FLOPs by up to 75% and 50%, respectively, without compromising sample quality. This suggests significant potential for scaling DM architectures to resource-constrained or real-time applications, particularly where memory and compute efficiency are at a premium.

A plausible implication is that dynamic sparse-to-sparse training paradigms, exemplified by RigL-DM, may generalize to other generative architectures or structured predictions where dense parameterization is a limiting factor.

Limitations arise at extreme sparsity; although MagRan-DM may sometimes rival RigL-DM, performance can be dataset-dependent and is sensitive to hyperparameters.

RigL-DM draws from the Dynamic Sparse Training paradigm introduced by Evci et al. (“Rigging the Lottery: Making All Tickets Winners”), extending the approach to diffusion-based generative models (Oliveira et al., 30 Apr 2025). Alternative sparsity-inducing baselines, such as Static-DM and MagRan-DM, complement the evaluation landscape.

Future research directions include extending sparse-to-sparse training to conditional DMs, exploring architectural sparsity patterns, and benchmarking on larger or more diversely structured datasets. The effect of mask update strategies and initialization methods (e.g., ERK masks) on generalization and convergence warrants systematic study. The interaction between mask dynamics and optimizer adaptation, especially under high-sparsity regimes, remains an open technical question.


References

Oliveira et al., 30 Apr 2025.