RigL-DM: Dynamic Sparse Diffusion Training

Updated 23 November 2025
  • RigL-DM is a dynamic sparse training algorithm that adapts nonzero parameters during training to optimize reverse denoising networks in diffusion models.
  • It employs periodic pruning and gradient-based regrowth to maintain a fixed sparsity level, achieving substantial reductions in trainable parameters and computational cost.
  • RigL-DM enables training sparse diffusion models from scratch without dense pre-training, delivering competitive or superior image quality compared to dense models.

RigL-DM is a dynamic sparse training algorithm specifically adapted for the reverse denoising networks of unconditional diffusion models (DMs). Unlike dense training or static sparse masking, RigL-DM exploits the dynamic adjustment of nonzero parameters during the entire training process, maintaining a fixed sparsity level while seeking efficiency in both memory and computation. RigL-DM enables training sparse DMs from scratch, requiring no dense pre-training, and targets substantial reduction in trainable parameters and floating-point operations (FLOPs) with matching or superior generation quality compared to dense or statically pruned baselines (Oliveira et al., 30 Apr 2025).

1. Principles of Dynamic Sparse Training with RigL

RigL (“Rigging the Lottery”) is a Dynamic Sparse Training (DST) algorithm that maintains a fixed global sparsity $S$ throughout training. The method operates by periodically pruning and regrowing connections in the weight tensor. Active connections are those with nonzero weights, controlled by a mask $\mathbf{M}_t \in \{0,1\}^N$ applied to the weights $\mathbf{W}_t \in \mathbb{R}^N$ at training iteration $t$.

  • Pruning: At scheduled intervals, the smallest-magnitude active weights are removed. Specifically, a prune threshold $\tau_p$ is computed to eliminate the lowest $p$-fraction of active weights.
  • Regrowth: For all inactive (zeroed) positions, the absolute gradient magnitude $s_{t,i} = \bigl|\partial \mathcal{L} / \partial W_{t,i}\bigr|$ is calculated, and the positions with the highest scores are re-activated so that the total number of active weights remains $(1-S)N$.
  • Mask Update: The mask is updated according to the sets of pruned and regrown positions, maintaining the overall sparsity.

The pruning-regrowth rate $p$ and the update interval $\Delta$ are hyperparameters crucial to training performance and stability, especially at high sparsity.
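The prune-and-regrow cycle can be illustrated with a short PyTorch sketch. The function below is a minimal single-tensor illustration under assumed interfaces (the function name and arguments are not from the paper): it removes the $p$-fraction of smallest-magnitude active weights and reactivates the same number of inactive positions with the largest gradient magnitude, so the overall sparsity is preserved.

import torch

def rigl_mask_update(weight: torch.Tensor, grad: torch.Tensor,
                     mask: torch.Tensor, p: float) -> torch.Tensor:
    """One RigL-style prune/regrow step for a single weight tensor (sketch)."""
    active = mask.bool()
    n_swap = int(p * active.sum().item())      # number of connections to replace
    if n_swap == 0:
        return mask

    # Prune: smallest-magnitude active weights (inactive entries get +inf so they are never selected).
    prune_scores = torch.where(active, weight.abs(),
                               torch.full_like(weight, float("inf")))
    prune_idx = torch.topk(prune_scores.flatten(), n_swap, largest=False).indices

    # Regrow: largest-|gradient| inactive positions (active entries get -inf so they are never selected).
    grow_scores = torch.where(active, torch.full_like(grad, float("-inf")), grad.abs())
    grow_idx = torch.topk(grow_scores.flatten(), n_swap, largest=True).indices

    new_mask = mask.clone().flatten()
    new_mask[prune_idx] = 0
    new_mask[grow_idx] = 1
    return new_mask.view_as(mask)

In the full algorithm this update is applied across all masked layers under a shared global sparsity budget rather than per tensor in isolation.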

2. Adaptation of RigL for Diffusion Models

RigL-DM targets the denoising network $\epsilon_\theta(x_t, t)$ within various DM architectures. In Latent Diffusion, RigL-DM applies a shared global mask to all U-Net weights, keeping the autoencoder $(\mathcal{E}, \mathcal{D})$ frozen. For ChiroDiff, a mask is applied to every linear and GRU layer. RigL-DM's training objective remains the canonical DM forward loss:

$\mathcal{L} = \mathbb{E}_{x_0, t, \epsilon}\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2, \qquad x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon.$

Between mask updates, standard masked gradient descent using AdamW is performed. Mask updates (pruning and regrowth) are executed every $\Delta$ gradient steps ($\Delta_{\rm LD} = 1100$ for Latent Diffusion, $\Delta_{\rm CD} = 800$ for ChiroDiff).
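For concreteness, one masked training step between mask updates might look like the following sketch; the helper name, the per-parameter mask dictionary, and the alphas schedule tensor are assumptions for illustration, not the paper's implementation.

import torch

def masked_diffusion_step(model, masks, optimizer, x0, alphas):
    """One masked AdamW step on the canonical DM loss (illustrative sketch).

    model   : denoiser eps_theta(x_t, t)
    masks   : dict mapping parameter name -> binary mask tensor (assumed layout)
    alphas  : 1-D tensor holding the noise schedule, indexed by timestep
    """
    t = torch.randint(0, alphas.numel(), (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a_t = alphas[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over non-batch dims
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * eps    # forward diffusion sample

    loss = ((eps - model(x_t, t)) ** 2).mean()          # ||eps - eps_theta(x_t, t)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Re-apply the masks so pruned weights remain exactly zero between mask updates.
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
    return loss.item()

Re-applying the masks after each optimizer step keeps pruned weights at exactly zero, which is what holds the parameter and FLOP counts at the target sparsity.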

3. Algorithmic Workflow and Mathematical Formulation

RigL-DM relies on component-wise masking and regular mask updates to maintain model sparsity while allowing the mask structure to adapt. Key operations include:

  • Masked weights: $\widetilde{\mathbf{W}}_t = \mathbf{M}_t \circ \mathbf{W}_t$, where $\circ$ denotes element-wise multiplication.
  • Pruning criterion: Compute $\tau_p$ as the $p$-th quantile of the active $|W_{t,i}|$, and set $M_{t+1,i} \leftarrow 0$ if $|W_{t,i}| < \tau_p$.
  • Gradient-based regrowth: For all positions $i$ with $M_{t,i} = 0$, scores $s_{t,i}$ are computed, and the top $p\,|\mathcal{A}_t|$ positions (ranked by $s_{t,i}$, where $\mathcal{A}_t$ denotes the set of active positions) are activated through the mask update.
  • Sparsity schedule: The overall sparsity $S$ is typically kept constant; ramping $S$ up through a warm-up phase is optional.

A summarized pseudocode workflow is as follows:

Initialize θ and mask M_0 with ERK at sparsity S
for each training iteration t:
    Sample batch x ∼ 𝒟, timestep t ∼ Uniform[1, T_d], noise ε ∼ 𝒩(0, I)
    x_t ← √(α_t)·x + √(1 − α_t)·ε
    Compute loss ℓ = ‖ε − ε_θ(M_{t−1} ∘ θ; x_t, t)‖²
    Update θ with AdamW
    if t mod Δ == 0:
        Prune active weights by magnitude (remove p-fraction)
        Regrow the top p-fraction of zero positions by |∂ℓ/∂θ_i|
In practice, $p = 0.5$ is used for moderate sparsity, while $p = 0.05$ stabilizes training at very high sparsity ($S \ge 0.75$).
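The initialization step in the pseudocode uses the ERK (Erdős–Rényi–Kernel) heuristic from the original RigL work to decide how many weights each layer keeps. The sketch below is a simplified allocation under assumed interfaces: the function names are illustrative, and the iterative re-allocation applied when a layer's density would exceed 1.0 is omitted.

import numpy as np

def erk_densities(shapes, sparsity):
    """Simplified ERK allocation of per-layer densities at global sparsity S."""
    scores = np.array([sum(s) / np.prod(s) for s in shapes])     # ERK score: sum of dims / product of dims
    sizes = np.array([np.prod(s) for s in shapes], dtype=float)  # number of weights per layer
    budget = (1.0 - sparsity) * sizes.sum()                      # total active weights allowed
    scale = budget / (scores * sizes).sum()                      # single factor matching the budget
    return np.clip(scale * scores, 0.0, 1.0)                     # per-layer densities

def random_mask(shape, density, rng=np.random.default_rng(0)):
    """Sample a binary mask with approximately the requested density."""
    return (rng.random(shape) < density).astype(np.float32)

Qualitatively, layers with a higher ratio of summed to multiplied dimensions (i.e., smaller layers) are kept denser, which is the intended behaviour of ERK.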

4. Comparative Empirical Results

RigL-DM was evaluated on six benchmarks encompassing both Latent Diffusion (CelebA-HQ, LSUN-Bedrooms, Imagenette) and ChiroDiff (QuickDraw, KanjiVG, VMNIST). Baselines include Dense, Static-DM (fixed ERK mask), and MagRan-DM (random regrowth plus magnitude pruning).

Summary Table: Empirical Outcomes (FID↓, Params×, Train-FLOPs×, averaged over 3 runs)

Dataset         Method      FID     Params×   FLOPs×
CelebA-HQ       Dense       32.74   1.00      1.00
CelebA-HQ       Static-DM   33.19   0.50      0.68
CelebA-HQ       MagRan-DM   32.83   0.50      0.67
CelebA-HQ       RigL-DM     32.12   0.75      0.91
LSUN-Bedrooms   Dense       31.09   1.00      1.00
LSUN-Bedrooms   Static-DM   28.79   0.75      0.91
LSUN-Bedrooms   MagRan-DM   28.20   0.75      0.91
LSUN-Bedrooms   RigL-DM     37.80   0.90      0.97

RigL-DM typically matches or outperforms dense and static baselines in low- to medium-sparsity regimes ($S = 0.25$–$0.5$), e.g. achieving lower FID than dense on CelebA-HQ at $S = 0.25$. Static-DM degrades rapidly for $S > 0.75$, while MagRan-DM can be competitive at extreme sparsity (notably on QuickDraw at 90%). For all six benchmarks, a safe and effective sparsity range is $S \in [0.25, 0.5]$.

5. Hyperparameter Tuning and Stability Considerations

The principal hyperparameters governing RigL-DM performance are the sparsity $S$, the update interval $\Delta$, and the prune/regrowth fraction $p$.

  • Sparsity $S \in [0.25, 0.5]$ is reliably effective across all benchmarks.
  • Update intervals around $\Delta \approx 1000$ iterations yield robust adaptation.
  • For moderate $S$, $p = 0.5$ strikes a balance between discovery of new structure and stable training. For $S \ge 0.75$, reducing $p$ to $0.05$ or increasing $\Delta$ prevents destabilization and may improve generalization.
  • No dense pre-training is required; sparse-to-sparse training is executed from model initialization.

A conservative mask update at high sparsity enhances stability, avoiding sharp performance decay seen in static masking at similar parameter fractions.
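These recommendations can be collapsed into a small helper for choosing defaults; the function below is purely illustrative (not part of any released RigL-DM code) and simply encodes the reported settings.

def rigl_dm_defaults(sparsity: float, backbone: str = "latent_diffusion") -> dict:
    """Heuristic RigL-DM hyperparameter defaults distilled from the guidance above."""
    delta = 1100 if backbone == "latent_diffusion" else 800   # reported update intervals (LD / ChiroDiff)
    p = 0.05 if sparsity >= 0.75 else 0.5                     # conservative updates at high sparsity
    return {"sparsity": sparsity, "update_interval": delta, "prune_fraction": p}

# Example: rigl_dm_defaults(0.5) -> {'sparsity': 0.5, 'update_interval': 1100, 'prune_fraction': 0.5}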

6. Implications, Applications, and Limitations

RigL-DM demonstrates that dynamic sparse training is feasible for both convolutional and recurrent DM backbones, enabling reduction of trainable parameter count and FLOPs by up to 75% and 50%, respectively, without compromising sample quality. This suggests significant potential for scaling DM architectures to resource-constrained or real-time applications, particularly where memory and compute efficiency are at a premium.

A plausible implication is that dynamic sparse-to-sparse training paradigms, exemplified by RigL-DM, may generalize to other generative architectures or structured predictions where dense parameterization is a limiting factor.

Limitations arise at extreme sparsity; although MagRan-DM may sometimes rival RigL-DM, performance can be dataset-dependent and is sensitive to hyperparameters.

RigL-DM draws from the Dynamic Sparse Training paradigm introduced by Evci et al. (“Rigging the Lottery: Making All Tickets Winners”), extending the approach to diffusion-based generative models (Oliveira et al., 30 Apr 2025). Alternative sparsity-inducing baselines, such as Static-DM and MagRan-DM, complement the evaluation landscape.

Future research directions include extending sparse-to-sparse training to conditional DMs, exploring architectural sparsity patterns, and benchmarking on larger or more diversely structured datasets. The effect of mask update strategies and initialization methods (e.g., ERK masks) on generalization and convergence warrants systematic study. The interaction between mask dynamics and optimizer adaptation, especially under high-sparsity regimes, remains an open technical question.


References

Oliveira et al., 30 Apr 2025.