RigL-DM: Dynamic Sparse Diffusion Training
- RigL-DM is a dynamic sparse training algorithm that adapts nonzero parameters during training to optimize reverse denoising networks in diffusion models.
- It employs periodic pruning and gradient-based regrowth to maintain a fixed sparsity level, achieving substantial reductions in trainable parameters and computational cost.
- RigL-DM enables training sparse diffusion models from scratch without dense pre-training, delivering competitive or superior image quality compared to dense models.
RigL-DM is a dynamic sparse training algorithm specifically adapted for the reverse denoising networks of unconditional diffusion models (DMs). Unlike dense training or static sparse masking, RigL-DM dynamically adjusts the set of nonzero parameters over the entire training process, maintaining a fixed sparsity level while improving both memory and compute efficiency. RigL-DM enables training sparse DMs from scratch, requiring no dense pre-training, and targets substantial reductions in trainable parameters and floating-point operations (FLOPs) with matching or superior generation quality compared to dense or statically pruned baselines (Oliveira et al., 30 Apr 2025).
1. Principles of Dynamic Sparse Training with RigL
RigL (“Rigging the Lottery”) is a Dynamic Sparse Training (DST) algorithm that maintains a fixed global sparsity throughout training. The method operates by periodically pruning and regrowing connections in the weight tensor. Active connections are those with nonzero weights, controlled by a binary mask $M_t$ applied to the weights $\theta$ at training iteration $t$.
- Pruning: At scheduled intervals, the smallest-magnitude active weights are removed. Specifically, a prune threshold is computed to eliminate the lowest $p$-fraction of active weights.
- Regrowth: For all inactive (zeroed) positions, the absolute value of the gradient $|\partial \mathcal{L} / \partial \theta_i|$ is computed, and the positions with the highest scores are re-activated so that the total number of active weights stays fixed at $(1-S)$ of the total parameter count.
- Mask Update: The mask $M_t$ is updated according to the sets of pruned and regrown positions, maintaining the overall sparsity.
The pruning/regrowth rate $p$ and the update interval $\Delta T$ are hyperparameters crucial to training performance and stability, especially at high sparsity.
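A minimal PyTorch sketch of a single prune/regrow step is given below; the function name `rigl_mask_update` and the tensor arguments are illustrative assumptions, not the paper's implementation.

```python
import torch

def rigl_mask_update(weight: torch.Tensor,
                     grad: torch.Tensor,
                     mask: torch.Tensor,
                     p: float) -> torch.Tensor:
    """One RigL-style prune/regrow step (illustrative sketch, not the authors' code):
    drop the smallest-magnitude active weights and re-activate the same number of
    inactive positions with the largest gradient magnitude, so the number of active
    weights stays constant."""
    active = mask.bool()
    n_update = int(p * active.sum().item())   # p-fraction of the active weights
    if n_update == 0:
        return mask

    # Prune: the n_update active weights with smallest |w|
    w_mag = weight.detach().abs().masked_fill(~active, float("inf"))
    prune_idx = torch.topk(w_mag.flatten(), n_update, largest=False).indices

    # Regrow: the n_update inactive positions with largest |dL/dw|
    g_mag = grad.abs().masked_fill(active, float("-inf"))
    grow_idx = torch.topk(g_mag.flatten(), n_update, largest=True).indices

    new_mask = mask.detach().clone().flatten()
    new_mask[prune_idx] = 0.0
    new_mask[grow_idx] = 1.0

    # Newly grown connections typically start from zero
    with torch.no_grad():
        weight.view(-1)[grow_idx] = 0.0

    return new_mask.view_as(mask)
```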
2. Adaptation of RigL for Diffusion Models
RigL-DM targets the denoising network within various DM architectures. In Latent Diffusion, RigL-DM applies a shared global mask to all U-Net weights, keeping the autoencoder frozen. For ChiroDiff, a mask is applied to every linear and GRU layer. RigL-DM's training objective remains the canonical DM forward loss,

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0, I)}\left[\left\| \epsilon - \epsilon_\theta(M \odot \theta;\, x_t, t) \right\|^2\right], \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.$$

Between mask updates, standard masked gradient descent using AdamW is performed. Mask updates (pruning and regrowth) are executed every $\Delta T$ gradient steps, with $\Delta T$ chosen separately for Latent Diffusion and ChiroDiff.
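As a minimal sketch of training against the masked objective above (assuming a PyTorch setup; `eps_model`, `masks`, `alphas_bar`, and `T_d` are illustrative names, not the paper's code):

```python
import torch
import torch.nn.functional as F

def masked_training_step(eps_model, optimizer, masks, x0, alphas_bar, T_d):
    """One masked diffusion training step (illustrative sketch).

    Masking the stored parameter values (theta <- M * theta) rather than
    multiplying inside the graph keeps dense gradients available at pruned
    positions, which the gradient-based regrowth criterion relies on.
    """
    # Apply the current mask so the forward pass sees M * theta
    with torch.no_grad():
        for name, param in eps_model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

    # Standard epsilon-prediction objective with DDPM-style forward noising
    t = torch.randint(1, T_d + 1, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t - 1].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    loss = F.mse_loss(eps_model(x_t, t), eps)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()          # AdamW step on all parameters

    # Re-apply the mask so weights pruned earlier stay exactly zero
    with torch.no_grad():
        for name, param in eps_model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
    return loss.item()
```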
3. Algorithmic Workflow and Mathematical Formulation
RigL-DM relies on component-wise masking and regular mask updates to maintain model sparsity while allowing the mask structure to adapt. Key operations include:
- Masked weights: $\tilde{\theta} = M \odot \theta$, where $\odot$ denotes element-wise multiplication.
- Pruning criterion: compute a threshold $\tau$ as the $p$-th quantile of the active magnitudes $|\theta_i|$ (positions with $M_i = 1$), and set $M_i = 0$ wherever $|\theta_i| \leq \tau$.
- Gradient-based regrowth: for all positions with $M_i = 0$, scores $|\partial \mathcal{L} / \partial \theta_i|$ are computed, and the top-scoring positions are activated through the mask update, matching the number of pruned weights.
- Sparsity schedule: Overall sparsity is typically kept constant; ramping up through warm-up is optional.
A summarized pseudocode of the workflow is as follows:
```
Initialize θ; initialize mask M0 with ERK at sparsity S
for each training iteration t:
    Sample batch x₀ ∼ 𝒟, timestep t ∼ Uniform[1, T_d], noise ε ∼ 𝒩(0, I)
    x_t ← √(ᾱ_t) x₀ + √(1 − ᾱ_t) ε
    Compute loss ℓ = ‖ε − ε_θ(M_{t−1} ∘ θ; x_t, t)‖²
    Update θ with AdamW on the masked gradients
    if t mod ΔT == 0:
        Prune the p-fraction of active weights with smallest magnitude
        Regrow the top p-fraction of zeroed positions by |∂ℓ/∂θ_i|
        Update mask M_t accordingly
```
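The same cadence can be written as a short training skeleton that reuses the hedged helper sketches from the earlier sections; all names here (`data_loader`, `delta_T`, `p`, `masks`, and the helpers themselves) are illustrative assumptions rather than the authors' code:

```python
# Illustrative skeleton tying the sketches above together (all names assumed).
for step, x0 in enumerate(data_loader):
    loss = masked_training_step(eps_model, optimizer, masks, x0, alphas_bar, T_d)

    if step > 0 and step % delta_T == 0:        # prune/regrow every ΔT steps
        for name, param in eps_model.named_parameters():
            if name in masks and param.grad is not None:
                masks[name] = rigl_mask_update(param, param.grad, masks[name], p)
```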
4. Comparative Empirical Results
RigL-DM was evaluated on six benchmarks encompassing both Latent Diffusion (CelebA-HQ, LSUN-Bedrooms, Imagenette) and ChiroDiff (QuickDraw, KanjiVG, VMNIST). Baselines include Dense, Static-DM (fixed ERK mask), and MagRan-DM (random regrowth plus magnitude pruning).
Summary Table: Empirical Outcomes (FID↓; Params× and Train-FLOPs× are fractions relative to the dense baseline; averaged over 3 runs)
| Dataset | Method | FID | Params× | FLOPs× |
|---|---|---|---|---|
| CelebA-HQ | Dense | 32.74 | 1.00 | 1.00 |
| CelebA-HQ | Static-DM | 33.19 | 0.50 | 0.68 |
| CelebA-HQ | MagRan-DM | 32.83 | 0.50 | 0.67 |
| CelebA-HQ | RigL-DM | 32.12 | 0.75 | 0.91 |
| LSUN-Bedrooms | Dense | 31.09 | 1.00 | 1.00 |
| LSUN-Bedrooms | Static-DM | 28.79 | 0.75 | 0.91 |
| LSUN-Bedrooms | MagRan-DM | 28.20 | 0.75 | 0.91 |
| LSUN-Bedrooms | RigL-DM | 37.80 | 0.90 | 0.97 |
RigL-DM typically matches or outperforms dense and static baselines in low- to medium-sparsity regimes (up to roughly $0.5$), e.g. achieving lower FID than dense on CelebA-HQ (32.12 vs. 32.74). Static-DM degrades rapidly at higher sparsity, while MagRan-DM can be competitive at extreme sparsity (notably on QuickDraw at 90%). Across all six benchmarks, sparsity up to $0.5$ is a safe and effective operating range.
5. Hyperparameter Tuning and Stability Considerations
The principal hyperparameters governing RigL-DM performance are the sparsity $S$, the update interval $\Delta T$, and the prune/regrowth fraction $p$ (a structural configuration sketch follows this list).
- Sparsity up to $0.5$ is reliably effective across all benchmarks.
- Moderate update intervals $\Delta T$ yield robust adaptation.
- At moderate sparsity, the prune/regrowth fraction $p$ strikes a balance between discovery of new structure and stable training; at high sparsity, reducing $p$ to $0.05$ or increasing $\Delta T$ prevents destabilization and may improve generalization.
- No dense pre-training is required; sparse-to-sparse training is executed from model initialization.
A conservative mask update at high sparsity enhances stability, avoiding sharp performance decay seen in static masking at similar parameter fractions.
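Purely as a structural illustration, these knobs can be grouped into a small configuration object; the class and field names below are hypothetical, and no default values from the paper are assumed:

```python
from dataclasses import dataclass

@dataclass
class RigLDMConfig:
    """Knobs governing RigL-DM mask dynamics (structure only; concrete values
    should be taken from the paper or tuned per dataset)."""
    sparsity: float        # target global sparsity S (fraction of zeroed weights)
    update_interval: int   # ΔT: gradient steps between prune/regrow updates
    prune_fraction: float  # p: fraction of active weights pruned and regrown per update
                           # (the text above suggests lowering p, e.g. to 0.05, at high sparsity)
```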
6. Implications, Applications, and Limitations
RigL-DM demonstrates that dynamic sparse training is feasible for both convolutional and recurrent DM backbones, enabling reduction of trainable parameter count and FLOPs by up to 75% and 50%, respectively, without compromising sample quality. This suggests significant potential for scaling DM architectures to resource-constrained or real-time applications, particularly where memory and compute efficiency are at a premium.
A plausible implication is that dynamic sparse-to-sparse training paradigms, exemplified by RigL-DM, may generalize to other generative architectures or structured predictions where dense parameterization is a limiting factor.
Limitations arise at extreme sparsity; although MagRan-DM may sometimes rival RigL-DM, performance can be dataset-dependent and is sensitive to hyperparameters.
7. Related Methodologies and Future Research Directions
RigL-DM draws from the Dynamic Sparse Training paradigm introduced by Evci et al. (“Rigging the Lottery: Making All Tickets Winners”), extending the approach to diffusion-based generative models (Oliveira et al., 30 Apr 2025). Alternative sparsity-inducing baselines, such as Static-DM and MagRan-DM, complement the evaluation landscape.
Future research directions include extending sparse-to-sparse training to conditional DMs, exploring architectural sparsity patterns, and benchmarking on larger or more diversely structured datasets. The effect of mask update strategies and initialization methods (e.g., ERK masks) on generalization and convergence warrants systematic study. The interaction between mask dynamics and optimizer adaptation, especially under high-sparsity regimes, remains an open technical question.
References
- Oliveira et al., “Sparse-to-Sparse Training of Diffusion Models,” 30 Apr 2025.
- Evci et al., “Rigging the Lottery: Making All Tickets Winners,” ICML 2020.