Sparse-to-Sparse Diffusion Model Training
- The paper introduces three training algorithms (Static-DM, RigL-DM, and MagRan-DM) that preserve sparsity during both training and inference to reduce FLOPs and memory usage.
- It systematically applies sparse-to-sparse techniques across diverse data domains such as images, motion, and graphs to demonstrate the approach's practical viability.
- Empirical evaluations show that moderate sparsity levels (up to 50%) can match or exceed dense fidelity while delivering significant parameter and computational savings.
Sparse-to-sparse training of diffusion models is a paradigm wherein both training and inference phases preserve explicit sparsity in parameters, activations, or data representations. This approach targets reductions in computational and memory overhead for generative models while often maintaining—or even exceeding—the sample quality delivered by dense architectures. Recent work has systematically established methodology for training diffusion models from scratch under sparsity constraints, developed efficient sparse-inference pipelines, and demonstrated practical sparsity for structured domains such as images, motion sequences, and graphs (Oliveira et al., 30 Apr 2025, Cassano et al., 23 Sep 2025, Wang et al., 16 Apr 2024, Bae et al., 18 Mar 2025, Qin et al., 2023).
1. Formalization of Sparsity in Diffusion Models
Standard diffusion models consist of a forward process, typically Gaussian noising for continuous data, and a learned reverse process parameterized by neural networks. Sparsity is introduced via a binary mask $m \in \{0,1\}^{d}$ over the $d$ parameters $\theta$, so that only the weights $m \odot \theta$ are active; the sparsity level $s$ is measured as the proportion of zeroed parameters:

$$s = 1 - \frac{\|m\|_0}{d}.$$
The active parameters restrict both training and inference computations: the FLOP count scales approximately with the fraction of active weights, i.e., $\text{FLOPs}_{\text{sparse}} \approx (1 - s)\,\text{FLOPs}_{\text{dense}}$ (Oliveira et al., 30 Apr 2025).
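As a minimal illustration (a PyTorch sketch assuming unstructured masking; the helper names are ours, not from the cited work):

```python
import torch

def make_random_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask with roughly `sparsity` fraction of zeros."""
    return (torch.rand_like(weight) >= sparsity).float()

# Sparsity level s = fraction of zeroed parameters; only m * w enters the compute.
w = torch.randn(256, 256)
m = make_random_mask(w, sparsity=0.5)
w_sparse = w * m
s = 1.0 - m.mean().item()

# FLOPs of the masked layer scale roughly with the active fraction (1 - s).
dense_flops = 2 * w.numel()                 # one multiply-accumulate per weight
sparse_flops_est = (1.0 - s) * dense_flops
print(f"sparsity ~ {s:.2f}, estimated FLOP fraction ~ {1.0 - s:.2f}")
```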
2. Sparse-to-Sparse Training Algorithms
Three principal sparse-to-sparse training algorithms have been established:
- Static-DM: A random sparse mask is generated at initialization using an Erdős–Rényi (ER) or Erdős–Rényi-Kernel (ERK) per-layer sparsity distribution. The mask remains unchanged throughout training; only the nonzero weights are updated, so training dynamics mirror the dense model with updates restricted to the active weights $m \odot \theta$.
- RigL-DM: The mask evolves during training through prune–grow cycles: at a fixed interval of training steps, a fraction of the lowest-magnitude active weights is pruned and an equal number of inactive weights is activated according to the largest gradient magnitudes, preserving the overall sparsity level. The prune–grow ratio is a critical hyperparameter; conservative (low) values prevent instabilities at high sparsity. (A minimal prune–grow sketch follows this list.)
- MagRan-DM: Similar to RigL-DM but with random selection for newly activated weights in the grow phase instead of using gradient scores.
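The following sketch shows one prune–grow mask update in the spirit of RigL-DM and MagRan-DM (PyTorch, illustrative names; the actual implementations may differ, e.g., in how grown weights are re-initialized):

```python
import torch

def prune_grow_step(weight, grad, mask, update_frac, use_gradient_growth=True):
    """One RigL/MagRan-style mask update: prune the smallest-magnitude active
    weights, then grow the same number of currently inactive connections."""
    n_active = int(mask.sum().item())
    k = int(update_frac * n_active)          # number of weights to prune and regrow
    if k == 0:
        return mask
    mask = mask.clone()

    # Prune: drop the k active weights with the smallest magnitude.
    active_scores = torch.where(mask.bool(), weight.abs(),
                                torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(active_scores.flatten(), k, largest=False).indices
    mask.view(-1)[drop_idx] = 0.0

    # Grow: activate k inactive weights, by gradient magnitude (RigL-DM)
    # or uniformly at random (MagRan-DM).
    if use_gradient_growth:
        grow_scores = torch.where(mask.bool(),
                                  torch.full_like(grad, -float("inf")), grad.abs())
        grow_idx = torch.topk(grow_scores.flatten(), k, largest=True).indices
    else:
        inactive = (mask.view(-1) == 0).nonzero(as_tuple=False).flatten()
        grow_idx = inactive[torch.randperm(inactive.numel())[:k]]
    mask.view(-1)[grow_idx] = 1.0

    # Overall sparsity is preserved: the same number of weights is pruned and grown.
    return mask
```

Newly grown weights are typically re-initialized to zero before training continues, and the update is applied only at the fixed interval so that overall sparsity stays constant between updates.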
Training updates only the active weights defined by the mask, yielding parameter and FLOP savings. Experiments have demonstrated that both static and dynamic sparse training match or outperform dense baselines at sparsity levels up to 50% (Oliveira et al., 30 Apr 2025).
3. Architectures and Data Domains
Sparse-to-sparse training has been systematically applied in multiple architectures and data contexts:
| Domain | Architecture | Sparse-to-Sparse Approach |
|---|---|---|
| Images | Latent Diffusion U-Net | Static-DM, RigL-DM, MagRan-DM |
| Motion | Transformer sMDM | Keyframe mask, sparse input attention |
| Graphs | Transformer GNN | Edge-list sparsity with query edges |
| Concept Unlearning | SD1.5 UNet + SAE | TopK sparse autoencoder + supervision |
This has allowed evaluation on classic vision datasets (CelebA-HQ, LSUN-Bedrooms, ImageNet), motion and sketch sequences (HumanML3D, QuickDraw), and large-scale graphs (QM9, SBM, Ego, Planar, Protein) (Oliveira et al., 30 Apr 2025, Bae et al., 18 Mar 2025, Qin et al., 2023, Cassano et al., 23 Sep 2025).
4. Loss Functions and Training Dynamics
The loss functions in sparse-to-sparse diffusion models retain the standard denoising objective, but are adapted to sparse conditioning and to the masked parameter subset:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\,t,\,\epsilon \sim \mathcal{N}(0, I)}\!\left[\big\|\epsilon - \epsilon_{m \odot \theta}(x_t, t)\big\|_2^2\right],$$

where $x_t$ is the noised sample at timestep $t$ and $\epsilon_{m \odot \theta}$ denotes the denoiser evaluated with only the active (unmasked) weights.
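A sketch of a single sparse training step under this objective is given below (PyTorch; the model signature and mask bookkeeping are assumptions, not the cited implementation):

```python
import torch
import torch.nn.functional as F

def masked_denoising_loss(model, masks, x0, alphas_cumprod):
    """Standard epsilon-prediction objective, evaluated with only the
    active (unmasked) weights participating in the forward pass."""
    # Re-apply the masks so pruned weights stay exactly zero.
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward (noising) process

    eps_pred = model(x_t, t)        # reverse model evaluated with sparse weights
    return F.mse_loss(eps_pred, noise)
```

After `loss.backward()`, the gradients of masked entries are zeroed (or the masks are re-applied after the optimizer step), so parameter updates never leave the active set.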
In dynamic sparse training (RigL-DM/MagRan-DM), periodic mask updates are performed at a fixed interval of training steps, and the prune–grow ratio controls the trade-off between exploration of new sparse directions and stability (Oliveira et al., 30 Apr 2025). For sparse-autoencoder-based concept unlearning (SAEmnesia), a multi-term loss is adopted:

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{conc}}\,\mathcal{L}_{\text{conc}} + \lambda_{\text{sparse}}\,\mathcal{L}_{\text{sparse}},$$

where $\mathcal{L}_{\text{rec}}$ promotes faithful reconstruction, $\mathcal{L}_{\text{conc}}$ binds concepts to specific latents, and $\mathcal{L}_{\text{sparse}}$ enforces activation sparsity (Cassano et al., 23 Sep 2025).
5. Empirical Performance and Ablations
Extensive empirical evaluation reveals that moderate sparsity (up to roughly 50%) typically yields optimal results. In some cases, sparse models surpass dense baselines in generative fidelity and downstream metrics:
Selected Results
| Model / Dataset | Dense | Sparse | Params / FLOPs | Metric and method |
|---|---|---|---|---|
| LD on CelebA-HQ | 32.74 | 32.12 | 75% params, 90% FLOPs | FID; RigL-DM |
| LD on LSUN-Bedrooms | 31.09 | 28.79 | 75% params, 91% FLOPs | FID; Static-DM (sparse better) |
| ChiroDiff on QuickDraw | 29.78 | 29.38 | 89% params, 89% FLOPs | FID; RigL-DM |
| SparseDiff on SBM | 56% | — | Linear memory/compute | Graph validity |
| SAEmnesia on UnlearnCanvas | 82.29% | 91.51% | 96.7% reduction in tuning cost | Concept unlearning |
Ablations on the prune–grow ratio show that low (conservative) values maintain fidelity, especially at high sparsity, whereas aggressive growing leads to degradation (Oliveira et al., 30 Apr 2025). For SparseDiff, edge-list-based sparse training achieves state-of-the-art results on large graphs while reducing memory requirements from quadratic in the number of nodes to linear in the number of edges (Qin et al., 2023). In motion diffusion, sparse keyframes yield a 25× reduction in attention FLOPs and outperform dense baselines on text-alignment and realism metrics (Bae et al., 18 Mar 2025).
6. Limitations and Future Directions
Current sparse-to-sparse training implementations run on hardware optimized for dense computation; realizing true runtime gains will require specialized sparse accelerators (NVIDIA 2:4, Cerebras CS-3, Neural Magic) (Oliveira et al., 30 Apr 2025). Structured sparsity patterns (e.g., N:M) present open research questions. In graph domains, quadratic sampling time remains necessary to fill the adjacency matrix at generation, although training scales linearly with edge count (Qin et al., 2023). Extending these paradigms to conditional generation (text, video, audio) and integrating them with dynamic prune–grow curricula or lottery-ticket methodologies remain high-value directions (Wang et al., 16 Apr 2024, Oliveira et al., 30 Apr 2025).
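For reference, N:M structured sparsity keeps at most N nonzero weights in every group of M consecutive weights; a minimal, illustrative projection onto a 2:4 pattern (not a vendor API) can be written as:

```python
import torch

def project_nm(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude entries in every group of m consecutive
    weights along the last dimension (N:M structured sparsity, e.g. 2:4)."""
    assert weight.shape[-1] % m == 0, "last dimension must be divisible by m"
    groups = weight.reshape(-1, m)
    keep = torch.topk(groups.abs(), n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(8, 16)
w_24 = project_nm(w)    # exact 50% sparsity in a hardware-friendly 2:4 layout
```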
7. Specialized Sparse Training: Concept Unlearning and Keyframe Diffusion
Sparse-to-sparse training has enabled advanced model manipulation and domain adaptation:
- SAEmnesia introduces a supervised TopK sparse autoencoder in cross-attention bottlenecks, enforcing one-to-one neuron–concept correspondences and reducing inference-time hyperparameter search complexity by 96.7% (Cassano et al., 23 Sep 2025); a minimal TopK sketch follows this list.
- Motion Diffusion with Sparse Keyframes: sMDM applies sparse attention and feature interpolation to keyframes selected by geometric algorithms, achieving higher fidelity, reduced compute, and adaptive sparse masks at inference (Bae et al., 18 Mar 2025).
- Graph Generation: SparseDiff leverages a sparsity-preserving forward noise kernel and a sparse denoising architecture, handling very large graphs with training cost that scales linearly in the number of edges (Qin et al., 2023).
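Below is a minimal sketch of a TopK sparse-autoencoder bottleneck of the kind SAEmnesia inserts into cross-attention (layer sizes, names, and the omitted supervision term are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSparseAutoencoder(nn.Module):
    """Autoencoder whose latent code keeps only the k largest activations,
    yielding a sparse, more interpretable representation per token."""
    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.k = k

    def forward(self, x):
        z = self.encoder(x)
        # TopK activation: zero out all but the k largest latent units per sample.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z_sparse), z_sparse

# Reconstruction term only; a supervised concept-binding term (as described above)
# would additionally tie designated latent units to designated concepts.
sae = TopKSparseAutoencoder(d_model=768, d_latent=4096, k=32)
x = torch.randn(4, 768)
x_hat, z = sae(x)
loss = F.mse_loss(x_hat, x)
```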
These frameworks validate the paradigm of training and deploying diffusion models entirely within the sparse regime, opening avenues for scalable, interpretable, and resource-efficient generative modeling.