Sparse-to-Sparse Training of Diffusion Models (2504.21380v1)

Published 30 Apr 2025 in cs.LG and cs.CV

Abstract: Diffusion models (DMs) are a powerful type of generative models that have achieved state-of-the-art results in various image synthesis tasks and have shown potential in other domains, such as natural language processing and temporal data modeling. Despite their stable training dynamics and ability to produce diverse high-quality samples, DMs are notorious for requiring significant computational resources, both in the training and inference stages. Previous work has focused mostly on increasing the efficiency of model inference. This paper introduces, for the first time, the paradigm of sparse-to-sparse training to DMs, with the aim of improving both training and inference efficiency. We focus on unconditional generation and train sparse DMs from scratch (Latent Diffusion and ChiroDiff) on six datasets using three different methods (Static-DM, RigL-DM, and MagRan-DM) to study the effect of sparsity in model performance. Our experiments show that sparse DMs are able to match and often outperform their Dense counterparts, while substantially reducing the number of trainable parameters and FLOPs. We also identify safe and effective values to perform sparse-to-sparse training of DMs.

Summary

  • The paper introduces sparse-to-sparse training for diffusion models, maintaining sparsity from initialization to significantly cut parameters and FLOPs.
  • It evaluates static and dynamic methods like MagRan-DM and RigL-DM, showing that dynamic approaches often outperform static sparsity at moderate levels.
  • Experimental results reveal that sparse models can match or exceed dense models in FID while dramatically reducing computational costs across different datasets.

Diffusion Models (DMs) are powerful generative models achieving state-of-the-art results across various domains, particularly image synthesis. However, they suffer from significant computational costs during both training and inference. While previous research has focused on accelerating inference, this paper explores improving efficiency in both stages by introducing sparse-to-sparse training for DMs.

Sparse-to-sparse training aims to train neural networks that maintain sparsity from initialization, reducing both parameter count and computational operations (FLOPs). The paper investigates applying this paradigm to DMs, focusing on unstructured sparsity, which allows arbitrary sparse connectivity patterns. Three methods are proposed and evaluated (a minimal code sketch of the initialization and update steps follows the list):

  1. Static-DM: A static sparse training method in which the connectivity pattern is fixed at initialization. Connections are pruned at random, with layer-wise sparsity ratios assigned by a modified Erdős–Rényi criterion (ERK) that accounts for layer size and convolutional kernel dimensions.
  2. MagRan-DM: A dynamic sparse training method. It initializes a sparse network and dynamically updates the connectivity during training using magnitude pruning (dropping low-magnitude weights) and random regrowth (adding connections randomly). This is akin to the Sparse Evolutionary Training (SET) algorithm.
  3. RigL-DM: Another dynamic sparse training method. It also uses magnitude pruning but employs a gradient-based regrowth strategy (adding connections with high gradient magnitude, inspired by RigL (2001.00831)).
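The sketch below illustrates these three ingredients: ERK-style layer-wise density allocation, random mask initialization (Static-DM), and a prune/regrow step with random (MagRan-DM) or gradient-based (RigL-DM) regrowth. It is a minimal PyTorch-style illustration under stated assumptions, not the authors' implementation; the helper names are hypothetical, and the ERK allocation is simplified (densities are clipped at 1 rather than re-balanced across layers).

```python
import torch

def erk_densities(shapes, density):
    """Per-layer keep ratios from an Erdos-Renyi-Kernel style score: a layer's
    density is proportional to (n_in + n_out + sum(kernel dims)) divided by its
    number of weights, scaled so the whole network keeps `density` of its
    parameters. Simplified: densities are clipped at 1 instead of re-balanced."""
    numels = [int(torch.tensor(s).prod()) for s in shapes]
    raw = [(s[0] + s[1] + sum(s[2:])) / n for s, n in zip(shapes, numels)]
    scale = density * sum(numels) / sum(r * n for r, n in zip(raw, numels))
    return [min(1.0, scale * r) for r in raw]

def random_masks(shapes, densities):
    """Static-DM style initialization: a fixed random binary mask per layer."""
    masks = []
    for shape, d in zip(shapes, densities):
        n = int(torch.tensor(shape).prod())
        flat = torch.zeros(n)
        flat[torch.randperm(n)[: int(round(d * n))]] = 1.0
        masks.append(flat.view(shape))
    return masks

def prune_and_regrow(weight, grad, mask, p, method="rigl"):
    """One dynamic-sparse update: drop the fraction p of lowest-magnitude active
    weights, then regrow the same number of inactive connections, either at
    random (MagRan-DM, SET-style) or where the dense gradient magnitude is
    largest (RigL-DM)."""
    active = mask.bool()
    n_update = int(p * int(active.sum()))
    if n_update == 0:
        return mask
    # Prune: lowest-magnitude weights among the currently active ones.
    scores = weight.abs().masked_fill(~active, float("inf")).flatten()
    drop_idx = torch.topk(scores, n_update, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0
    # Regrow among currently inactive positions.
    inactive = new_mask == 0.0
    if method == "rigl":
        grow_scores = grad.abs().flatten().masked_fill(~inactive, -float("inf"))
        grow_idx = torch.topk(grow_scores, n_update, largest=True).indices
    else:  # "magran": random regrowth
        candidates = inactive.nonzero().flatten()
        grow_idx = candidates[torch.randperm(candidates.numel())[:n_update]]
    new_mask[grow_idx] = 1.0
    return new_mask.view_as(mask)

# Example: masks for two conv kernels and one linear weight at 50% overall sparsity.
shapes = [(64, 3, 3, 3), (128, 64, 3, 3), (10, 128)]
masks = random_masks(shapes, erk_densities(shapes, density=0.5))
```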

The methods are evaluated on two different DM architectures for unconditional generation (a sketch of how masks are attached to a chosen submodule follows the list):

  • Latent Diffusion (2112.10752): Applied to image generation on LSUN-Bedrooms (1506.03365), CelebA-HQ, and Imagenette [Howard_Imagenette_2019]. Sparsity is applied only to the U-Net denoiser, keeping the autoencoder dense (Figure 1a).
  • ChiroDiff [das2023chirodiff]: Applied to sketch generation on KanjiVG, QuickDraw [ha2018a], and VMNIST [das2022sketchode]. Sparsity is applied throughout the entire Bidirectional GRU network (Figure 1b).
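To illustrate the selective masking, the following minimal sketch attaches binary masks only to the weight tensors of a chosen submodule (for Latent Diffusion, the U-Net denoiser; for ChiroDiff, the whole GRU), leaving everything else dense. The helper names and the `model.unet` attribute are hypothetical, and uniform random masks are used for brevity in place of the ERK allocation sketched above.

```python
import torch

def attach_masks(module, density):
    """Random binary mask for every conv/linear weight in `module`;
    biases and normalization parameters stay dense."""
    return {name: (torch.rand_like(p) < density).float()
            for name, p in module.named_parameters() if p.dim() >= 2}

@torch.no_grad()
def apply_masks(module, masks):
    """Zero out pruned weights, e.g. after every optimizer step."""
    for name, p in module.named_parameters():
        if name in masks:
            p.mul_(masks[name])

# Latent Diffusion: sparsify only the U-Net denoiser, keep the autoencoder dense.
# masks = attach_masks(model.unet, density=0.5)
# training loop: loss.backward(); optimizer.step(); apply_masks(model.unet, masks)
```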

Experiments are conducted across five sparsity levels (10%, 25%, 50%, 75%, and 90% of parameters removed). Performance is primarily measured with the Fréchet Inception Distance (FID) [Heusel2017GANsTB], while efficiency is measured by parameter count and FLOPs.
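As a point of reference, the sketch below shows one common way to compute FID using the torchmetrics implementation (requires the `torchmetrics[image]` extra). It is a generic illustration with placeholder tensors, not the paper's evaluation pipeline.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares InceptionV3 feature statistics of real and generated images;
# lower is better. With normalize=True the metric expects float images in [0, 1].
fid = FrechetInceptionDistance(feature=2048, normalize=True)

real_images = torch.rand(16, 3, 299, 299)       # placeholder batch of real samples
generated_images = torch.rand(16, 3, 299, 299)  # placeholder batch from the sparse DM

fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(float(fid.compute()))
```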

Key findings include:

  • Performance vs. Sparsity: For Latent Diffusion on image datasets, sparse models can match or outperform dense models up to 50-75% sparsity. Performance significantly degrades at 90% sparsity. On average, dynamic methods (MagRan-DM, RigL-DM) often perform better than Static-DM, especially at moderate sparsity levels (e.g., 25-50%).
  • ChiroDiff Results: ChiroDiff on QuickDraw demonstrates remarkable resilience to sparsity, with the MagRan-DM model at 90% sparsity outperforming the dense baseline, suggesting overparameterization in the dense model. On KanjiVG and VMNIST, performance decreases with increasing sparsity, but sparse models can still achieve comparable results to the dense baseline at certain sparsity levels (e.g., 25-50%).
  • Efficiency Gains: The top-performing sparse models consistently achieve substantial reductions in parameters and FLOPs compared to their dense counterparts. For example, on CelebA-HQ, a RigL-DM model (2001.00831) with 25% sparsity slightly outperforms the dense model while reducing parameters by 25% and FLOPs by 9%. Even greater reductions are possible if a slight performance drop is acceptable (e.g., 50% parameter and 30% FLOP reduction with comparable FID). The most extreme example is ChiroDiff on QuickDraw, where MagRan-DM at 90% sparsity achieves a similar or better FID while reducing parameters and FLOPs by 88%.
  • Impact of Prune/Regrowth Rate: The paper explores the effect of the prune-and-regrowth ratio p for the dynamic methods (see the sketch after this list). While the initial experiments used p = 0.5, subsequent analysis showed that a more conservative rate (p = 0.05) consistently improves the performance of the dynamic methods, particularly in the higher sparsity regimes (75% and 90%), often enabling them to outperform Static-DM. This highlights the importance of careful hyperparameter tuning for dynamic sparse training in DMs.
  • Impact of Diffusion Steps: The relative performance of sparse vs. dense models remains largely consistent across different numbers of sampling steps. A sparse model that outperforms its dense counterpart at 100 steps will likely do so at 50 or 200 steps as well. In some cases, a highly sparse model with more sampling steps can achieve better quality than a dense model with fewer steps, potentially offering a trade-off where increased sampling cost in a much smaller network is favorable.
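To make the role of the ratio p concrete: in the dynamic methods it sets the fraction of active connections that are pruned and regrown at each mask update (it is the `p` argument of the `prune_and_regrow` sketch above). The snippet below shows a cosine-annealed update fraction as in the original RigL recipe; treat the annealing as an assumption, since this summary does not state whether the paper decays p, and the `update_fraction` helper is hypothetical.

```python
import math

def update_fraction(p, step, total_steps, anneal=True):
    """Fraction of active weights to prune and regrow at a given mask-update
    step. The cosine annealing follows the original RigL recipe; a constant
    rate corresponds to anneal=False."""
    if not anneal:
        return p
    return p / 2.0 * (1.0 + math.cos(math.pi * step / total_steps))

# Conservative rate found to work well at high sparsity in the paper: p = 0.05.
# e.g. prune_and_regrow(weight, grad, mask, update_fraction(0.05, step, total_steps))
```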

In conclusion, the paper successfully demonstrates the applicability and benefits of sparse-to-sparse training for Diffusion Models. The findings show that training sparse DMs from scratch can significantly reduce computational resources and memory requirements while maintaining or even improving generative performance across different modalities and datasets. The choice between static and dynamic methods and the tuning of dynamic training parameters (like the prune/regrowth rate) are shown to be important factors for achieving optimal results. The research indicates that sparse-to-sparse training is a promising direction for developing more efficient and accessible DMs in the future, pending wider hardware support for unstructured sparsity.

The code and trained models are planned to be released upon publication.