ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection (2310.13545v1)

Published 20 Oct 2023 in cs.CV and cs.AI

Abstract: In diffusion models, UNet is the most popular network backbone, since its long skip connects (LSCs) to connect distant network blocks can aggregate long-distant information and alleviate vanishing gradient. Unfortunately, UNet often suffers from unstable training in diffusion models which can be alleviated by scaling its LSC coefficients smaller. However, theoretical understandings of the instability of UNet in diffusion models and also the performance improvement of LSC scaling remain absent yet. To solve this issue, we theoretically show that the coefficients of LSCs in UNet have big effects on the stableness of the forward and backward propagation and robustness of UNet. Specifically, the hidden feature and gradient of UNet at any layer can oscillate and their oscillation ranges are actually large which explains the instability of UNet training. Moreover, UNet is also provably sensitive to perturbed input, and predicts an output distant from the desired output, yielding oscillatory loss and thus oscillatory gradient. Besides, we also observe the theoretical benefits of the LSC coefficient scaling of UNet in the stableness of hidden features and gradient and also robustness. Finally, inspired by our theory, we propose an effective coefficient scaling framework ScaleLong that scales the coefficients of LSC in UNet and better improves the training stability of UNet. Experimental results on four famous datasets show that our methods are superior to stabilize training and yield about 1.5x training acceleration on different diffusion models with UNet or UViT backbones. Code: https://github.com/sail-sg/ScaleLong


Summary

  • The paper demonstrates that scaling long skip connections in UNet effectively stabilizes both forward and backward propagations, reducing feature oscillations.
  • It introduces two strategies—constant scaling and learnable scaling—that adaptively mitigate training instabilities and enhance robustness to noisy inputs.
  • Experimental results show that ScaleLong accelerates training by about 1.5x and improves performance metrics such as FID, offering practical benefits for generative modeling.

ScaleLong: Advancements in Stabilizing Diffusion Model Training

The paper "ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection" presents a theoretical and empirical investigation into stabilizing the training of diffusion models, specifically targeting the widely used UNet backbone. Diffusion models have seen broad adoption in generative modeling owing to their strong performance in modeling realistic data distributions. However, training them is often unstable, partly because the denoising network must handle inputs perturbed by widely varying levels of noise. The authors propose a theoretical framework and associated methods for scaling the long skip connections (LSCs) of UNet to mitigate these instabilities.

Theoretical Contributions

The paper provides a thorough theoretical analysis of the stability issues that arise when the UNet architecture is used in diffusion models. These issues are attributed primarily to oscillations in the forward and backward propagation and to the model's sensitivity to noisy input; a schematic view of how the LSC coefficients enter the network follows the list:

  1. Forward Propagation Stability: The authors prove that the norm of hidden features in UNet is affected significantly by the scaling coefficients of LSCs. The oscillation range of these hidden features can lead to unstable training if not controlled. The standard UNet, with all LSC scaling coefficients set to one, permits large feature oscillations, identified as the primary source of instability.
  2. Backward Propagation Stability: The paper shows that the magnitude of parameter updates is governed by the gradient norm, which can likewise oscillate over a wide range and destabilize training. This range is again tied to the scaling coefficients of the LSCs, and appropriately reduced coefficients moderate the effect.
  3. Robustness to Noisy Input: Robustness to perturbed input is crucial, since diffusion training feeds the network inputs corrupted by widely varying levels of noise. The paper derives robustness bounds on UNet's output sensitivity to such perturbations, showing that appropriately scaled LSC coefficients make the model more resilient and its loss and gradients less oscillatory.
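
To make the role of these coefficients concrete, the schematic below shows one common way a scaled long skip connection enters a UNet decoder stage (the merge may be concatenation or addition depending on the backbone). The notation, including the encoder/decoder features h_i^enc and h_i^dec, the decoder block b_i, and the coefficients κ_i, is introduced purely for illustration and is not the paper's exact formalization.

```latex
% Schematic only: decoder stage i merges the upsampled deeper feature with an
% encoder feature carried by a long skip connection (LSC), rescaled by kappa_i.
\[
  h_i^{\mathrm{dec}}
  = b_i\!\left(\operatorname{Up}\!\left(h_{i+1}^{\mathrm{dec}}\right),\;
               \kappa_i \, h_i^{\mathrm{enc}}\right),
  \qquad i = 1, \dots, N.
\]
% The standard UNet corresponds to kappa_i = 1 for every skip. The paper's
% analysis argues that the bounds on hidden-feature norms, gradient norms, and
% input sensitivity all grow with the kappa_i, so choosing kappa_i < 1 tightens
% these bounds and stabilizes training.
```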

Proposed Solutions

The paper introduces the "ScaleLong" framework, advocating two specific strategies:

  1. Constant Scaling (CS): By setting LSC coefficients to decay exponentially, the authors propose a straightforward method to enhance stability across different network depths and training scenarios. Theoretical analysis indicates that this method results in significantly reduced oscillation bounds, thereby stabilizing both forward and backward pass operations.
  2. Learnable Scaling (LS): The scaling coefficients are predicted by a small auxiliary network, allowing them to adapt to the data and to training dynamics. This offers more flexible stability improvements than the constant scaling approach and provides an additional performance gain; a minimal code sketch of both strategies follows this list.
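
The PyTorch sketch below illustrates how these two strategies could be attached to a UNet's long skip connections. It is a minimal sketch under stated assumptions: the decay base kappa = 0.7, the SE-style calibration network used for LS, and the module and argument names are illustrative choices rather than the authors' implementation (their official code is linked in the abstract).

```python
# Minimal, illustrative PyTorch sketch of the two ScaleLong-style strategies.
# Hyperparameters (kappa=0.7, reduction=16) and module names are assumptions
# made for this sketch, not the authors' exact implementation.
import torch
import torch.nn as nn


class ConstantScaling(nn.Module):
    """CS: the i-th long skip connection is multiplied by kappa**i (kappa < 1)."""

    def __init__(self, index: int, kappa: float = 0.7):
        super().__init__()
        self.scale = kappa ** index  # constant, decaying exponentially with skip index

    def forward(self, skip: torch.Tensor) -> torch.Tensor:
        return self.scale * skip


class LearnableScaling(nn.Module):
    """LS: a tiny calibration network predicts per-channel scales for the skip."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.calibrate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # global context per channel
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),                          # scales in (0, 1)
        )

    def forward(self, skip: torch.Tensor) -> torch.Tensor:
        return self.calibrate(skip) * skip         # broadcast over H, W


if __name__ == "__main__":
    # Usage inside a decoder stage: scale the encoder feature before merging
    # it with the upsampled decoder feature (concatenation shown here).
    skip = torch.randn(2, 64, 32, 32)       # encoder feature from the LSC
    upsampled = torch.randn(2, 64, 32, 32)  # upsampled decoder feature
    scaled = LearnableScaling(channels=64)(skip)
    merged = torch.cat([upsampled, scaled], dim=1)
    print(merged.shape)  # torch.Size([2, 128, 32, 32])
```

A design note on the trade-off: CS adds no parameters at all, while LS adds only a small calibration module on top of the backbone, which is what lets its coefficients adapt over the course of training.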

Experimental Insights

The empirical results corroborate the theoretical analysis, demonstrating clear improvements in training stability and speed across multiple datasets and model settings. The proposed scaling methods, especially learnable scaling, reduce training time (roughly 1.5x faster training in several scenarios) while improving generation quality metrics such as the Fréchet Inception Distance (FID).

Further, the paper benchmarks the robustness of the proposed solutions under various experimental conditions, such as different batch sizes and model depths, with the ScaleLong framework consistently outperforming standard UNet configurations as well as other scaling approaches like the 1/√2 coefficient scaling.
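
For context, the 1/√2 baseline applies the same constant to every long skip connection, whereas constant scaling lets the coefficient decay with the skip index; in the illustrative notation used earlier (with κ^i as one natural instantiation of exponential decay):

```latex
\[
  \text{baseline: } \kappa_i = \tfrac{1}{\sqrt{2}} \;\; \text{for all } i,
  \qquad
  \text{CS: } \kappa_i = \kappa^{\,i} \;\; \text{with } 0 < \kappa < 1.
\]
```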

Implications and Future Directions

This research contributes to both the practical and theoretical understanding of training generative diffusion models. By providing a concrete framework and set of tools (ScaleLong) to address training instabilities, the work extends what is achievable with diffusion models in generative tasks. There remains potential for further optimization, particularly in fine-tuning the learnable scaling parameters and in exploring broader applications to other model architectures. Overall, the work lays solid groundwork for the design of stable, efficient generative models through targeted architectural modifications.
