ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection (2310.13545v1)

Published 20 Oct 2023 in cs.CV and cs.AI

Abstract: In diffusion models, UNet is the most popular network backbone, since its long skip connects (LSCs) to connect distant network blocks can aggregate long-distant information and alleviate vanishing gradient. Unfortunately, UNet often suffers from unstable training in diffusion models which can be alleviated by scaling its LSC coefficients smaller. However, theoretical understandings of the instability of UNet in diffusion models and also the performance improvement of LSC scaling remain absent yet. To solve this issue, we theoretically show that the coefficients of LSCs in UNet have big effects on the stableness of the forward and backward propagation and robustness of UNet. Specifically, the hidden feature and gradient of UNet at any layer can oscillate and their oscillation ranges are actually large which explains the instability of UNet training. Moreover, UNet is also provably sensitive to perturbed input, and predicts an output distant from the desired output, yielding oscillatory loss and thus oscillatory gradient. Besides, we also observe the theoretical benefits of the LSC coefficient scaling of UNet in the stableness of hidden features and gradient and also robustness. Finally, inspired by our theory, we propose an effective coefficient scaling framework ScaleLong that scales the coefficients of LSC in UNet and better improves the training stability of UNet. Experimental results on four famous datasets show that our methods are superior to stabilize training and yield about 1.5x training acceleration on different diffusion models with UNet or UViT backbones. Code: https://github.com/sail-sg/ScaleLong


Summary

  • The paper demonstrates that scaling long skip connections in UNet effectively stabilizes both forward and backward propagations, reducing feature oscillations.
  • It introduces two strategies—constant scaling and learnable scaling—that adaptively mitigate training instabilities and enhance robustness to noisy inputs.
  • Experimental results show that ScaleLong accelerates training by about 1.5x and improves performance metrics such as FID, offering practical benefits for generative modeling.

ScaleLong: Advancements in Stabilizing Diffusion Model Training

The paper "ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection" presents a theoretical and empirical investigation into stabilizing the training of diffusion models, specifically targeting the widely used UNet backbone. Diffusion models have seen broad adoption in generative modeling owing to their strong performance in modeling realistic data distributions. However, training them is often unstable, partly because the denoising network must handle inputs perturbed by widely varying levels of noise. The authors propose a theoretical framework and associated methods for scaling the long skip connections (LSCs) of UNet to mitigate these instabilities.

Theoretical Contributions

The paper provides a thorough theoretical analysis of the stability issues that arise when the UNet architecture is used in diffusion models. These issues are attributed primarily to oscillations in the forward and backward propagation and to the model's sensitivity to noisy input; a schematic view of how the LSC coefficients enter the network follows the list:

  1. Forward Propagation Stability: The authors prove that the norm of hidden features in UNet is affected significantly by the scaling coefficients of LSCs. The oscillation range of these hidden features can lead to unstable training if not controlled. The standard UNet, with all LSC scaling coefficients set to one, permits large feature oscillations, identified as the primary source of instability.
  2. Backward Propagation Stability: The paper shows that the magnitude of parameter updates is governed by the gradient norm, which can likewise oscillate over a wide range and destabilize training. This range is again tied to the scaling coefficients of the LSCs, and appropriately reduced coefficients moderate the effect.
  3. Robustness to Noisy Input: Robustness to perturbed input is crucial, since diffusion training feeds the network inputs corrupted by widely varying levels of noise. The paper derives robustness bounds on UNet's output sensitivity to such perturbations, showing that appropriately scaled LSC coefficients make the model more resilient and its loss and gradients less oscillatory.
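
To make the role of these coefficients concrete, the schematic below shows one common way a scaled long skip connection enters a UNet decoder stage (the merge may be concatenation or addition depending on the backbone). The notation, including the encoder/decoder features h_i^enc and h_i^dec, the decoder block b_i, and the coefficients κ_i, is introduced purely for illustration and is not the paper's exact formalization.

```latex
% Schematic only: decoder stage i merges the upsampled deeper feature with an
% encoder feature carried by a long skip connection (LSC), rescaled by kappa_i.
\[
  h_i^{\mathrm{dec}}
  = b_i\!\left(\operatorname{Up}\!\left(h_{i+1}^{\mathrm{dec}}\right),\;
               \kappa_i \, h_i^{\mathrm{enc}}\right),
  \qquad i = 1, \dots, N.
\]
% The standard UNet corresponds to kappa_i = 1 for every skip. The paper's
% analysis argues that the bounds on hidden-feature norms, gradient norms, and
% input sensitivity all grow with the kappa_i, so choosing kappa_i < 1 tightens
% these bounds and stabilizes training.
```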

Proposed Solutions

The paper introduces the "ScaleLong" framework, advocating two specific strategies:

  1. Constant Scaling (CS): By setting LSC coefficients to decay exponentially, the authors propose a straightforward method to enhance stability across different network depths and training scenarios. Theoretical analysis indicates that this method results in significantly reduced oscillation bounds, thereby stabilizing both forward and backward pass operations.
  2. Learnable Scaling (LS): The scaling coefficients are predicted by a small auxiliary network, allowing them to adapt to the data and to training dynamics. This offers more flexible stability improvements than the constant scaling approach and provides an additional performance gain; a minimal code sketch of both strategies follows this list.
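
The PyTorch sketch below illustrates how these two strategies could be attached to a UNet's long skip connections. It is a minimal sketch under stated assumptions: the decay base kappa = 0.7, the SE-style calibration network used for LS, and the module and argument names are illustrative choices rather than the authors' implementation (their official code is linked in the abstract).

```python
# Minimal, illustrative PyTorch sketch of the two ScaleLong-style strategies.
# Hyperparameters (kappa=0.7, reduction=16) and module names are assumptions
# made for this sketch, not the authors' exact implementation.
import torch
import torch.nn as nn


class ConstantScaling(nn.Module):
    """CS: the i-th long skip connection is multiplied by kappa**i (kappa < 1)."""

    def __init__(self, index: int, kappa: float = 0.7):
        super().__init__()
        self.scale = kappa ** index  # constant, decaying exponentially with skip index

    def forward(self, skip: torch.Tensor) -> torch.Tensor:
        return self.scale * skip


class LearnableScaling(nn.Module):
    """LS: a tiny calibration network predicts per-channel scales for the skip."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.calibrate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # global context per channel
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),                          # scales in (0, 1)
        )

    def forward(self, skip: torch.Tensor) -> torch.Tensor:
        return self.calibrate(skip) * skip         # broadcast over H, W


if __name__ == "__main__":
    # Usage inside a decoder stage: scale the encoder feature before merging
    # it with the upsampled decoder feature (concatenation shown here).
    skip = torch.randn(2, 64, 32, 32)       # encoder feature from the LSC
    upsampled = torch.randn(2, 64, 32, 32)  # upsampled decoder feature
    scaled = LearnableScaling(channels=64)(skip)
    merged = torch.cat([upsampled, scaled], dim=1)
    print(merged.shape)  # torch.Size([2, 128, 32, 32])
```

A design note on the trade-off: CS adds no parameters at all, while LS adds only a small calibration module on top of the backbone, which is what lets its coefficients adapt over the course of training.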

Experimental Insights

The empirical results corroborate the theoretical analysis, demonstrating clear improvements in training stability and speed across multiple datasets and model settings. The proposed scaling methods, especially learnable scaling, reduce training time (roughly 1.5x faster training in several scenarios) while improving generation quality metrics such as the Fréchet Inception Distance (FID).

Further, the paper benchmarks the robustness of the proposed solutions under various experimental conditions, such as different batch sizes and model depths, with the ScaleLong framework consistently outperforming standard UNet configurations as well as other scaling approaches like the 1/√2 coefficient scaling.
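
For context, the 1/√2 baseline applies the same constant to every long skip connection, whereas constant scaling lets the coefficient decay with the skip index; in the illustrative notation used earlier (with κ^i as one natural instantiation of exponential decay):

```latex
\[
  \text{baseline: } \kappa_i = \tfrac{1}{\sqrt{2}} \;\; \text{for all } i,
  \qquad
  \text{CS: } \kappa_i = \kappa^{\,i} \;\; \text{with } 0 < \kappa < 1.
\]
```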

Implications and Future Directions

This research contributes to both the practical and theoretical understanding of training generative diffusion models. By providing a concrete framework and set of tools (ScaleLong) to address training instabilities, the work extends what is achievable with diffusion models in generative tasks. There remains potential for further optimization, particularly in fine-tuning the learnable scaling parameters and in exploring broader applications to other model architectures. Overall, the work lays solid groundwork for the design of stable, efficient generative models through targeted architectural modifications.
