ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection (2310.13545v1)
Abstract: In diffusion models, UNet is the most popular network backbone, since its long skip connections (LSCs), which connect distant network blocks, can aggregate long-range information and alleviate vanishing gradients. Unfortunately, UNet often suffers from unstable training in diffusion models, which can be alleviated by scaling its LSC coefficients to smaller values. However, theoretical understanding of the instability of UNet in diffusion models, and of the performance improvement brought by LSC scaling, is still lacking. To address this, we theoretically show that the coefficients of the LSCs in UNet have a large effect on the stability of forward and backward propagation and on the robustness of UNet. Specifically, the hidden features and gradients of UNet at any layer can oscillate, and their oscillation ranges are in fact large, which explains the instability of UNet training. Moreover, UNet is provably sensitive to perturbed inputs and can predict an output far from the desired one, yielding an oscillatory loss and thus oscillatory gradients. We further establish the theoretical benefits of scaling the LSC coefficients of UNet for the stability of hidden features and gradients, as well as for robustness. Finally, inspired by our theory, we propose an effective coefficient scaling framework, ScaleLong, that scales the coefficients of the LSCs in UNet and thereby improves its training stability. Experimental results on four well-known datasets show that our methods are superior in stabilizing training and yield about 1.5× training acceleration on diffusion models with UNet or UViT backbones. Code: https://github.com/sail-sg/ScaleLong
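To make the mechanism concrete, below is a minimal PyTorch sketch of a toy UNet-style network whose long-skip features are multiplied by per-connection coefficients before being merged into the decoder. The exponentially decaying schedule `kappa ** (i + 1)`, the value of `kappa`, and all module names here are illustrative assumptions rather than the paper's exact ScaleLong rule; see the linked repository for the authors' implementation.

```python
# Minimal sketch of scaling long-skip-connection (LSC) coefficients in a
# toy UNet-style backbone. The exponential schedule kappa ** (i + 1) is an
# assumption for illustration; the ScaleLong repo defines the exact rule.
import torch
import torch.nn as nn


class ToyScaledUNet(nn.Module):
    def __init__(self, channels=64, depth=3, kappa=0.7):
        super().__init__()
        # Coefficient for the i-th (outermost = 0) long skip connection;
        # deeper skips are damped more under this assumed schedule.
        self.skip_scales = [kappa ** (i + 1) for i in range(depth)]
        self.encoders = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(depth)])
        self.mid = nn.Conv2d(channels, channels, 3, padding=1)
        # Each decoder consumes the concatenation of the upsampled feature
        # and its scaled skip feature, hence 2 * channels input channels.
        self.decoders = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, 3, padding=1) for _ in range(depth)])

    def forward(self, x):
        skips = []
        h = x
        for i, enc in enumerate(self.encoders):
            h = torch.relu(enc(h))
            # Scale the feature routed through the long skip connection.
            skips.append(self.skip_scales[i] * h)
        h = torch.relu(self.mid(h))
        for dec in self.decoders:
            # Pop in reverse order: the innermost (most-damped) skip
            # pairs with the first decoder stage.
            h = torch.relu(dec(torch.cat([h, skips.pop()], dim=1)))
        return h


x = torch.randn(2, 64, 32, 32)
print(ToyScaledUNet()(x).shape)  # torch.Size([2, 64, 32, 32])
```

Note that the scaling is applied to the skip branch rather than the main trunk: damping the directly forwarded encoder features is the knob the paper's analysis identifies as controlling the oscillation range of hidden features and gradients.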