- The paper identifies that common diffusion noise schedules fail to enforce a zero terminal signal-to-noise ratio (SNR), limiting the range of brightness in generated images.
- It proposes adjustments to noise schedules, prediction targets, and sampling steps to better align training with inference.
- Empirical tests on Stable Diffusion show greater brightness diversity and improved image fidelity, reflected in a lower Fréchet Inception Distance (FID) and a higher Inception Score (IS).
Introduction
In the field of generative AI, diffusion models have gained significant attention for their ability to produce diverse, high-quality samples, especially images. Among these models, Stable Diffusion stands out as a widely referenced and utilized open-source model. Despite its success, however, it struggles to generate images across the full spectrum of brightness, and in particular has difficulty producing very bright or very dark scenes.
The Flaws in Diffusion Models
A recent examination of diffusion models, focusing on their noise schedules and sampling steps, has uncovered a fundamental flaw: these models, including Stable Diffusion, commonly do not enforce a zero signal-to-noise ratio (SNR) at the final timestep of the forward diffusion process. As a result, the noisiest training input still leaks a faint trace of the original image, most notably its mean brightness, whereas inference starts from pure Gaussian noise that carries no such signal. This mismatch between training and inference constrains the model to produce images of predominantly medium brightness and prevents it from correctly following prompts that imply very bright or very dark scenes.
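As a quick illustration, the terminal SNR of a schedule can be checked directly. The sketch below, in PyTorch, computes it for a scaled-linear beta schedule; the 1,000 steps and beta range are assumed to match the standard Stable Diffusion configuration rather than being taken from the paper.

```python
import torch

# Scaled-linear beta schedule (assumed Stable Diffusion defaults: 1000 steps,
# beta from 0.00085 to 0.012, linear in sqrt(beta), then squared).
num_timesteps = 1000
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, num_timesteps) ** 2

alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t); ideally SNR at the final timestep is 0.
terminal_snr = alphas_cumprod[-1] / (1.0 - alphas_cumprod[-1])
print(f"terminal SNR: {terminal_snr.item():.6f}")  # small, but not zero
```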
Proposed Solutions
The paper outlines several corrective measures to realign diffusion models with their theoretical blueprint. The suggestions are as follows:
- Rescale the noise schedule so the terminal SNR is exactly zero, meaning the final timestep corresponds to pure Gaussian noise with no residual signal (see the schedule-rescaling sketch after this list).
- Keep the variance-preserving formulation but switch the training target from the noise ϵ to the velocity v, since at zero terminal SNR the fully noised input contains no information about the image and ϵ-prediction at that step becomes uninformative (a target sketch follows this list).
- Change how sampling steps are selected so that inference always starts from the final timestep; this matters most when only a few sampling steps are used at inference for efficiency (see the trailing-spacing sketch after this list).
- Rescale classifier-free guidance to prevent over-exposed images, a problem that becomes pronounced as the terminal SNR is brought to zero (see the guidance-rescaling sketch after this list).
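For the first bullet, here is a minimal sketch of one way to rescale an existing beta schedule so that the square root of the cumulative alpha product reaches exactly zero at the last timestep; the function name and implementation details are illustrative rather than quoted from the paper.

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so that sqrt(alpha_bar_T) == 0 (zero terminal SNR)."""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    alphas_bar_sqrt = alphas_bar.sqrt()

    # Shift so the last value becomes 0, then scale so the first value is preserved.
    first = alphas_bar_sqrt[0].clone()
    last = alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)

    # Convert back to per-step betas.
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas
```

After this rescaling the last beta equals 1, so the final forward step destroys all remaining signal and the terminal SNR is exactly zero.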
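For the v-prediction bullet, a sketch of how the velocity target can be formed from the clean image and the sampled noise, following the standard definition v = sqrt(ᾱ_t)·ϵ − sqrt(1−ᾱ_t)·x₀; the tensor shapes and names are assumptions for illustration.

```python
import torch

def v_prediction_target(x0: torch.Tensor, noise: torch.Tensor,
                        alphas_cumprod: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Velocity target v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0."""
    signal = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)       # sqrt(alpha_bar_t)
    sigma = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)  # sqrt(1 - alpha_bar_t)
    return signal * noise - sigma * x0

# Training sketch: the network is optimized to predict v instead of the noise, e.g.
# loss = torch.nn.functional.mse_loss(model(x_t, t, cond),
#                                     v_prediction_target(x0, noise, alphas_cumprod, t))
```

At the final timestep with zero SNR the target reduces to −x₀, so it stays informative even though the model's input is pure noise, whereas the ϵ target there would simply equal that input.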
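For the sample-step bullet, a sketch of a "trailing" step selection that always includes the final training timestep; the 1,000-step training schedule and 0-indexing convention are assumptions.

```python
import numpy as np

def trailing_timesteps(num_train_steps: int = 1000, num_inference_steps: int = 10) -> np.ndarray:
    """Space inference steps backwards from the final timestep ("trailing" selection)."""
    step = num_train_steps // num_inference_steps
    return np.arange(num_train_steps, 0, -step) - 1  # 0-indexed timesteps

print(trailing_timesteps())  # [999 899 ... 99]: the last training timestep is always included
```

By contrast, a spacing such as np.arange(0, num_train_steps, step) never includes the final timestep, so few-step inference never begins from the pure-noise step the schedule is meant to end on.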
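For the guidance bullet, a sketch of rescaling the classifier-free guidance output so its per-sample standard deviation stays close to that of the conditional prediction, counteracting the over-exposure effect; the guidance scale and blend factor below are illustrative defaults, not values prescribed by this summary.

```python
import torch

def rescaled_cfg(pred_cond: torch.Tensor, pred_uncond: torch.Tensor,
                 guidance_scale: float = 7.5, rescale: float = 0.7) -> torch.Tensor:
    """Classifier-free guidance with a rescaling step to curb over-exposure."""
    cfg = pred_uncond + guidance_scale * (pred_cond - pred_uncond)

    # Match the guided output's standard deviation to the conditional prediction,
    # then blend with the unrescaled output (the blend factor is a tunable assumption).
    dims = tuple(range(1, pred_cond.ndim))
    std_cond = pred_cond.std(dim=dims, keepdim=True)
    std_cfg = cfg.std(dim=dims, keepdim=True)
    rescaled = cfg * (std_cond / std_cfg)
    return rescale * rescaled + (1.0 - rescale) * cfg
```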
Evaluation and Implications
Further training of the Stable Diffusion model with these adjustments yielded promising results: the corrected model generates images spanning a much wider brightness range and adheres more faithfully to prompts that call for very bright or very dark scenes. Quantitative metrics, the Fréchet Inception Distance and the Inception Score, also improved, indicating that the generated images match the target data distribution more closely.
The findings presented in this paper call for greater care in designing noise schedules and samplers, and encourage adopting these adjustments in future diffusion models. By enforcing consistent behavior between training and inference and ensuring a zero terminal SNR, generative models can better capture the diversity and dynamics of real-world data distributions.