- The paper identifies that common diffusion noise schedules fail to enforce a zero terminal signal-to-noise ratio (SNR), limiting the range of brightness in generated images.
- It proposes adjustments to noise schedules, prediction targets, and sampling steps to better align training with inference.
- Empirical tests on Stable Diffusion show greater brightness diversity and improved image fidelity, reflected in a lower Fréchet Inception Distance (FID) and a higher Inception Score (IS).
Introduction
In the field of generative AI, diffusion models have gained significant attention for their ability to produce diverse, high-quality samples, especially images. Among these models, Stable Diffusion stands out as a widely referenced and utilized open-source model. Despite its success, however, it struggles to generate images across the full spectrum of brightness, and in particular has difficulty producing very bright or very dark scenes.
The Flaws in Diffusion Models
A recent examination of diffusion models, focusing on their noise schedules and sampling steps, has uncovered a fundamental flaw: these models, including Stable Diffusion, commonly do not enforce a zero signal-to-noise ratio (SNR) at the final timestep of the forward diffusion process. As a result, the noisiest training input still leaks a faint trace of the original image, most notably its mean brightness, whereas inference starts from pure Gaussian noise that carries no such signal. This mismatch between training and inference constrains the model to produce images of predominantly medium brightness and prevents it from correctly following prompts that imply very bright or very dark scenes.
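As a quick illustration, the terminal SNR of a schedule can be checked directly. The sketch below, in PyTorch, computes it for a scaled-linear beta schedule; the 1,000 steps and beta range are assumed to match the standard Stable Diffusion configuration rather than being taken from the paper.

```python
import torch

# Scaled-linear beta schedule (assumed Stable Diffusion defaults: 1000 steps,
# beta from 0.00085 to 0.012, linear in sqrt(beta), then squared).
num_timesteps = 1000
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, num_timesteps) ** 2

alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t); ideally SNR at the final timestep is 0.
terminal_snr = alphas_cumprod[-1] / (1.0 - alphas_cumprod[-1])
print(f"terminal SNR: {terminal_snr.item():.6f}")  # small, but not zero
```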
Proposed Solutions
The paper outlines several corrective measures to realign diffusion models with their theoretical blueprint. The suggestions are as follows:
- Rescale the noise schedule so the terminal SNR is exactly zero, meaning the final timestep corresponds to pure Gaussian noise with no residual signal (see the schedule-rescaling sketch after this list).
- Keep the variance-preserving formulation but switch the training target from the noise ϵ to the velocity v, since at zero terminal SNR the fully noised input contains no information about the image and ϵ-prediction at that step becomes uninformative (a target sketch follows this list).
- Change how sampling steps are selected so that inference always starts from the final timestep; this matters most when only a few sampling steps are used at inference for efficiency (see the trailing-spacing sketch after this list).
- Rescale classifier-free guidance to prevent over-exposed images, a problem that becomes pronounced as the terminal SNR is brought to zero (see the guidance-rescaling sketch after this list).
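For the first bullet, here is a minimal sketch of one way to rescale an existing beta schedule so that the square root of the cumulative alpha product reaches exactly zero at the last timestep; the function name and implementation details are illustrative rather than quoted from the paper.

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so that sqrt(alpha_bar_T) == 0 (zero terminal SNR)."""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    alphas_bar_sqrt = alphas_bar.sqrt()

    # Shift so the last value becomes 0, then scale so the first value is preserved.
    first = alphas_bar_sqrt[0].clone()
    last = alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)

    # Convert back to per-step betas.
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas
```

After this rescaling the last beta equals 1, so the final forward step destroys all remaining signal and the terminal SNR is exactly zero.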
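For the v-prediction bullet, a sketch of how the velocity target can be formed from the clean image and the sampled noise, following the standard definition v = sqrt(ᾱ_t)·ϵ − sqrt(1−ᾱ_t)·x₀; the tensor shapes and names are assumptions for illustration.

```python
import torch

def v_prediction_target(x0: torch.Tensor, noise: torch.Tensor,
                        alphas_cumprod: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Velocity target v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0."""
    signal = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)       # sqrt(alpha_bar_t)
    sigma = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)  # sqrt(1 - alpha_bar_t)
    return signal * noise - sigma * x0

# Training sketch: the network is optimized to predict v instead of the noise, e.g.
# loss = torch.nn.functional.mse_loss(model(x_t, t, cond),
#                                     v_prediction_target(x0, noise, alphas_cumprod, t))
```

At the final timestep with zero SNR the target reduces to −x₀, so it stays informative even though the model's input is pure noise, whereas the ϵ target there would simply equal that input.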
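For the sample-step bullet, a sketch of a "trailing" step selection that always includes the final training timestep; the 1,000-step training schedule and 0-indexing convention are assumptions.

```python
import numpy as np

def trailing_timesteps(num_train_steps: int = 1000, num_inference_steps: int = 10) -> np.ndarray:
    """Space inference steps backwards from the final timestep ("trailing" selection)."""
    step = num_train_steps // num_inference_steps
    return np.arange(num_train_steps, 0, -step) - 1  # 0-indexed timesteps

print(trailing_timesteps())  # [999 899 ... 99]: the last training timestep is always included
```

By contrast, a spacing such as np.arange(0, num_train_steps, step) never includes the final timestep, so few-step inference never begins from the pure-noise step the schedule is meant to end on.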
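For the guidance bullet, a sketch of rescaling the classifier-free guidance output so its per-sample standard deviation stays close to that of the conditional prediction, counteracting the over-exposure effect; the guidance scale and blend factor below are illustrative defaults, not values prescribed by this summary.

```python
import torch

def rescaled_cfg(pred_cond: torch.Tensor, pred_uncond: torch.Tensor,
                 guidance_scale: float = 7.5, rescale: float = 0.7) -> torch.Tensor:
    """Classifier-free guidance with a rescaling step to curb over-exposure."""
    cfg = pred_uncond + guidance_scale * (pred_cond - pred_uncond)

    # Match the guided output's standard deviation to the conditional prediction,
    # then blend with the unrescaled output (the blend factor is a tunable assumption).
    dims = tuple(range(1, pred_cond.ndim))
    std_cond = pred_cond.std(dim=dims, keepdim=True)
    std_cfg = cfg.std(dim=dims, keepdim=True)
    rescaled = cfg * (std_cond / std_cfg)
    return rescale * rescaled + (1.0 - rescale) * cfg
```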
Evaluation and Implications
Further training of the Stable Diffusion model with these adjustments yielded promising results: the corrected model generates images spanning a much wider brightness range and adheres more faithfully to prompts that call for very bright or very dark scenes. Quantitative metrics, the Fréchet Inception Distance and the Inception Score, also improved, indicating that the generated images match the target data distribution more closely.
The findings presented in this paper call for greater care in designing noise schedules and samplers, and encourage adopting these adjustments in future diffusion models. By enforcing consistent behavior between training and inference and ensuring a zero terminal SNR, generative models can better capture the diversity and dynamics of real-world data distributions.