An Expert Overview of "Simple Diffusion: End-to-End Diffusion for High Resolution Images"
This paper examines the use of diffusion models for generating high-resolution images, with the explicit goal of simplification without sacrificing performance. Diffusion models have demonstrated exceptional effectiveness across data generation tasks including images, audio, and video; however, applying them to high-resolution images has traditionally required operating in latent spaces or chaining multi-stage (cascaded) generation pipelines.
Key Contributions
The authors propose notable modifications to standard denoising diffusion models, resulting in an approach termed "simple diffusion," which handles high-resolution image generation in a single end-to-end model with efficacy comparable to more complex methods. The central contributions are:
- Adjusted Noise Schedules: The paper introduces noise schedules tailored to higher resolutions. The standard cosine schedule is shifted so that, at any given step, an image retains roughly the same signal-to-noise ratio it would have at a low reference resolution; without this shift, high-resolution images keep too much global structure late into the noising process. This lets the model reconstruct both global and local structure effectively (a minimal sketch of the shifted schedule follows this list).
- Architecture Scaling Strategy: The research demonstrates that model capacity is best scaled at a single low resolution inside the U-Net, specifically its 16×16 blocks. Concentrating parameters there reduces memory usage and computational demands while maintaining high performance.
- Dropout and Downsampling Techniques: Dropout is applied selectively, predominantly in the lower-resolution layers of the model. Costly high-resolution feature maps are avoided by downsampling the input immediately at the network's first layer, using either a Discrete Wavelet Transform (DWT) or strided convolutions, which keeps the approach both efficient and effective (an illustrative Haar DWT level is sketched below).
- Multiscale Loss Function: A multiscale training loss is introduced that adds loss terms computed at downsampled resolutions, weighted to emphasize coarser scales. This improves convergence at high resolutions, where high-frequency detail would otherwise dominate the training signal, at little extra computational cost (a sketch of this loss appears after this list).
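To make the schedule shift concrete, here is a minimal Python sketch of the shifted cosine log-SNR, using the paper's base resolution of 64; the function name and interface are illustrative, not taken from the authors' code.

```python
import math

def shifted_cosine_logsnr(t: float, resolution: int, base_resolution: int = 64) -> float:
    """Log signal-to-noise ratio of the shifted cosine schedule.

    Standard cosine schedule: logSNR(t) = -2 * log(tan(pi * t / 2)).
    Shifting by 2 * log(base/d) lowers the SNR at resolution d so that
    global image structure is noised at the same rate as at the base
    resolution (64 in the paper).
    """
    log_snr = -2.0 * math.log(math.tan(math.pi * t / 2.0))
    shift = 2.0 * math.log(base_resolution / resolution)
    return log_snr + shift

# At the midpoint t = 0.5 the unshifted schedule gives logSNR = 0;
# at 512x512 the shift contributes 2 * log(64/512), roughly -4.16.
print(shifted_cosine_logsnr(0.5, resolution=512))
```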
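For the input-downsampling idea, the sketch below implements one level of a 2D Haar DWT, one of the two first-layer options the paper mentions; this NumPy version is an illustrative stand-in rather than the authors' implementation.

```python
import numpy as np

def haar_dwt_level(x: np.ndarray) -> np.ndarray:
    """One level of a 2D Haar DWT on an image batch of shape (B, C, H, W).

    Each 2x2 block maps to four coefficients (LL, LH, HL, HH), so the
    output has shape (B, 4C, H/2, W/2): spatial resolution halves while
    the channel count quadruples, keeping all information.
    """
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-pass (coarse) band
    lh = (a + b - c - d) / 2.0  # horizontal detail
    hl = (a - b + c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return np.concatenate([ll, lh, hl, hh], axis=1)

x = np.random.randn(1, 3, 512, 512)
print(haar_dwt_level(x).shape)  # (1, 12, 256, 256)
```

Because the transform is invertible, downsampling this way discards no information; the network simply operates on a spatially smaller, channel-richer representation.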
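Finally, a hedged sketch of the multiscale loss. It assumes an epsilon-prediction objective and average pooling as the downsampling operator, with losses at resolutions s weighted by 1/s as described in the paper; everything beyond that weighting is an assumption for illustration.

```python
import numpy as np

def avg_pool(x: np.ndarray, factor: int) -> np.ndarray:
    """Downsample a (B, C, H, W) array by average pooling."""
    b, c, h, w = x.shape
    return x.reshape(b, c, h // factor, factor, w // factor, factor).mean(axis=(3, 5))

def multiscale_loss(eps: np.ndarray, eps_hat: np.ndarray, base: int = 32) -> float:
    """MSE between true and predicted noise, evaluated at resolutions
    base, 2*base, ..., d and weighted by 1/s, so coarse (global) scales
    are not drowned out by high-frequency detail.
    """
    d = eps.shape[-1]
    total, s = 0.0, base
    while s <= d:
        f = d // s  # pooling factor that brings d down to s
        total += (1.0 / s) * np.mean((avg_pool(eps, f) - avg_pool(eps_hat, f)) ** 2)
        s *= 2
    return total
```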
Empirical Results
Applying these strategies yields state-of-the-art performance on ImageNet generation without sampling modifications such as classifier-free guidance or rejection sampling. Specifically, the models achieve competitive FID and Inception Score (IS) results at resolutions up to 512×512. By minimizing complexity, the authors streamline the diffusion pipeline while matching the visual quality of more elaborate setups such as cascaded generation frameworks.
Implications and Future Directions
The contributions of this research simplify the training and use of diffusion models in high-resolution settings. This has substantial implications for practical applications under resource constraints or when rapid prototyping is essential. Additionally, the successful adaptation of the simple diffusion framework to text-to-image generation, achieving competitive FID scores on the COCO benchmark, showcases its versatility.
Looking forward, future research may build on these findings to extend end-to-end diffusion to domains such as video generation and other modalities requiring high-resolution outputs. Deeper investigation into adaptive noise schedules across varying data types and resolutions would also help strengthen the generality of the approach.
In conclusion, this paper makes significant strides in balancing simplicity with performance for high-resolution image synthesis with diffusion models. The results underscore the potential for further advances in generative modeling driven by streamlined, end-to-end methods.