
Relay Diffusion: Unifying diffusion process across resolutions for image synthesis (2309.03350v1)

Published 4 Sep 2023 in cs.CV and cs.LG

Abstract: Diffusion models achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that \emph{the same noise level on a higher resolution results in a higher Signal-to-Noise Ratio in the frequency domain}. In this work, we present Relay Diffusion Model (RDM), which transfers a low-resolution image or noise into an equivalent high-resolution one for diffusion model via blurring diffusion and block noise. Therefore, the diffusion process can continue seamlessly in any new resolution or model without restarting from pure noise or low-resolution conditioning. RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256, surpassing previous works such as ADM, LDM and DiT by a large margin. All the codes and checkpoints are open-sourced at \url{https://github.com/THUDM/RelayDiffusion}.

Authors (7)
  1. Jiayan Teng (8 papers)
  2. Wendi Zheng (12 papers)
  3. Ming Ding (219 papers)
  4. Wenyi Hong (14 papers)
  5. Jianqiao Wangni (14 papers)
  6. Zhuoyi Yang (18 papers)
  7. Jie Tang (302 papers)
Citations (24)

Summary


The paper "Relay Diffusion: Unifying Diffusion Process Across Resolutions for Image Synthesis" introduces the Relay Diffusion Model (RDM), a novel approach to high-resolution image synthesis using diffusion models. Diffusion models have demonstrated significant promise in generative tasks; however, they face challenges in efficiently generating high-resolution images. The authors tackle these challenges by proposing RDM, which integrates block noise and blurring diffusion processes to transfer information seamlessly from low-resolution to high-resolution stages.

Technical Contributions and Model Overview

  1. Frequency Domain Analysis: The authors analyze the challenges of high-resolution image synthesis in the frequency domain via the discrete cosine transform. They observe that the same noise level produces a higher signal-to-noise ratio (SNR) in the frequency domain at higher resolutions, so a noise schedule tuned at low resolution under-corrupts high-resolution images. This mismatch is attributed to the noise schedule's lack of resolution dependence.
  2. Introduction of Block Noise: RDM incorporates block noise, in which pixels within a block share correlated noise, to maintain equivalence between low- and high-resolution representations. Because i.i.d. Gaussian noise is spatially uncorrelated, most of its power vanishes under downsampling; block noise survives resolution changes, ensuring a consistent noise representation across resolutions and smoother transitions in the diffusion process.
  3. Patch-Wise Blurring Diffusion: The paper advances the method of blurring diffusion by applying it patch-wise, specifically targeting grid correlations left by the upsampling process. This blurring is disentangled from the initial noise schedule, which allows the diffusion process to start from low-resolution images rather than pure noise, yielding computational efficiency.
  4. Reduction in Training and Sampling Steps: RDM reduces computational overhead by decreasing the necessity for extensive re-sampling and training steps typically required in super-resolution stages of cascaded models. This reduction is achieved by directly using the results of a low-resolution model as input to the subsequent high-resolution process instead of generating from scratch.
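
The frequency-domain observation in point 1 can be reproduced with a short numerical sketch (my own illustration, not the paper's code): upsampling an image concentrates its energy in the low DCT frequencies, while white Gaussian noise stays flat across frequencies, so the same noise level yields a much higher low-frequency SNR at the higher resolution.

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
sigma = 1.0                             # same noise level at both resolutions
x_lo = rng.random((64, 64))             # stand-in for a 64x64 image
x_hi = np.kron(x_lo, np.ones((4, 4)))   # nearest-neighbor upsample to 256x256

def low_band_snr(x, sigma, band=64):
    """Mean signal power over the lowest band x band DCT frequencies,
    divided by the (flat) per-coefficient noise power sigma^2."""
    X = dctn(x, norm="ortho")
    return np.mean(X[:band, :band] ** 2) / sigma**2

snr_lo = low_band_snr(x_lo, sigma)
snr_hi = low_band_snr(x_hi, sigma)
assert snr_hi > 4 * snr_lo  # the same sigma corrupts 256x256 far less
```

Because the orthonormal DCT preserves energy, 4x upsampling scales each low-frequency coefficient by roughly 4x (16x in power) while the noise power per coefficient is unchanged; this is the SNR mismatch that motivates resolution-dependent schedules.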

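A minimal construction in the spirit of point 2 (an illustrative sketch, not the paper's exact definition): replicate low-resolution Gaussian noise over s x s blocks. Average-pooling such block noise back down recovers unit-variance Gaussian noise, whereas pooled i.i.d. noise loses almost all its power.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 4  # upscaling factor, e.g. 64 -> 256

def avg_pool(x, s):
    """Average-pool a 2D array over non-overlapping s x s blocks."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

# Block noise: every s x s block shares one low-resolution Gaussian sample.
eps_lo = rng.standard_normal((64, 64))
eps_block = np.kron(eps_lo, np.ones((s, s)))

# Downsampling block noise returns the original unit-variance noise ...
assert np.allclose(avg_pool(eps_block, s), eps_lo)

# ... while pooled i.i.d. Gaussian noise collapses to variance ~1/s^2.
eps_iid = rng.standard_normal((256, 256))
print(avg_pool(eps_iid, s).var())
```

RDM's actual forward process combines block noise with ordinary Gaussian noise under a schedule; this sketch only shows why block noise stays "equivalent" across resolutions while i.i.d. noise does not.
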
Experimental Evaluation and Results

RDM is empirically evaluated against existing generative models on benchmarks including CelebA-HQ and ImageNet at a resolution of 256×256. The model achieves a state-of-the-art Fréchet Inception Distance (FID) of 3.15 on CelebA-HQ and strong performance on ImageNet, with an FID of 5.27 (without classifier-free guidance) and a notably low sFID, showcasing its efficient handling of spatial fidelity and perceptual quality.

  • Comparative Performance: The model surpasses several baseline architectures, including ADM, LDM, and DiT, highlighting its capability in maintaining high image quality across resolutions while also being computationally economical.
  • Discrepancies Mitigated: Unlike existing cascaded diffusion models that require augmentation techniques to handle distribution mismatch across resolutions, RDM inherently mitigates these through its systematic transition process using blurring diffusion and block noise.
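
The blurring-diffusion transition mentioned above can be sketched generically (a heat-equation-style forward step in the spirit of prior blurring-diffusion work; my own simplified schedule, not RDM's patch-wise formulation): each DCT frequency is decayed exponentially before Gaussian noise is added.

```python
import numpy as np
from scipy.fft import dctn, idctn

def blur_diffuse(x, t, sigma, rng):
    """Forward blurring-diffusion step: run the heat equation for time t
    in the DCT domain, then add white Gaussian noise of std sigma."""
    n, m = x.shape
    freqs_i = (np.pi * np.arange(n) / n) ** 2
    freqs_j = (np.pi * np.arange(m) / m) ** 2
    lam = freqs_i[:, None] + freqs_j[None, :]   # heat-kernel eigenvalues
    X = dctn(x, norm="ortho")
    x_blur = idctn(np.exp(-t * lam) * X, norm="ortho")
    return x_blur + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x0 = rng.random((64, 64))
xt = blur_diffuse(x0, t=2.0, sigma=0.05, rng=rng)
```

Only the DC frequency (lam = 0) survives as t grows, so heavy blurring drives the image toward its mean; RDM's high-resolution stage starts from such a blurred and noised upsampled sample rather than from pure noise.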

Theoretical and Practical Implications

The paper's contributions extend beyond the immediate empirical results, offering theoretical insight into resolution-dependent noise scheduling and the frequency domain's impact on generative modeling. From a practical standpoint, RDM streamlines the generative process and reduces resource demands, suggesting a pathway toward scalable high-resolution image generation for applications where sampling efficiency matters.

Future Directions

As shown in the paper, RDM sets a strong precedent for integrating novel noise types and adaptive diffusion processes into high-resolution image synthesis. Future research could explore adaptive noise scheduling that tunes the mixture of Gaussian and block noise with resolution. Extending the framework to multi-modal or three-dimensional generative tasks could further widen the applicability of diffusion models in practice.

In conclusion, RDM represents a meaningful step forward in resolving critical issues faced by high-resolution diffusion models, effectively bridging key gaps with innovative approaches in noise handling and process integration across resolutions. The presented work is poised to impact not only the domain of image synthesis but also the broader fields of AI and computational creativity.