
Bring Metric Functions into Diffusion Models (2401.02414v1)

Published 4 Jan 2024 in cs.CV

Abstract: We introduce a Cascaded Diffusion Model (Cas-DM) that improves a Denoising Diffusion Probabilistic Model (DDPM) by effectively incorporating additional metric functions in training. Metric functions such as the LPIPS loss have been proven highly effective in consistency models derived from the score matching. However, for the diffusion counterparts, the methodology and efficacy of adding extra metric functions remain unclear. One major challenge is the mismatch between the noise predicted by a DDPM at each step and the desired clean image that the metric function works well on. To address this problem, we propose Cas-DM, a network architecture that cascades two network modules to effectively apply metric functions to the diffusion model training. The first module, similar to a standard DDPM, learns to predict the added noise and is unaffected by the metric function. The second cascaded module learns to predict the clean image, thereby facilitating the metric function computation. Experiment results show that the proposed diffusion model backbone enables the effective use of the LPIPS loss, leading to state-of-the-art image quality (FID, sFID, IS) on various established benchmarks.


Summary

  • The paper introduces a Cascaded Diffusion Model (Cas-DM) that integrates LPIPS loss into DDPM to improve image generation.
  • It employs a dual-module architecture that separates noise prediction from clean image estimation, enabling effective metric function integration.
  • Experiments on CIFAR10, CelebA-HQ, and ImageNet show improvements in FID, sFID, and Inception Score over prior diffusion baselines.

Introduction

Visual content generation has advanced markedly with the advent of the Denoising Diffusion Probabilistic Model (DDPM), which generates images through an iterative denoising process. A notable recent development is the use of metric functions, specifically the Learned Perceptual Image Patch Similarity (LPIPS) loss, to improve image generation quality. Although effective in consistency models, integrating LPIPS into DDPM training has remained challenging: at each step the model predicts noise, whereas the metric function is designed to operate on a clean image.
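The mismatch above can be made concrete with the standard DDPM forward process. The sketch below is illustrative only and uses NumPy in place of a real network; the linear beta schedule and the closed-form inversion from a noise prediction back to a clean-image estimate are standard DDPM relations, not details specific to this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard DDPM noise schedule: beta_t increases linearly, and
# alpha_bar_t is the cumulative product of (1 - beta_t).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Forward process: noise a clean image x0 to timestep t."""
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def predict_x0(x_t, t, eps_hat):
    """Closed-form inversion: recover a clean-image estimate from a
    noise prediction eps_hat. A trained DDPM outputs eps_hat; a
    perceptual metric like LPIPS needs this x0-space quantity."""
    ab = alpha_bars[t]
    return (x_t - np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(ab)

# A toy "clean image" and one sampled noise tensor.
x0 = rng.standard_normal((3, 32, 32))
eps = rng.standard_normal(x0.shape)
t = 500

x_t = q_sample(x0, t, eps)

# With the true noise, the inversion recovers x0 exactly.
print(np.allclose(predict_x0(x_t, t, eps), x0))  # True
```

When `eps_hat` comes from an imperfectly trained network rather than the true noise, the recovered `x0` estimate is coarse, which is why applying LPIPS directly to it is problematic and motivates the cascaded design below.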

Cascaded Diffusion Model (Cas-DM)

To address this integration problem, the paper introduces the Cascaded Diffusion Model (Cas-DM), an architecture that incorporates metric functions into diffusion model training without disturbing the noise-prediction objective. Cas-DM cascades two network modules: the first operates like a standard DDPM and predicts the added noise, while the second refines a prediction of the clean image on which the metric function can be computed. This separation of tasks is central to Cas-DM's ability to employ metric functions while preserving the original DDPM behavior.
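A minimal sketch of how such a cascaded training loss could be wired up is shown below. This is an assumption-laden illustration, not the paper's implementation: both modules are stand-in functions (a real system would use U-Nets), plain MSE stands in for LPIPS, and the loss weight `lam` is a hypothetical hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def module_one(x_t, t):
    # Stand-in for the first, DDPM-like module that predicts the
    # added noise. A real implementation would be a U-Net.
    return np.zeros_like(x_t)

def module_two(x0_coarse, x_t, t):
    # Stand-in for the cascaded second module that refines the
    # clean-image estimate; identity placeholder here.
    return x0_coarse

def metric_loss(a, b):
    # Placeholder for a perceptual metric such as LPIPS; plain MSE here.
    return float(np.mean((a - b) ** 2))

def cas_dm_loss(x0, t, lam=0.1):
    """Combined training loss: DDPM noise loss on module one plus a
    metric loss on module two's refined clean-image prediction.
    (In the paper, the metric loss does not affect the first module;
    a real implementation would stop gradients accordingly.)"""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bars[t]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

    eps_hat = module_one(x_t, t)
    loss_eps = float(np.mean((eps - eps_hat) ** 2))  # standard DDPM loss

    # Invert the forward process for a coarse clean-image estimate,
    # then let the second module refine it for the metric loss.
    x0_coarse = (x_t - np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(ab)
    x0_hat = module_two(x0_coarse, x_t, t)
    loss_metric = metric_loss(x0_hat, x0)

    return loss_eps + lam * loss_metric

x0 = rng.standard_normal((3, 32, 32))
loss = cas_dm_loss(x0, t=500)
```

The key design point the sketch captures is that the noise-prediction loss and the metric loss attach to different modules, so the metric function can operate in clean-image space without perturbing the standard DDPM objective.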

Experimentation and Results

The effectiveness of the LPIPS loss in Cas-DM is validated through experiments on several benchmarks, including CIFAR10, CelebA-HQ, and ImageNet, with improvements in FID, sFID, and Inception Score. The results show that Cas-DM is not only a viable extension of DDPM but also achieves state-of-the-art image quality compared with previous diffusion models.

Implications for Future Work

The findings open avenues for further exploration of architecture optimizations and of other metric functions for diffusion model training. Cas-DM's applicability across image resolutions and dataset complexities, together with its ability to incorporate perceptually grounded metrics, makes it a promising foundation for future generative modeling work.
