
Diffusion Model with Perceptual Loss (2401.00110v6)

Published 30 Dec 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Diffusion models without guidance tend to generate unrealistic samples, yet the cause of this problem is not fully studied. Our analysis suggests that the loss objective plays an important role in shaping the learned distribution and the common mean squared error loss is not optimal. We hypothesize that a better loss objective can be designed with inductive biases and propose a novel self-perceptual loss that utilizes the diffusion model itself as the perceptual loss. Our work demonstrates that perceptual loss can be used in diffusion training to improve sample quality effectively. Models trained using our objective can generate realistic samples without guidance. We hope our work paves the way for more future explorations of the diffusion loss objective.

Authors: Shanchuan Lin, Xiao Yang

Citations: 12

Summary

Introduction to Diffusion Models

Diffusion models are innovative generative models designed to transform random noise into structured and meaningful data, such as images, through a process of denoising. The procedure to create new samples can be thought of as a reverse simulation where noise is incrementally removed to uncover the data representation. These models have achieved remarkable success in image generation and their capabilities extend to other forms of media.

The standard training objective for diffusion models is the mean squared error (MSE) loss. While conceptually straightforward, it falls short of producing highly realistic samples on its own. To compensate, state-of-the-art models employ techniques such as classifier-free guidance, which has been shown to enhance image quality significantly, though the reasons behind its effectiveness were not entirely clear until now.
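The MSE objective mentioned above can be sketched in a few lines of PyTorch. This is a generic illustration, not the paper's exact setup: the epsilon-prediction parameterization and the `alphas_cumprod` noise schedule are common conventions assumed here.

```python
import torch

def mse_diffusion_loss(model, x0, t, alphas_cumprod):
    """Standard epsilon-prediction objective: noise a clean sample,
    ask the model to predict the injected noise, and penalise with MSE."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)          # per-sample signal level
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise      # forward diffusion q(x_t | x_0)
    eps_pred = model(x_t, t)                          # model predicts the noise
    return torch.nn.functional.mse_loss(eps_pred, noise)
```

The loss is a single scalar averaged over all pixels, which is exactly the uniform pixel-space weighting the paper argues is suboptimal.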

Perceptual Loss and Improved Sample Quality

This paper reveals that the notable performance of classifier-free guidance in producing high-quality samples is due, in part, to its implicit use of perceptual guidance. The idea is to integrate perceptual loss, a measure more aligned with human visual perception, directly into the training of diffusion models. Notably, the diffusion model itself can serve as an effective perceptual network, removing the need for an external feature network. By leveraging this intrinsic capability, the researchers propose a new training objective called the self-perceptual objective.

Advantages of Self-Perceptual Objective

The proposed self-perceptual objective has multiple advantages:

  • It enhances the realism of generated images without sacrificing diversity in conditional generation, a common trade-off of classifier-free guidance.
  • Unlike classifier-free guidance, which is specialized for improving conditional models, the self-perceptual objective is also capable of boosting the quality of unconditional models, where no guidance is provided by annotations or labels.
  • It is designed to avoid the limitations of classifier-free guidance, such as overexposure and over-saturation artifacts that can appear at strong guidance scales.
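The overexposure and over-saturation artifacts mentioned above stem from how classifier-free guidance extrapolates between two noise predictions at every sampling step. A minimal sketch of the standard CFG combination rule (general background, not specific to this paper):

```python
import torch

def cfg_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance: blend the unconditional and conditional
    noise predictions. w = 1 recovers the plain conditional prediction;
    w > 1 extrapolates away from the unconditional one, which sharpens
    samples but can over-saturate at large guidance scales."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Because the combined prediction is an extrapolation rather than an interpolation for `w > 1`, its magnitude can exceed anything the model saw during training, which is one intuition for the artifacts at strong guidance scales.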

Evaluating the New Objective

The researchers conducted thorough evaluations, including both qualitative and quantitative analyses, across a variety of datasets and conditions. The findings confirm that self-perceptual training offers a meaningful increase in sample quality over the traditional MSE loss. However, in text-to-image generation, classifier-free guidance still produces the best overall images because it enhances alignment with the text prompts at the expense of sample diversity.

In summary, this paper presents a promising direction for training diffusion models with perceptual loss to improve the realism of generated images. The self-perceptual objective acts as a powerful tool for future developments in generative modeling, particularly within the realms of image, video, and audio synthesis.

The implementation provided outlines a PyTorch-based approach to self-perceptual training, showcasing a practical means by which researchers and AI practitioners can apply these findings to their own work.
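As a rough illustration of what such a PyTorch implementation might look like, the sketch below follows the description in the summary: the online model predicts the clean image, and a frozen copy of the same diffusion model scores that prediction in feature space rather than pixel space. This is a hedged reconstruction, not the authors' code; in particular, `frozen.features` is a hypothetical hook returning an intermediate activation, the x0-prediction parameterization is assumed, and the noise-level sampling is simplified.

```python
import torch
import torch.nn as nn

def self_perceptual_loss(online, frozen, x0, t, alphas_cumprod):
    """Sketch of a self-perceptual objective: re-noise both the real x0
    and the online model's x0 prediction with the SAME fresh noise, pass
    both through a frozen copy of the model, and compare its hidden
    features instead of raw pixels."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    x0_pred = online(x_t, t)                 # assumed x0-parameterization

    # Fresh noise level for the perceptual pass.
    t2 = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    a2 = alphas_cumprod[t2].view(-1, 1, 1, 1)
    noise2 = torch.randn_like(x0)
    with torch.no_grad():                    # target features carry no gradient
        feat_target = frozen.features(a2.sqrt() * x0 + (1 - a2).sqrt() * noise2, t2)
    feat_pred = frozen.features(a2.sqrt() * x0_pred + (1 - a2).sqrt() * noise2, t2)
    return torch.nn.functional.mse_loss(feat_pred, feat_target)
```

The key design choice, per the paper's framing, is that the perceptual network is the diffusion model itself, so no external feature extractor has to be trained or downloaded.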