Bring Metric Functions into Diffusion Models (2401.02414v1)
Abstract: We introduce the Cascaded Diffusion Model (Cas-DM), which improves a Denoising Diffusion Probabilistic Model (DDPM) by effectively incorporating additional metric functions during training. Metric functions such as the LPIPS loss have proven highly effective in consistency models derived from score matching. For their diffusion counterparts, however, both the methodology and the efficacy of adding extra metric functions remain unclear. One major challenge is the mismatch between the noise that a DDPM predicts at each step and the clean image on which such metric functions work well. To address this problem, we propose Cas-DM, a network architecture that cascades two modules so that metric functions can be applied effectively during diffusion model training. The first module, similar to a standard DDPM, learns to predict the added noise and is unaffected by the metric function. The second, cascaded module learns to predict the clean image, thereby facilitating the metric function computation. Experimental results show that the proposed diffusion model backbone enables effective use of the LPIPS loss, leading to state-of-the-art image quality (FID, sFID, IS) on various established benchmarks.
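The two-module design described in the abstract maps onto a short training-step sketch. Below is a minimal PyTorch illustration, assuming hypothetical `noise_net` / `clean_net` modules and an external LPIPS implementation (e.g., the `lpips` package). The exact cascading mechanism is not specified by the abstract; here the second module refines the clean image implied by the detached noise prediction, which is an assumption, not the paper's specification.

```python
# Minimal sketch of a Cas-DM-style training step, following only the abstract:
# module one predicts the added noise (untouched by the metric loss), and a
# second cascaded module predicts the clean image so LPIPS can be applied.
# All names here (noise_net, clean_net, lpips_loss) are illustrative.

import torch
import torch.nn.functional as F

def cas_dm_loss(noise_net, clean_net, lpips_loss, x0, alphas_cumprod):
    """One training step for a two-module cascaded diffusion model.

    noise_net  -- predicts the noise eps added at step t (standard DDPM module)
    clean_net  -- cascaded module that predicts the clean image x0
    lpips_loss -- perceptual metric, e.g. lpips.LPIPS(net='vgg') (assumption)
    """
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    eps = torch.randn_like(x0)

    # Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    abar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

    # Module 1: standard DDPM noise prediction, trained with MSE only.
    eps_pred = noise_net(x_t, t)
    loss_noise = F.mse_loss(eps_pred, eps)

    # Module 2: cascaded clean-image prediction. Detaching eps_pred keeps the
    # metric loss from flowing into the noise module, matching the abstract's
    # claim that module one is unaffected by the metric function.
    x0_from_eps = (x_t - (1.0 - abar).sqrt() * eps_pred.detach()) / abar.sqrt()
    x0_pred = clean_net(x0_from_eps, t)

    # Metric losses (here MSE + LPIPS) operate on the predicted clean image.
    loss_clean = F.mse_loss(x0_pred, x0) + lpips_loss(x0_pred, x0).mean()
    return loss_noise + loss_clean
```

Detaching the noise prediction before the second module is one plausible way to honor the claim that the metric loss leaves the noise-prediction module untouched; the paper may realize this isolation differently.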