Improving the Stability and Efficiency of Diffusion Models for Content Consistent Super-Resolution (2401.00877v2)
Abstract: The generative priors of pre-trained latent diffusion models (DMs) have demonstrated great potential to enhance the visual quality of image super-resolution (SR) results. However, the noise sampling process in DMs introduces randomness in the SR outputs, and the generated contents can differ a lot with different noise samples. The multi-step diffusion process can be accelerated by distilling methods, but the generative capacity is difficult to control. To address these issues, we analyze the respective advantages of DMs and generative adversarial networks (GANs) and propose to partition the generative SR process into two stages, where the DM is employed for reconstructing image structures and the GAN is employed for improving fine-grained details. Specifically, we propose a non-uniform timestep sampling strategy in the first stage. A single timestep sampling is first applied to extract the coarse information from the input image, then a few reverse steps are used to reconstruct the main structures. In the second stage, we finetune the decoder of the pre-trained variational auto-encoder by adversarial GAN training for deterministic detail enhancement. Once trained, our proposed method, namely content consistent super-resolution (CCSR),allows flexible use of different diffusion steps in the inference stage without re-training. Extensive experiments show that with 2 or even 1 diffusion step, CCSR can significantly improve the content consistency of SR outputs while keeping high perceptual quality. Codes and models can be found at \href{https://github.com/csslc/CCSR}{https://github.com/csslc/CCSR}.
- NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 126–135, 2017.
- The perception-distortion tradeoff. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6228–6237, 2018.
- Toward real-world single image super-resolution: A new benchmark and a new model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- Learning a deep convolutional network for image super-resolution. In ECCV, pages 184–199, 2014.
- Real-world blind super-resolution via feature matching with implicit high-resolution priors. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1329–1338, 2022.
- Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22367–22377, 2023.
- Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
- Image quality assessment: Unifying structure and texture similarity. IEEE transactions on pattern analysis and machine intelligence, 44(5):2567–2581, 2020.
- Accelerating the super-resolution convolutional neural network. In European conference on computer vision, pages 391–407. Springer, 2016.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
- Generative diffusion prior for unified image restoration and enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9935–9946, 2023.
- Fourier space losses for efficient perceptual image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2360–2369, 2021.
- Blind super-resolution with iterative kernel correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1604–1613, 2019a.
- Div8k: Diverse 8k resolution image dataset. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3512–3516. IEEE, 2019b.
- Shadowdiffusion: When degradation prior meets diffusion model for shadow removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14049–14058, 2023.
- Deep back-projection networks for super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1664–1673, 2018.
- Gcfsr: a generative and controllable face super resolution method without facial and gan priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1889–1898, 2022.
- Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
- Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Wavelet-SRNET: A wavelet-based CNN for multi-scale face super resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 1689–1697, 2017.
- Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
- A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
- Denoising diffusion restoration models. Advances in Neural Information Processing Systems, 35:23593–23606, 2022.
- Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.
- Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1637–1645, 2016.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
- Best-Buddy GANs for highly detailed image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1412–1420, 2022.
- Diffusion models for image restoration and enhancement–a comprehensive survey. arXiv preprint arXiv:2308.09388, 2023.
- SwinIR: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.
- Efficient and degradation-adaptive network for real-world image super-resolution. In European Conference on Computer Vision, 2022a.
- Details or Artifacts: A locally discriminative learning approach to realistic image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5657–5666, 2022b.
- Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.
- Diffbir: Towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070, 2023.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
- Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
- Image super-resolution via latent diffusion: A sampling-space mixture of experts and frequency-augmented decoder approach. 2023.
- Image super-resolution with non-local sparse attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3517–3526, 2021.
- T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022a.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022b.
- Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2022.
- EnhanceNet: Single image super-resolution through automated texture synthesis. In Proceedings of the IEEE international conference on computer vision, pages 4491–4500, 2017.
- Resdiff: Combining cnn and diffusion model for image super-resolution. arXiv preprint arXiv:2303.08714, 2023.
- Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 114–125, 2017.
- Image super-resolution using dense skip connections. In Proceedings of the IEEE international conference on computer vision, pages 4799–4807, 2017.
- Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023a.
- Diffusion model is secretly a training-free open vocabulary semantic segmenter. arXiv preprint arXiv:2309.02773, 2023b.
- Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015, 2023c.
- Lightweight non-local network for image super-resolution. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1625–1629. IEEE, 2021a.
- Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 606–615, 2018a.
- ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018b.
- Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9168–9178, 2021b.
- Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914, 2021c.
- Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490, 2022.
- Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, pages 1398–1402. Ieee, 2003.
- Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
- Deep learning for image super-resolution: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3365–3387, 2021d.
- Dr2: Diffusion-based robust degradation remover for blind face restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1704–1713, 2023d.
- Component divide-and-conquer for real-world image super-resolution. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16, pages 101–117. Springer, 2020.
- Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804, 2021.
- What hinders perceptual quality of PSNR-oriented methods? arXiv preprint arXiv:2201.01034, 2022.
- Gan prior embedded network for blind face restoration in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 672–681, 2021.
- Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. arXiv preprint arXiv:2308.14469, 2023.
- Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.
- Resshift: Efficient diffusion model for image super-resolution by residual shifting. arXiv preprint arXiv:2307.12348, 2023.
- Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing, 26(7):3142–3155, 2017.
- Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791–4800, 2021.
- Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023a.
- The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018a.
- Efficient long-range attention network for image super-resolution. arXiv preprint arXiv:2203.06697, 2022a.
- Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018b.
- Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018c.
- Perception-distortion balanced ADMM optimization for single-image super-resolution. In European Conference on Computer Vision, pages 108–125. Springer, 2022b.
- A unified conditional framework for diffusion-based image restoration. arXiv preprint arXiv:2305.20049, 2023b.
- Truncated diffusion probabilistic models and diffusion-based adversarial auto-encoders. arXiv preprint arXiv:2202.09671, 2022.
- Designing a better asymmetric vqgan for stablediffusion. arXiv preprint arXiv:2306.04632, 2023.
- Lingchen Sun (10 papers)
- Rongyuan Wu (11 papers)
- Zhengqiang Zhang (19 papers)
- Hongwei Yong (12 papers)
- Lei Zhang (1689 papers)
- Jie Liang (82 papers)