Image Super-Resolution via Latent Diffusion: A Sampling-Space Mixture of Experts and Frequency-Augmented Decoder Approach (arXiv:2310.12004v3)
Abstract: Diffusion priors, strengthened by pre-trained text-to-image models, have markedly improved image super-resolution (SR). To reduce the heavy computational cost of pixel-based diffusion SR, latent-based methods use a feature encoder to transform the image and then generate the SR image in a compact latent space. Two major issues nevertheless limit latent-based diffusion. First, compression into the latent space usually causes reconstruction distortion. Second, the high computational cost constrains the parameter scale of the diffusion model. To address these issues, we first propose a frequency compensation module that enhances frequency components when decoding from latent space to pixel space, which significantly reduces reconstruction distortion, especially for high-frequency information. We then propose a Sample-Space Mixture of Experts (SS-MoE) for more powerful latent-based SR, which steadily increases model capacity without a significant increase in inference cost. These designs improve performance on the widely studied 4x blind super-resolution benchmarks and extend to larger magnification factors, i.e., 8x image SR benchmarks. The code is available at https://github.com/amandaluof/moe_sr.
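The abstract does not spell out how SS-MoE keeps inference cost flat while adding capacity. One plausible reading, assumed here and not confirmed by the text above, is that each denoising step is routed to exactly one expert chosen by its timestep, so the parameter count grows with the number of experts while the per-step compute stays that of a single network. The sketch below illustrates only that routing idea; `make_expert`, `select_expert`, `sample`, and the equal-width stage split are hypothetical names and choices, and the toy experts stand in for real denoising networks.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a denoiser maps (latent, timestep) to an updated latent.
Denoiser = Callable[[List[float], int], List[float]]


def make_expert(scale: float) -> Denoiser:
    """Toy stand-in for a denoising network; `scale` is an illustrative knob."""
    def expert(latent: List[float], t: int) -> List[float]:
        return [scale * v for v in latent]
    return expert


def select_expert(t: int, num_steps: int, num_experts: int) -> int:
    """Map timestep t in [0, num_steps) to one of num_experts equal-width stages."""
    if not 0 <= t < num_steps:
        raise ValueError("t out of range")
    return t * num_experts // num_steps


def sample(latent: List[float], experts: List[Denoiser],
           num_steps: int) -> Tuple[List[float], List[int]]:
    """Run a toy reverse process from t = num_steps - 1 down to 0.

    Exactly one expert is called per step, so the per-step cost does not
    depend on how many experts exist. Returns the final latent and the
    index of the expert used at each step.
    """
    used: List[int] = []
    for t in reversed(range(num_steps)):
        idx = select_expert(t, num_steps, len(experts))
        latent = experts[idx](latent, t)
        used.append(idx)
    return latent, used


if __name__ == "__main__":
    experts = [make_expert(0.5), make_expert(1.0)]
    final_latent, used = sample([8.0], experts, num_steps=4)
    print(used, final_latent)
```

In this toy setup, the high-noise steps go to the second expert and the low-noise steps to the first; a real system would replace the toy experts with trained denoisers and might choose stage boundaries differently.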
Authors: Feng Luo, Jinxi Xiang, Jun Zhang, Xiao Han, Wei Yang