Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild (2401.13627v2)

Published 24 Jan 2024 in cs.CV

Abstract: We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image restoration method that harnesses generative prior and the power of model scaling up. Leveraging multi-modal techniques and advanced generative prior, SUPIR marks a significant advance in intelligent and realistic image restoration. As a pivotal catalyst within SUPIR, model scaling dramatically enhances its capabilities and demonstrates new potential for image restoration. We collect a dataset comprising 20 million high-resolution, high-quality images for model training, each enriched with descriptive text annotations. SUPIR provides the capability to restore images guided by textual prompts, broadening its application scope and potential. Moreover, we introduce negative-quality prompts to further improve perceptual quality. We also develop a restoration-guided sampling method to suppress the fidelity issue encountered in generative-based restoration. Experiments demonstrate SUPIR's exceptional restoration effects and its novel capacity to manipulate restoration through textual prompts.

Summary

  • The paper introduces SUPIR, a scalable image restoration framework that uses SDXL and extensive high-quality data to boost perceptual quality.
  • It employs a degradation-robust encoder and a ZeroSFT connector within a scalable adaptor, reducing computational load while preserving output fidelity.
  • SUPIR combines restoration-guided sampling with textual prompts, outperforming state-of-the-art methods on no-reference perceptual quality metrics in real-world scenarios.

Scaling Up to Excellence: Model Scaling for Photo-Realistic Image Restoration

The paper entitled "Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild" introduces SUPIR, a model that leverages the principles of scaling to advance the field of image restoration (IR). By harnessing vast datasets and employing sophisticated generative models, SUPIR aims to enhance both the visual fidelity and the perceptual quality of restored images.

Introduction and Motivation

With the progression of image restoration, the demand for refined perceptual quality and intelligent processing of IR results has risen. Methods grounded in generative priors have improved substantially by incorporating high-quality generative models, but pushing them further requires scaling up the models themselves. The authors argue that scaling in IR has so far been held back by engineering constraints such as computing resources and architecture design, alongside the lack of sufficiently large, high-quality training data.

Core Approach and Methodology

Generative Prior and Degradation-Robust Encoder

SUPIR employs the Stable Diffusion XL (SDXL) model as its generative prior, which enables efficient generation of high-resolution images. By fine-tuning the image encoder to be robust to degradations, SUPIR keeps the latent representation of low-quality (LQ) inputs consistent with that of their clean counterparts.
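
As a rough illustration of the idea, the sketch below fine-tunes an SDXL-style VAE encoder so that the latent of a synthetically degraded image matches the latent of its clean counterpart. This is a minimal PyTorch/diffusers-style sketch, not the authors' exact objective; the `make_degraded` callable and the simple L2 latent loss are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def degradation_robust_step(vae, optimizer, hq_images, make_degraded):
    """One fine-tuning step: push the encoder's latent for a degraded image
    toward the (frozen) latent of the corresponding high-quality image.

    vae           -- SDXL-style autoencoder; encode(x) returns a latent distribution
    hq_images     -- batch of high-quality images, shape (B, 3, H, W), values in [-1, 1]
    make_degraded -- callable applying synthetic blur / noise / JPEG degradations
    """
    lq_images = make_degraded(hq_images)

    with torch.no_grad():                                 # target latent from the clean image
        target = vae.encode(hq_images).latent_dist.mode()

    pred = vae.encode(lq_images).latent_dist.mode()       # encoder being fine-tuned

    loss = F.mse_loss(pred, target)                       # make LQ latents match HQ latents
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```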

Extensive Data Collection

A critical component of SUPIR's training is a bespoke high-quality dataset composed of 20 million high-resolution images, each annotated with detailed text descriptions. This scale and quality of data are unprecedented in the IR domain. The dataset also includes an additional set of 70k high-resolution facial images to bolster the model's ability to restore faces effectively and 100k low-quality images generated using the SDXL model to understand negative-quality concepts better.

Model Scaling and Adaptor Design

To use the SDXL model effectively within an IR framework, the authors design a scalable adaptor that follows the ControlNet paradigm. By trimming half of the Vision Transformer (ViT) blocks in the adaptor and introducing the ZeroSFT connector to inject the adaptor's features back into SDXL, they lower the computational burden while still steering SDXL's large parameter count effectively.
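
A possible shape of the ZeroSFT connector is sketched below, assuming it combines ControlNet-style zero initialization with a spatial feature transform (scale-and-shift modulation) of the decoder features; the exact layer layout in SUPIR may differ, so treat this as a sketch of the concept rather than the paper's module.

```python
import torch
import torch.nn as nn

def zero_module(module):
    """Zero-initialize a layer so the connector starts as a no-op,
    the same trick ControlNet uses for its zero convolutions."""
    for p in module.parameters():
        nn.init.zeros_(p)
    return module

class ZeroSFT(nn.Module):
    """Fuse adaptor (control) features into a decoder feature map via a
    zero-initialized spatial feature transform: out = h * (1 + scale) + shift.
    Assumes control_feat and decoder_feat share the same spatial size."""
    def __init__(self, control_channels, feature_channels):
        super().__init__()
        self.scale = zero_module(nn.Conv2d(control_channels, feature_channels, 3, padding=1))
        self.shift = zero_module(nn.Conv2d(control_channels, feature_channels, 3, padding=1))

    def forward(self, decoder_feat, control_feat):
        scale = self.scale(control_feat)
        shift = self.shift(control_feat)
        return decoder_feat * (1 + scale) + shift
```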

Restoration-Guided Sampling and Textual Prompts

One of SUPIR's innovative contributions is the use of textual prompts to guide the restoration process. By adopting the LLaVA multi-modal LLM, SUPIR can understand and leverage textual descriptions of images. Furthermore, the model employs a novel restoration-guided sampling strategy to keep the generated content faithful to the LQ input: by dynamically adjusting the restoration guidance with a hyper-parameter during the diffusion process, the method strikes a balance between fidelity and perceptual quality.
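
The sketch below captures the spirit of restoration-guided sampling: at each denoising step the predicted clean latent is pulled back toward the LQ latent, with a strength set by a hyper-parameter (`tau` here) that decays over the trajectory. The interpolation rule and the scheduler interface are illustrative assumptions, not the paper's exact formulation.

```python
import torch

@torch.no_grad()
def restoration_guided_sampling(denoiser, z_lq, timesteps, scheduler, tau=0.5):
    """Diffusion sampling that pulls each step's prediction toward the LQ latent.

    denoiser  -- predicts the clean latent z0 from (z_t, t)
    z_lq      -- latent of the degraded input (from the degradation-robust encoder)
    scheduler -- assumed to expose step(z0_hat, t, z_t) -> z_{t-1}
    tau       -- guidance strength; larger values favor fidelity to the LQ input
    """
    z_t = torch.randn_like(z_lq)
    for i, t in enumerate(timesteps):
        z0_hat = denoiser(z_t, t)                    # model's guess of the clean latent

        # Guidance weight decays as sampling proceeds, so early (noisy) steps
        # stay close to the LQ content and late steps are free to add detail.
        w = tau * (1.0 - i / len(timesteps))
        z0_hat = (1 - w) * z0_hat + w * z_lq         # pull prediction toward LQ latent

        z_t = scheduler.step(z0_hat, t, z_t)         # standard diffusion update
    return z_t
```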

Experimental Results

Comparison with State-of-the-Art Methods

Evaluation on several datasets shows that SUPIR outperforms existing methods such as BSRGAN, Real-ESRGAN, StableSR, DiffBIR, and PASD. Qualitative analyses demonstrate SUPIR's ability to restore textures and details accurately across a wide range of degraded images. Although SUPIR's scores on traditional full-reference metrics such as PSNR and SSIM do not always surpass those of competing methods, its strength on no-reference metrics such as MANIQA, CLIPIQA, and MUSIQ reflects superior perceptual quality, aligning more closely with human judgments.
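
To reproduce this kind of perceptual comparison, the no-reference scores can be computed with the pyiqa (IQA-PyTorch) package, which to our knowledge exposes MANIQA, CLIPIQA, and MUSIQ through `pyiqa.create_metric`; metric names and the interface below should be verified against the installed version.

```python
import torch
import pyiqa  # IQA-PyTorch; assumed to provide these no-reference metrics

device = "cuda" if torch.cuda.is_available() else "cpu"

# Higher scores indicate better perceptual quality for all three metrics.
metrics = {name: pyiqa.create_metric(name, device=device)
           for name in ("maniqa", "clipiqa", "musiq")}

# Stand-in for a restored image tensor with values in [0, 1].
restored = torch.rand(1, 3, 512, 512, device=device)

for name, metric in metrics.items():
    print(name, float(metric(restored)))
```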

Restoration in Real-World Scenarios

Evaluated on real-world LQ images, SUPIR not only achieves superior qualitative performance but also ranks highly in user studies, thus solidifying its practical applicability in diverse contexts, from landscape to portrait images.

Ablation Studies and Analysis

Impact of Training Data and Model Architecture

Ablation studies reveal the significance of large-scale high-quality training data and the effectiveness of the ZeroSFT connector. Comparisons with models trained on smaller datasets such as DIV2K and LSDIR underscore the necessity of expansive datasets to achieve the desired performance.

Negative Prompts and Quality Analysis

The introduction of negative prompts through classifier-free guidance significantly enhances visual quality. The authors also highlight that without negative-quality samples in the training phase, the model risks generating artifacts when processing low-quality inputs.
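
A minimal sketch of how a negative-quality prompt fits into classifier-free guidance follows: the negative prompt's embedding takes the place of the usual unconditional branch, so the guidance step pushes the sample away from low-quality concepts. The function and its signature are illustrative, not SUPIR's exact implementation.

```python
import torch

def cfg_with_negative_prompt(eps_model, z_t, t, pos_emb, neg_emb, guidance_scale=7.5):
    """Classifier-free guidance where the 'unconditional' branch is replaced by a
    negative-quality prompt (e.g. 'blurry, noisy, low quality'), so guidance
    steers away from those concepts rather than from an empty prompt.

    eps_model -- predicts noise eps(z_t, t, text_embedding)
    pos_emb   -- embedding of the positive / descriptive prompt
    neg_emb   -- embedding of the negative-quality prompt
    """
    eps_pos = eps_model(z_t, t, pos_emb)
    eps_neg = eps_model(z_t, t, neg_emb)
    # Standard CFG update: move from the negative branch toward the positive one.
    return eps_neg + guidance_scale * (eps_pos - eps_neg)
```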

Implications and Future Directions

SUPIR poses substantial implications for both theoretical research and practical applications. By advancing the integration of model scaling techniques with sophisticated generative models and extensive datasets, SUPIR sets a new standard for image restoration tasks. The ability to control restoration through textual prompts offers a novel avenue for user interaction in image editing. Future developments might explore extending this framework to accommodate video restoration and expanding the multi-modal aspects to incorporate more diverse data types.

Conclusion

The authors present a comprehensive approach to scaling in image restoration, leveraging the capacities of large model architectures and extensive datasets. Their method, SUPIR, demonstrates significant progress in achieving intelligent and photo-realistic image restoration, with vast potential for future innovations in AI-driven image enhancement technologies.

Overall, this paper is a significant contribution to the field, opening avenues for new research and practical applications and pushing image restoration quality further toward photo-realism.
