
SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution (2311.16518v2)

Published 27 Nov 2023 in cs.CV

Abstract: Owing to their powerful generative priors, pre-trained text-to-image (T2I) diffusion models have become increasingly popular for solving the real-world image super-resolution problem. However, heavy quality degradation of input low-resolution (LR) images destroys local structures and can make the image semantics ambiguous. As a result, the content of the reproduced high-resolution image may contain semantic errors, deteriorating super-resolution performance. To address this issue, we present a semantics-aware approach that better preserves the semantic fidelity of generative real-world image super-resolution. First, we train a degradation-aware prompt extractor, which can generate accurate soft and hard semantic prompts even under strong degradation. The hard semantic prompts refer to image tags, aiming to enhance the local perception ability of the T2I model, while the soft semantic prompts complement the hard ones by providing additional representation information. These semantic prompts encourage the T2I model to generate detailed and semantically accurate results. Furthermore, during inference, we integrate the LR images into the initial sampling noise to mitigate the diffusion model's tendency to generate excessive random details. Experiments show that our method reproduces more realistic image details and better preserves the semantics. The source code of our method can be found at https://github.com/cswry/SeeSR.

Authors (6)
  1. Rongyuan Wu
  2. Tao Yang
  3. Lingchen Sun
  4. Zhengqiang Zhang
  5. Shuai Li
  6. Lei Zhang

Summary

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

The paper "SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution" presents a novel approach to the enduring problem of image super-resolution (ISR) by integrating semantics-aware methodologies. The authors address the difficulties posed by heavy degradations in low-resolution images, which often lead to ambiguous semantics in the enhanced high-resolution output.

Methodological Overview

The authors propose a semantics-aware approach leveraging pre-trained text-to-image (T2I) diffusion models. This starts with training a degradation-aware prompt extractor to generate semantic prompts from low-resolution images. The prompts are divided into two categories: hard semantic prompts (image tags) and soft semantic prompts (additional representation information). These serve to fortify the T2I model's ability to generate semantically accurate details, even in harshly degraded conditions.
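The prompt-extractor training described above can be pictured as a joint alignment objective: the extractor applied to the degraded image should produce soft features and hard tag logits that match those obtained from the clean high-resolution image. The following is a minimal, hypothetical numpy sketch of that idea, not the paper's actual implementation (the real extractor fine-tunes a tagging model; `extract` here is a toy stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

def extract(image, weights):
    """Toy stand-in for a prompt extractor: returns (soft_features, tag_logits)."""
    hidden = np.tanh(image @ weights["proj"])        # soft semantic embedding
    logits = hidden @ weights["head"]                # hard tag logits
    return hidden, logits

def dape_alignment_loss(lr_image, hr_image, student, teacher):
    """L2-align the degraded branch's soft features and tag logits with the
    clean high-resolution branch, so prompts stay accurate under degradation."""
    s_feat, s_logits = extract(lr_image, student)
    t_feat, t_logits = extract(hr_image, teacher)
    soft_loss = np.mean((s_feat - t_feat) ** 2)      # soft-prompt alignment
    hard_loss = np.mean((s_logits - t_logits) ** 2)  # hard-prompt (tag) alignment
    return soft_loss + hard_loss

dim, n_tags = 16, 8
weights = {"proj": rng.normal(size=(dim, dim)), "head": rng.normal(size=(dim, n_tags))}
hr = rng.normal(size=(4, dim))
lr = hr + 0.3 * rng.normal(size=hr.shape)            # simulated degradation

loss = dape_alignment_loss(lr, hr, weights, weights)
print(float(loss))
```

Minimizing such a loss over many degradations is what makes the extractor "degradation-aware": it learns to recover clean-image semantics from corrupted inputs.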

The SeeSR model operates in two main stages:

  1. Training the Degradation-Aware Prompt Extractor (DAPE): A prompt extractor is fine-tuned to be robust against various degradations, aligning its outputs on degraded low-resolution images with those on the corresponding high-resolution images. This process aims to produce accurate semantic prompts from corrupted inputs.
  2. Inference for Real-World ISR: The semantic prompts guide the diffusion model toward perceptually realistic and semantically correct high-resolution images. During this stage, the LR image is integrated into the initial sampling noise to mitigate the diffusion model's tendency to produce excessive random details.
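The LR-embedding trick in step 2 can be sketched as starting the reverse diffusion from a noised version of the (encoded) LR image rather than from pure Gaussian noise, using the standard DDPM forward-process form. This is a hedged illustration under assumed notation (`alpha_bar_T` for the cumulative noise schedule at the final step), not the paper's exact formulation:

```python
import numpy as np

def lr_embedded_init_noise(lr_latent, alpha_bar_T, rng):
    """Blend the LR latent into the starting noise so the sampler stays
    anchored to the input content instead of hallucinating random detail.
    Uses the DDPM forward form: x_T = sqrt(a)*x_0 + sqrt(1 - a)*eps."""
    noise = rng.normal(size=lr_latent.shape)
    return np.sqrt(alpha_bar_T) * lr_latent + np.sqrt(1.0 - alpha_bar_T) * noise

rng = np.random.default_rng(0)
lr_latent = rng.normal(size=(4, 64, 64))       # stand-in for an encoded LR image
x_T = lr_embedded_init_noise(lr_latent, alpha_bar_T=0.0047, rng=rng)
print(x_T.shape)
```

Because `alpha_bar_T` is small at the final timestep, the starting point is still dominated by noise, but the faint LR signal biases the sampler away from content that contradicts the input.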

Experimental Results

The experiments demonstrate substantial improvements in generating realistic image details while preserving semantic integrity. SeeSR outperformed traditional GAN-based methods at producing perceptually pleasing images with accurate semantics, and showed superior performance on both synthetic and real-world test datasets across metrics such as FID, DISTS, MANIQA, and MUSIQ.

Theoretical and Practical Implications

This work emphasizes the significant potential of integrating semantic awareness in ISR, underscoring the role of T2I diffusion models in managing semantic fidelity. The theoretical implications suggest a promising direction for employing large-scale pretrained models in ISR tasks, potentially transcending the inherent limitations of traditional models concerning unknown degradation issues.

Practically, SeeSR's ability to generate more visually and semantically faithful images holds promise for applications in fields that require high-quality imaging from low-quality inputs, such as medical imaging, security, and content generation.

Future Directions

The integration of semantic prompts in guiding high-quality image generation could be explored further, potentially incorporating more sophisticated prompt extraction methods. Moreover, the approach could be extended to multi-modal applications, where complementary data types are used in tandem to enhance ISR performance.

In conclusion, the SeeSR model represents a meaningful contribution towards the development of semantics-aware image super-resolution, offering both theoretical advancements and practical applications. The successful application of T2I models in mitigating real-world image degradation challenges sets a precedent for future AI developments in image processing and related fields.
