Exploiting Diffusion Prior for Real-World Image Super-Resolution
The paper introduces StableSR, a method that leverages a pre-trained diffusion model, specifically Stable Diffusion, to tackle blind super-resolution (SR). The approach is notable for exploiting the generative prior without retraining the diffusion model itself: only lightweight modules are trained, which keeps computational overhead low while maintaining high-quality output.
Methodology
StableSR trains a time-aware encoder on top of a frozen, pre-trained diffusion model. The encoder extracts features from the low-resolution (LR) image and injects them into the diffusion network through spatial feature transformation (SFT) layers, steering generation toward the LR content without compromising the generative capability of the frozen model.
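As a rough illustration of this conditioning mechanism, the sketch below shows an SFT layer in PyTorch. The class name, layer sizes, and kernel choices are assumptions made for illustration, not the paper's code; only the affine scale-and-shift form follows the standard SFT formulation.

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial feature transformation: predicts per-pixel scale and shift
    from the encoder's conditioning features and applies them to an
    intermediate feature map of the frozen diffusion U-Net. Sketch only."""

    def __init__(self, cond_channels: int, feat_channels: int):
        super().__init__()
        self.to_scale = nn.Conv2d(cond_channels, feat_channels, 3, padding=1)
        self.to_shift = nn.Conv2d(cond_channels, feat_channels, 3, padding=1)

    def forward(self, diff_feat: torch.Tensor, cond_feat: torch.Tensor) -> torch.Tensor:
        # Affine modulation of the frozen features: the diffusion weights
        # stay fixed; only the SFT layers and the time-aware encoder that
        # produces cond_feat are trained.
        scale = self.to_scale(cond_feat)
        shift = self.to_shift(cond_feat)
        return diff_feat * (1 + scale) + shift
```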
Key Components
- Time-Aware Encoder: The encoder takes the diffusion timestep embedding as input, so the strength of the conditioning adapts to the signal-to-noise ratio (SNR) at each step of the diffusion process. This adaptivity helps the model recover the high-frequency detail required for SR.
- Controllable Feature Wrapping (CFW): Inspired by CodeFormer, CFW lets users trade realism against fidelity by adjusting an interpolation coefficient (see the first sketch after this list). This control is useful for tailoring the output to applications with different quality constraints.
- Progressive Aggregation Sampling: Overcomes the diffusion model's fixed input resolution by splitting the image into overlapping patches, processing each patch, and blending the results with Gaussian weights (see the second sketch after this list). The weighting preserves continuity across patch boundaries and lets the model handle arbitrarily sized inputs.
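A minimal sketch of the CFW idea, assuming the blend takes the form of a residual added to the decoder features and scaled by the user coefficient; the connector architecture here is a hypothetical stand-in, not the paper's module.

```python
import torch
import torch.nn as nn

class CFW(nn.Module):
    """Controllable feature wrapping: blends fidelity-oriented encoder
    features into the generative decoder features with a user-set
    coefficient w. Sketch only; the connector design is an assumption."""

    def __init__(self, channels: int):
        super().__init__()
        # Hypothetical connector: a small conv block fusing both streams.
        self.connector = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor,
                w: float = 0.5) -> torch.Tensor:
        # w = 0 leans on the generative output (realism); larger w leans
        # on the LR-derived encoder features (fidelity).
        residual = self.connector(torch.cat([enc_feat, dec_feat], dim=1))
        return dec_feat + w * residual
```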
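And a minimal sketch of the Gaussian-weighted aggregation step. The helper names and the sigma choice are illustrative; in StableSR the blending is applied to overlapping latent patches during sampling, but only the weighting logic is shown here.

```python
import torch

def gaussian_weights(h: int, w: int, sigma_frac: float = 0.3) -> torch.Tensor:
    """2-D Gaussian mask peaking at the patch centre, so overlapping
    patch predictions fade out smoothly toward their borders."""
    ys = torch.arange(h).float() - (h - 1) / 2
    xs = torch.arange(w).float() - (w - 1) / 2
    return torch.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2)
                     / (2 * (sigma_frac * min(h, w)) ** 2))

def aggregate(patches, coords, out_shape, patch_size):
    """Blend per-patch outputs (each of shape (C, p, p)) into one image;
    `coords` holds each patch's top-left corner in the output."""
    c, H, W = out_shape
    out = torch.zeros(c, H, W)
    weight = torch.zeros(1, H, W)
    g = gaussian_weights(patch_size, patch_size)
    for patch, (y, x) in zip(patches, coords):
        out[:, y:y + patch_size, x:x + patch_size] += patch * g
        weight[:, y:y + patch_size, x:x + patch_size] += g
    # Normalise by accumulated weight so overlaps average smoothly.
    return out / weight.clamp(min=1e-8)
```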
Experimental Evaluation
The paper reports consistent improvements over state-of-the-art methods on both synthetic and real-world benchmarks, including lower FID and higher CLIP-IQA scores, which indicate better perceptual quality. The model is particularly effective at preserving texture detail and suppressing artifacts compared to competitors such as BSRGAN and Real-ESRGAN+.
Implications
The proposed method offers a robust alternative to SR models trained from scratch: by fine-tuning lightweight components on top of an existing generative model, it cuts both training cost and computational requirements. This could pave the way for more efficient development pipelines in content-creation tasks where fidelity and realism are both crucial.
Future Directions
Future work could explore:
- Enhancements in computational efficiency through acceleration strategies like model distillation or fast sampling methods.
- More extensive exploration of prompt engineering to further refine classifier-free guidance within the diffusion model (the standard guidance rule is sketched below).
- New mechanisms to handle varying degradation patterns beyond those tested, enhancing the applicability of diffusion models to broader datasets.
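For context on the second point, classifier-free guidance combines unconditional and prompt-conditioned noise predictions. The one-liner below is the standard formulation, independent of StableSR's specific implementation:

```python
def cfg_noise(eps_uncond, eps_cond, guidance_scale: float):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the prompt-conditioned one; scale > 1 strengthens
    # prompt adherence, typically at some cost to fidelity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```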
In conclusion, StableSR balances computational efficiency and output quality by exploiting an existing generative model in a targeted way, and it provides a strong baseline for the use of diffusion priors in real-world super-resolution.