Real-World Video Super-Resolution Enhanced by Text-Guided Diffusion
Introduction
In the domain of video super-resolution (VSR), producing temporally consistent high-quality videos from low-quality inputs is essential. Traditional methods struggle to generate realistic textures and details, in part because they are trained on synthetic or camera-specific degradations that transfer poorly to real-world footage. Meanwhile, recent diffusion models hold great promise due to their generative capabilities but struggle when applied to VSR because of their stochastic nature, which often leads to temporal inconsistencies.
Overcoming Temporal Inconsistency
A research initiative presents 'Upscale-A-Video,' a novel text-guided latent diffusion framework designed to upscale videos with high fidelity and temporal consistency. This solution integrates a local-global temporal strategy tailored for video data. Locally, it finetunes a pretrained ×4 image upscaling model with additional temporal layers comprising 3D convolutions and temporal attention. Globally, it introduces a flow-guided recurrent latent propagation module to maintain consistency across longer sequences. This training-free module propagates latents bidirectionally through the sequence, stabilizing long videos beyond what the local temporal layers alone can achieve.
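To make the "temporal attention" idea concrete, here is a minimal NumPy sketch of self-attention applied across the time axis of per-frame latents. It is an illustration only: the identity Q/K/V projections and flattened spatial dimensions are simplifications of my own, not the paper's architecture, which uses learned projections inside a U-Net.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(latents):
    """Single-head self-attention across the time axis.

    latents: array of shape (T, C) -- one latent vector per frame,
    with spatial dimensions flattened away for brevity. Each frame
    attends to every other frame, so features mix along time and
    neighboring frames can agree on shared structure.
    """
    T, C = latents.shape
    # Identity projections for illustration; real layers learn W_q, W_k, W_v.
    q, k, v = latents, latents, latents
    attn = softmax(q @ k.T / np.sqrt(C), axis=-1)  # (T, T) frame-to-frame weights
    return attn @ v                                # (T, C) temporally mixed features

frames = np.random.randn(8, 16)   # 8 frames, 16-dim latent each
out = temporal_self_attention(frames)
print(out.shape)  # (8, 16)
```

Because attention is computed jointly over all frames in the clip, this is the "local" mechanism; it cannot by itself enforce consistency between clips processed separately, which is what the flow-guided propagation module addresses.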
Versatility and User Control
The authors further enhance their approach by exploring text prompts and noise levels as additional conditions during inference. Text prompts steer the model toward particular textures, such as animal fur or an oil-painting style, while adjusting the noise level trades restoration fidelity against the generation of finer detail. Moreover, Classifier-Free Guidance is adopted, amplifying the influence of both the text prompt and the noise level and thus further refining video quality.
Experimental Success
Extensive experimentation shows that 'Upscale-A-Video' outperforms current methods on synthetic, real-world, and AI-generated video benchmarks, displaying exceptional visual realism and temporal consistency. Quantitative metrics such as PSNR, SSIM, and LPIPS confirm the framework's restoration abilities. Qualitative analyses further underscore its detail recovery and realistic texture generation, effectively leveraging text prompts and optional noise adjustment to deliver excellent fidelity and quality.
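Of the metrics mentioned, PSNR is the simplest to state exactly: it is a log-scaled inverse of the mean squared error between the restored frame and the reference (SSIM and LPIPS are more involved, measuring structural and learned perceptual similarity respectively). A minimal implementation for 8-bit images:

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.full((4, 4), 100.0)
restored = ref + 10.0          # uniform error of 10 per pixel -> MSE = 100
print(psnr(ref, restored))     # ~28.13 dB
```

Note that PSNR and SSIM reward pixel-wise fidelity, while LPIPS (lower is better) correlates more closely with perceived realism, which is why generative methods are usually evaluated on both kinds of metric.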
Conclusion
'Upscale-A-Video' marks a significant advancement in real-world VSR, successfully employing a text-guided latent diffusion model to enhance temporal coherence and detail generation. Its methodology offers a robust foundation for future VSR work, particularly in real-world scenarios where temporal consistency and visual realism are crucial.