This paper introduces Dual LoRA Learning (DLoRAL), a novel framework for Real-World Video Super-Resolution (Real-VSR) that aims to generate videos with rich spatial details while maintaining temporal consistency. The core challenge in Real-VSR, especially when using pre-trained generative models like Stable Diffusion (SD), is the trade-off between detail enhancement and temporal coherence. Existing methods often sacrifice one for the other. DLoRAL addresses this by decoupling the learning of temporal consistency and spatial details using a one-step diffusion model.
The proposed DLoRAL framework leverages a pre-trained SD model and introduces two specialized Low-Rank Adaptation (LoRA) modules: a Consistency-LoRA (C-LoRA) and a Detail-LoRA (D-LoRA). These modules are trained in an alternating, dual-stage process.
Key Components and Methodology:
- One-Step Residual Diffusion: The system builds upon a one-step residual diffusion model. Instead of running multiple denoising steps, it refines the low-quality (LQ) latent code $z_L$ into a high-quality (HQ) latent code $\hat{z}_H$ in a single step, $\hat{z}_H = z_L - \epsilon_\theta(z_L)$, where $\epsilon_\theta$ is the noise prediction network. This significantly speeds up inference.
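A minimal PyTorch sketch of this single-step refinement is shown below; the `unet` call signature, the fixed timestep, and the exact residual form are illustrative assumptions rather than the paper's exact interface.

```python
import torch

def one_step_refine(z_lq: torch.Tensor, unet, t_fixed: int = 999) -> torch.Tensor:
    """Refine an LQ latent into an HQ latent with a single network evaluation.

    Sketch of a residual formulation: the UNet predicts the degradation
    residual (noise), which is subtracted from the LQ latent in one step.
    """
    # Fixed timestep for the single denoising step (assumption)
    t = torch.full((z_lq.shape[0],), t_fixed, device=z_lq.device, dtype=torch.long)
    eps = unet(z_lq, t)   # predicted residual / noise
    z_hq = z_lq - eps     # one-step residual refinement
    return z_hq
```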
- Cross-Frame Retrieval (CFR) Module: To exploit temporal information from degraded LQ inputs, the CFR module aggregates complementary information from adjacent frames. For a current frame $x^t$ and its preceding frame $x^{t-1}$, their latent codes $z_L^t$ and $z_L^{t-1}$ are processed. The CFR module first aligns $z_L^{t-1}$ to the coordinate space of $z_L^t$ using SpyNet-estimated optical flow. Then, using convolutions, it projects $z_L^t$ to a query ($Q$) embedding and the aligned $z_L^{t-1}$ to key ($K$) and value ($V$) embeddings.
The fusion mechanism selectively attends to the top-$k$ most similar positions for each query and gates the aggregated features with a learnable threshold, producing a temporally enriched LQ latent $\hat{z}_L^t$.
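A simplified, single-head sketch of this retrieval-and-fusion step is given below; the 1×1 projections, the `top_k` value, and the soft form of the threshold gate are assumptions for illustration, and SpyNet alignment is assumed to have been applied before the call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFrameFusion(nn.Module):
    """Simplified sketch of CFR-style top-k attention with threshold gating."""

    def __init__(self, channels: int, top_k: int = 4):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)   # query from current frame
        self.k = nn.Conv2d(channels, channels, 1)   # key from aligned previous frame
        self.v = nn.Conv2d(channels, channels, 1)   # value from aligned previous frame
        self.tau = nn.Parameter(torch.tensor(0.0))  # learnable gating threshold
        self.top_k = top_k

    def forward(self, z_cur: torch.Tensor, z_prev_aligned: torch.Tensor) -> torch.Tensor:
        b, c, h, w = z_cur.shape
        q = self.q(z_cur).flatten(2).transpose(1, 2)          # (b, hw, c)
        k = self.k(z_prev_aligned).flatten(2)                  # (b, c, hw)
        v = self.v(z_prev_aligned).flatten(2).transpose(1, 2)  # (b, hw, c)

        sim = torch.bmm(q, k) / c ** 0.5                       # (b, hw, hw) similarity
        topv, topi = sim.topk(self.top_k, dim=-1)              # keep top-k matches per query
        attn = F.softmax(topv, dim=-1)
        gathered = torch.gather(
            v.unsqueeze(1).expand(-1, h * w, -1, -1), 2,
            topi.unsqueeze(-1).expand(-1, -1, -1, c))          # (b, hw, k, c)
        fused = (attn.unsqueeze(-1) * gathered).sum(2)         # aggregate top-k values

        # Soft gate around a learnable threshold on the retrieved similarity
        gate = torch.sigmoid(topv.mean(-1, keepdim=True) - self.tau)
        return z_cur + (gate * fused).transpose(1, 2).view(b, c, h, w)
```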
- Dual LoRA Modules:
- Consistency-LoRA (C-LoRA): This module, along with the CFR module, is trained during the "temporal consistency stage." It learns robust temporal representations from the fused LQ latent features $\hat{z}_L^t$.
- Detail-LoRA (D-LoRA): This module is trained during the "detail enhancement stage." It focuses on restoring high-frequency spatial details.
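A minimal sketch of how two independent low-rank branches could be attached to a frozen SD linear layer is shown below; the rank, scaling, and the `set_stage` toggle for routing gradients are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """Frozen base linear layer with two low-rank adapters (C-LoRA and D-LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # pre-trained SD weights stay frozen
        in_f, out_f = base.in_features, base.out_features
        self.c_down = nn.Linear(in_f, rank, bias=False)  # Consistency-LoRA branch
        self.c_up = nn.Linear(rank, out_f, bias=False)
        self.d_down = nn.Linear(in_f, rank, bias=False)  # Detail-LoRA branch
        self.d_up = nn.Linear(rank, out_f, bias=False)
        nn.init.zeros_(self.c_up.weight)                 # adapters start as identity
        nn.init.zeros_(self.d_up.weight)
        self.scale = scale

    def forward(self, x):
        return (self.base(x)
                + self.scale * self.c_up(self.c_down(x))
                + self.scale * self.d_up(self.d_down(x)))

    def set_stage(self, stage: str):
        """Train one adapter and freeze the other, per the dual-stage schedule."""
        train_c = stage == "consistency"
        for p in (*self.c_down.parameters(), *self.c_up.parameters()):
            p.requires_grad_(train_c)
        for p in (*self.d_down.parameters(), *self.d_up.parameters()):
            p.requires_grad_(not train_c)
```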
- Dual-Stage Alternating Training:
* Temporal Consistency Stage: The CFR and C-LoRA modules are trained while D-LoRA is frozen. The goal is to establish strong temporal coherence. The loss function combines a pixel-level reconstruction loss, an LPIPS perceptual loss, and an optical flow consistency loss.
* Detail Enhancement Stage: The CFR and C-LoRA modules are frozen, and D-LoRA is trained. The focus is on improving spatial visual quality while maintaining the learned consistency. The loss function includes the previous losses plus a Classifier Score Distillation (CSD) loss to encourage richer details.
These two stages are alternated iteratively. A smooth transition between stages is achieved by interpolating the loss functions over a warm-up period.
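The alternation and warm-up interpolation might look roughly like the following sketch; the stage lengths, warm-up length, blending schedule, and the `consistency_loss` / `enhancement_loss` callables (standing in for the stage objectives described above) are assumptions, and `model.set_stage` follows the toggle sketched earlier.

```python
def blend_weight(step: int, stage_len: int, warmup: int) -> float:
    """Linear ramp from the previous stage's objective to the current one."""
    return min(1.0, (step % stage_len) / max(warmup, 1))

def train(model, loader, opt, consistency_loss, enhancement_loss,
          stage_len: int = 2000, warmup: int = 200, num_steps: int = 20000):
    for step, batch in zip(range(num_steps), loader):
        # Even stages train CFR + C-LoRA, odd stages train D-LoRA.
        stage = "consistency" if (step // stage_len) % 2 == 0 else "detail"
        model.set_stage(stage)

        l_cons = consistency_loss(model, batch)  # pixel + LPIPS + optical-flow terms
        l_enh = enhancement_loss(model, batch)   # same terms plus the CSD term

        # Interpolate objectives during the warm-up window for a smooth hand-over.
        w = blend_weight(step, stage_len, warmup)
        if stage == "consistency":
            loss = w * l_cons + (1 - w) * l_enh
        else:
            loss = w * l_enh + (1 - w) * l_cons

        opt.zero_grad()
        loss.backward()
        opt.step()
```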
The overall training pipeline, showing the dual-stage training of CFR, C-LoRA, and D-LoRA, is illustrated in Figure 2 of the paper.
- Inference: During inference, both C-LoRA and D-LoRA are merged into the main SD UNet. The model processes LQ video frames (the current frame $x_L^t$ and its preceding frame $x_L^{t-1}$) with the CFR module and then the enhanced UNet in a single diffusion step to produce the HQ frame $\hat{x}_H^t$. This sliding-window approach processes the full video sequence.
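Under these assumptions, the sliding-window inference loop could be sketched as below, reusing the `one_step_refine` and cross-frame fusion sketches from above and a diffusers-style SD VAE interface; latent scaling and other details are omitted for brevity.

```python
import torch

@torch.no_grad()
def infer_video(frames, vae, unet, cfr):
    """Process LQ frames one by one, each fused with its preceding frame."""
    outputs, prev_latent = [], None
    for frame in frames:                              # frame: (1, 3, H, W) LQ tensor
        z = vae.encode(frame).latent_dist.mode()      # LQ latent of the current frame
        if prev_latent is None:
            prev_latent = z                            # first frame has no predecessor
        z_fused = cfr(z, prev_latent)                  # cross-frame retrieval / fusion
        z_hq = one_step_refine(z_fused, unet)          # single diffusion step (see above)
        outputs.append(vae.decode(z_hq).sample)        # decoded HQ frame
        prev_latent = z                                # slide the window forward
    return outputs
```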
Implementation Details:
- Backbone: Pre-trained Stable Diffusion V2.1.
- Training: Batch size 16, sequence length 3, trained on 4 NVIDIA A100 GPUs with the Adam optimizer.
- Datasets:
- Consistency Stage: REDS dataset and curated videos from Pexels (44,162 frames).
- Enhancement Stage: LSDIR dataset, with simulated video sequences generated by random pixel-level translations.
- Degradation: RealESRGAN degradation pipeline (blur, noise, downsampling, compression).
- Testing Datasets: UDM10, SPMCS (synthetic), RealVSR, VideoLQ (real-world).
- Evaluation Metrics: PSNR, SSIM, LPIPS, DISTS, MUSIQ, MANIQA, CLIPIQA, DOVER, and the average warping error ($E_{warp}^*$) for temporal consistency.
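For reference, the average warping error used to quantify temporal consistency can be sketched as follows; the flow estimator, the absence of occlusion masking, and the averaging details are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp img with per-pixel flow (b, 2, h, w) via grid_sample."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)   # (2, h, w), (x, y) order
    coords = grid.unsqueeze(0) + flow                              # shifted sampling positions
    coords[:, 0] = 2 * coords[:, 0] / (w - 1) - 1                  # normalize to [-1, 1]
    coords[:, 1] = 2 * coords[:, 1] / (h - 1) - 1
    return F.grid_sample(img, coords.permute(0, 2, 3, 1), align_corners=True)

def warping_error(frames, flow_fn) -> float:
    """Average per-pixel error between each frame and its flow-warped successor."""
    errs = []
    for prev, cur in zip(frames[:-1], frames[1:]):
        flow = flow_fn(prev, cur)   # e.g., a pre-trained optical-flow network, (b, 2, h, w)
        errs.append((prev - warp(cur, flow)).abs().mean())
    return torch.stack(errs).mean().item()
```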
Results and Contributions:
- Performance: DLoRAL achieves state-of-the-art performance on Real-VSR benchmarks, outperforming existing methods in perceptual quality (LPIPS, DISTS, MUSIQ, CLIPIQA, MANIQA, DOVER) while maintaining good temporal consistency (low $E_{warp}^*$).
- Efficiency: Due to the one-step diffusion and LoRA integration, DLoRAL is significantly faster (e.g., ~10x faster than Upscale-A-Video and MGLD-VSR) and has a comparable number of parameters to other efficient methods like OSEDiff.
- Qualitative Results: Visual comparisons show DLoRAL produces sharper details, better facial reconstruction, and more legible textures compared to other methods, while temporal profiles indicate smoother transitions.
- User Study: DLoRAL was overwhelmingly preferred by human evaluators (93 out of 120 votes) against three other diffusion-based Real-VSR methods for its balance of perceptual quality and temporal consistency.
Main Contributions:
- A Dual LoRA Learning (DLoRAL) paradigm for Real-VSR that decouples temporal consistency and spatial detail learning into two dedicated LoRA modules within a one-step diffusion framework.
- A Cross-Frame Retrieval (CFR) module to extract degradation-robust temporal priors from LQ inputs, guiding both C-LoRA and D-LoRA training.
- State-of-the-art performance in Real-VSR, achieving both realistic details and temporal stability efficiently.
Limitations:
- The downsampling VAE inherited from SD makes it difficult to restore very fine-scale details (e.g., small text).
- The VAE's heavy compression can also disrupt temporal coherence, making it harder to extract robust consistency priors. The authors suggest that a VAE designed specifically for Real-VSR could mitigate this.
In essence, DLoRAL provides a practical and effective approach to Real-VSR by cleverly managing the conflicting objectives of detail enhancement and temporal consistency through a dual LoRA architecture and a staged training strategy, all while ensuring efficient inference via a one-step diffusion process.