One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
(2506.15591v2)
Published 18 Jun 2025 in cs.CV and cs.AI
Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
This paper introduces Dual LoRA Learning (DLoRAL), a novel framework for Real-World Video Super-Resolution (Real-VSR) that aims to generate videos with rich spatial details while maintaining temporal consistency. The core challenge in Real-VSR, especially when using pre-trained generative models like Stable Diffusion (SD), is the trade-off between detail enhancement and temporal coherence. Existing methods often sacrifice one for the other. DLoRAL addresses this by decoupling the learning of temporal consistency and spatial details using a one-step diffusion model.
The proposed DLoRAL framework leverages a pre-trained SD model and introduces two specialized Low-Rank Adaptation (LoRA) modules: a Consistency-LoRA (C-LoRA) and a Detail-LoRA (D-LoRA). These modules are trained in an alternating, dual-stage process.
Key Components and Methodology:
One-Step Residual Diffusion: The system builds upon a one-step residual diffusion model. Instead of multiple denoising steps, it refines the low-quality (LQ) latent code $z_{LQ}$ to a high-quality (HQ) latent code $z_{HQ}$ in a single step via $z_{HQ} = z_{LQ} - \epsilon_\theta(z_{LQ})$, where $\epsilon_\theta$ is the noise prediction network. This significantly speeds up inference.
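A minimal PyTorch-style sketch of this single-step refinement, assuming `noise_pred_net` stands in for the LoRA-adapted SD UNet (names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

def one_step_restore(z_lq: torch.Tensor, noise_pred_net: nn.Module) -> torch.Tensor:
    """Refine the LQ latent to an HQ latent in a single step: z_HQ = z_LQ - eps_theta(z_LQ)."""
    eps = noise_pred_net(z_lq)   # predicted residual/noise for the LQ latent
    z_hq = z_lq - eps            # a single subtraction replaces iterative denoising
    return z_hq
```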
Cross-Frame Retrieval (CFR) Module: To exploit temporal information from degraded LQ inputs, the CFR module aggregates complementary information from adjacent frames. For a current frame $I_n^{LQ}$ and its preceding frame $I_{n-1}^{LQ}$, their latent codes $z_n^{LQ}$ and $z_{n-1}^{LQ}$ are processed. The CFR module first aligns $z_{n-1}^{LQ}$ to $z_n^{LQ}$'s coordinate space using SpyNet ($F_{wp}$). Then, using 1×1 convolutions, it projects $z_n^{LQ}$ to query ($Q_n$) and the aligned $F_{wp}(z_{n-1}^{LQ})$ to key ($K_{n-1}$) and value ($V_{n-1}$) embeddings.
The fusion mechanism selectively attends to the top-$k$ most similar positions and uses a learnable threshold $\tau_n[p]$ for gating, producing a temporally enriched LQ latent $\bar{z}_n^{LQ}$.
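The sketch below illustrates the retrieval-and-gating idea under stated assumptions: the previous-frame latent is assumed to be already warped by SpyNet, and the exact gating and fusion rules are simplified guesses rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFrameRetrieval(nn.Module):
    """Illustrative cross-frame retrieval: attend from the current-frame latent to the
    (pre-aligned) previous-frame latent, keep only the top-k most similar positions,
    and gate the retrieved features with a learnable threshold."""

    def __init__(self, dim: int, k: int = 4):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, kernel_size=1)   # query from current frame
        self.to_k = nn.Conv2d(dim, dim, kernel_size=1)   # key from aligned previous frame
        self.to_v = nn.Conv2d(dim, dim, kernel_size=1)   # value from aligned previous frame
        self.tau = nn.Parameter(torch.zeros(1))          # learnable gating threshold (assumption)
        self.topk = k

    def forward(self, z_cur: torch.Tensor, z_prev_aligned: torch.Tensor) -> torch.Tensor:
        b, c, h, w = z_cur.shape
        q = self.to_q(z_cur).flatten(2).transpose(1, 2)            # (B, HW, C)
        k = self.to_k(z_prev_aligned).flatten(2)                   # (B, C, HW)
        v = self.to_v(z_prev_aligned).flatten(2).transpose(1, 2)   # (B, HW, C)

        sim = torch.bmm(q, k) / c ** 0.5                           # (B, HW, HW) similarity
        topv, topi = sim.topk(self.topk, dim=-1)                   # keep top-k per query position
        attn = F.softmax(topv, dim=-1)                             # attention over top-k only
        gathered = torch.gather(
            v.unsqueeze(1).expand(-1, h * w, -1, -1), 2,
            topi.unsqueeze(-1).expand(-1, -1, -1, c))              # (B, HW, k, C)
        retrieved = (attn.unsqueeze(-1) * gathered).sum(2)         # (B, HW, C)

        gate = (topv.max(-1).values > self.tau).float().unsqueeze(-1)  # threshold-based gating
        fused = q + gate * retrieved                               # temporally enriched latent
        return fused.transpose(1, 2).reshape(b, c, h, w)
```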
Dual LoRA Modules:
Consistency-LoRA (C-LoRA): This module, along with the CFR module, is trained during the "temporal consistency stage." It learns robust temporal representations from the fused LQ latent features $\bar{z}_n^{LQ}$.
Detail-LoRA (D-LoRA): This module is trained during the "detail enhancement stage." It focuses on restoring high-frequency spatial details.
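Both modules follow the standard LoRA recipe of adding a trainable low-rank residual on top of frozen SD weights. A generic sketch, with layer placement, rank, and scaling chosen for illustration only:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic low-rank adapter around a frozen base layer (not the paper's exact module placement)."""
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                   # SD weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)     # A: project to low rank
        self.up = nn.Linear(rank, base.out_features, bias=False)      # B: project back
        nn.init.zeros_(self.up.weight)                                # adapter starts as a no-op
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# DLoRAL attaches two such adapters per target layer: C-LoRA (consistency) and
# D-LoRA (detail), each trained in its own stage while the other is frozen.
```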
Dual-Stage Alternating Training:
Temporal Consistency Stage: The CFR and C-LoRA modules are trained while D-LoRA is frozen. The goal is to establish strong temporal coherence. The loss function $\mathcal{L}_{cons}$ includes a pixel-level loss ($\mathcal{L}_{pix}$, using $\ell_2$), an LPIPS loss ($\mathcal{L}_{lpips}$), and an optical flow loss ($\mathcal{L}_{opt}$):
$\mathcal{L}_{cons} = \lambda_{pix}\mathcal{L}_{pix} + \lambda_{lpips}\mathcal{L}_{lpips} + \lambda_{opt}\mathcal{L}_{opt}$
$\mathcal{L}_{opt} = \left\| F(I_n^{HQ}, I_{n+1}^{HQ}) - F(I_n^{GT}, I_{n+1}^{GT}) \right\|_1$
where $F(\cdot, \cdot)$ denotes the optical flow estimated between two frames.
Detail Enhancement Stage: The CFR and C-LoRA modules are frozen, and D-LoRA is trained. The focus is on improving spatial visual quality while maintaining the learned consistency. The loss function $\mathcal{L}_{enh}$ includes the previous losses plus a Classifier Score Distillation (CSD) loss ($\mathcal{L}_{csd}$) to encourage richer details.
These two stages are alternated iteratively. A smooth transition between stages is achieved by interpolating the loss functions over a warm-up period.
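A hedged sketch of how the two objectives and the warm-up interpolation could be organized; the loss weights, the blending rule, and the helper callables (`lpips_fn`, `csd_fn`, `flow_net`) are assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def training_losses(pred, gt, flow_net, lpips_fn, csd_fn, stage, blend,
                    w_pix=1.0, w_lpips=1.0, w_opt=1.0, w_csd=1.0):
    """Compute the stage-dependent loss; `pred` and `gt` are (T, C, H, W) frame stacks."""
    l_pix = F.mse_loss(pred, gt)                                     # pixel-level l2 loss
    l_lpips = lpips_fn(pred, gt)                                     # perceptual LPIPS loss
    # L_opt = || F(I_n^HQ, I_{n+1}^HQ) - F(I_n^GT, I_{n+1}^GT) ||_1
    l_opt = (flow_net(pred[:-1], pred[1:]) -
             flow_net(gt[:-1], gt[1:])).abs().mean()
    l_cons = w_pix * l_pix + w_lpips * l_lpips + w_opt * l_opt       # consistency-stage loss
    if stage == "consistency":
        return l_cons
    l_enh = l_cons + w_csd * csd_fn(pred)                            # detail stage adds CSD loss
    return (1.0 - blend) * l_cons + blend * l_enh                    # warm-up interpolation
```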
Figure 2 of the paper illustrates the overall training pipeline, i.e., the dual-stage training of CFR, C-LoRA, and D-LoRA.
Inference: During inference, both C-LoRA and D-LoRA are merged into the main SD UNet. The model processes the LQ video frames (current frame $I_n^{LQ}$ and preceding frame $I_{n-1}^{LQ}$) with the CFR module and then the enhanced UNet in a single diffusion step to produce the HQ frame $I_n^{HQ}$; the video sequence is processed in a sliding-window manner.
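Merging a LoRA branch into the frozen weights is the standard fold of the low-rank product into the base matrix, so inference pays no extra cost. A generic sketch, with per-layer scales assumed for illustration:

```python
import torch

@torch.no_grad()
def merge_dual_lora(base_weight: torch.Tensor,
                    c_down: torch.Tensor, c_up: torch.Tensor,
                    d_down: torch.Tensor, d_up: torch.Tensor,
                    scale: float = 1.0) -> torch.Tensor:
    """Fold both LoRA branches into a frozen weight matrix:
    W' = W + scale * (B_c @ A_c) + scale * (B_d @ A_d)."""
    return base_weight + scale * (c_up @ c_down) + scale * (d_up @ d_down)
```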
Implementation Details:
Backbone: Pre-trained Stable Diffusion V2.1.
Training: Batch size 16, sequence length 3, resolution 512×512, on 4 NVIDIA A100 GPUs. Adam optimizer with learning rate $5\times10^{-5}$.
Datasets:
Consistency Stage: REDS dataset and curated videos from Pexels (44,162 frames).
Enhancement Stage: LSDIR dataset, with simulated video sequences generated by random pixel-level translations.
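A minimal sketch of how pseudo video clips could be generated from still HQ images via random pixel-level translations; the shift range, clip length, and circular padding are assumptions for illustration:

```python
import random
import torch

def make_pseudo_sequence(img: torch.Tensor, length: int = 3, max_shift: int = 8) -> torch.Tensor:
    """Build a short pseudo video (T, C, H, W) from a single image by small random shifts."""
    frames = []
    for _ in range(length):
        dx = random.randint(-max_shift, max_shift)
        dy = random.randint(-max_shift, max_shift)
        frames.append(torch.roll(img, shifts=(dy, dx), dims=(-2, -1)))  # circular translation
    return torch.stack(frames, dim=0)
```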
Evaluation Metrics: PSNR, SSIM, LPIPS, DISTS, MUSIQ, MANIQA, CLIPIQA, DOVER, and average warping error ($E_{warp}^*$) for temporal consistency.
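For reference, the average warping error measures how well each frame, warped by the estimated optical flow, matches the next frame. A simplified sketch with flow estimation and occlusion masking omitted and normalization conventions assumed:

```python
import torch
import torch.nn.functional as F

def warping_error(frames: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W); flows: (T-1, 2, H, W) forward flows in pixel units."""
    t, c, h, w = frames.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float()                   # (H, W, 2) pixel grid
    grid = base.unsqueeze(0) + flows.permute(0, 2, 3, 1)           # displaced sampling grid
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1                  # normalize x to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1                  # normalize y to [-1, 1]
    warped = F.grid_sample(frames[:-1], grid, align_corners=True)  # warp frame t toward t+1
    return (warped - frames[1:]).abs().mean()                      # average warping residual
```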
Results and Contributions:
Performance: DLoRAL achieves state-of-the-art performance on Real-VSR benchmarks, outperforming existing methods in perceptual quality (LPIPS, DISTS, MUSIQ, CLIPIQA, MANIQA, DOVER) while maintaining good temporal consistency ($E_{warp}^*$).
Efficiency: Thanks to the one-step diffusion and the merged LoRA branches, DLoRAL is significantly faster (e.g., about 10× faster than Upscale-A-Video and MGLD-VSR) and has a parameter count comparable to other efficient methods such as OSEDiff.
Qualitative Results: Visual comparisons show DLoRAL produces sharper details, better facial reconstruction, and more legible textures compared to other methods, while temporal profiles indicate smoother transitions.
User Study: DLoRAL was overwhelmingly preferred by human evaluators (93 out of 120 votes) against three other diffusion-based Real-VSR methods for its balance of perceptual quality and temporal consistency.
Main Contributions:
A Dual LoRA Learning (DLoRAL) paradigm for Real-VSR that decouples temporal consistency and spatial detail learning into two dedicated LoRA modules within a one-step diffusion framework.
A Cross-Frame Retrieval (CFR) module to extract degradation-robust temporal priors from LQ inputs, guiding both C-LoRA and D-LoRA training.
State-of-the-art performance in Real-VSR, achieving both realistic details and temporal stability efficiently.
Limitations:
The 8× downsampling VAE inherited from SD makes it difficult to restore very fine-scale details (e.g., small text).
The VAE's heavy compression might disrupt temporal coherence, making robust consistency prior extraction harder. The authors suggest a VAE specifically designed for Real-VSR could mitigate this.
In essence, DLoRAL provides a practical and effective approach to Real-VSR by cleverly managing the conflicting objectives of detail enhancement and temporal consistency through a dual LoRA architecture and a staged training strategy, all while ensuring efficient inference via a one-step diffusion process.