One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
(2506.15591v2)
Published 18 Jun 2025 in cs.CV and cs.AI
Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
This paper introduces Dual LoRA Learning (DLoRAL), a novel framework for Real-World Video Super-Resolution (Real-VSR) that aims to generate videos with rich spatial details while maintaining temporal consistency. The core challenge in Real-VSR, especially when using pre-trained generative models like Stable Diffusion (SD), is the trade-off between detail enhancement and temporal coherence. Existing methods often sacrifice one for the other. DLoRAL addresses this by decoupling the learning of temporal consistency and spatial details using a one-step diffusion model.
The proposed DLoRAL framework leverages a pre-trained SD model and introduces two specialized Low-Rank Adaptation (LoRA) modules: a Consistency-LoRA (C-LoRA) and a Detail-LoRA (D-LoRA). These modules are trained in an alternating, dual-stage process.
Key Components and Methodology:
One-Step Residual Diffusion: The system builds upon a one-step residual diffusion model. Instead of multiple denoising steps, it refines the low-quality (LQ) latent code $z_{LQ}$ to a high-quality (HQ) latent code $z_{HQ}$ in a single step via $z_{HQ} = z_{LQ} - \epsilon_\theta(z_{LQ})$, where $\epsilon_\theta$ is the noise prediction network. This significantly speeds up inference.
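A minimal PyTorch-style sketch of this single-step refinement, assuming `noise_pred_net` stands in for the LoRA-adapted SD UNet (names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

def one_step_restore(z_lq: torch.Tensor, noise_pred_net: nn.Module) -> torch.Tensor:
    """Refine the LQ latent to an HQ latent in a single step: z_HQ = z_LQ - eps_theta(z_LQ)."""
    eps = noise_pred_net(z_lq)   # predicted residual/noise for the LQ latent
    z_hq = z_lq - eps            # a single subtraction replaces iterative denoising
    return z_hq
```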
Cross-Frame Retrieval (CFR) Module: To exploit temporal information from degraded LQ inputs, the CFR module aggregates complementary information from adjacent frames. For a current frame $I_n^{LQ}$ and its preceding frame $I_{n-1}^{LQ}$, their latent codes $z_n^{LQ}$ and $z_{n-1}^{LQ}$ are processed. The CFR module first aligns $z_{n-1}^{LQ}$ to $z_n^{LQ}$'s coordinate space using SpyNet ($F_{wp}$). Then, using 1×1 convolutions, it projects $z_n^{LQ}$ to query ($Q_n$) and the aligned $F_{wp}(z_{n-1}^{LQ})$ to key ($K_{n-1}$) and value ($V_{n-1}$) embeddings.
The fusion mechanism selectively attends to the top-$k$ most similar positions and uses a learnable threshold $\tau_n[p]$ for gating, producing a temporally enriched LQ latent $\bar{z}_n^{LQ}$.
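The sketch below illustrates the retrieval-and-gating idea under stated assumptions: the previous-frame latent is assumed to be already warped by SpyNet, and the exact gating and fusion rules are simplified guesses rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFrameRetrieval(nn.Module):
    """Illustrative cross-frame retrieval: attend from the current-frame latent to the
    (pre-aligned) previous-frame latent, keep only the top-k most similar positions,
    and gate the retrieved features with a learnable threshold."""

    def __init__(self, dim: int, k: int = 4):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, kernel_size=1)   # query from current frame
        self.to_k = nn.Conv2d(dim, dim, kernel_size=1)   # key from aligned previous frame
        self.to_v = nn.Conv2d(dim, dim, kernel_size=1)   # value from aligned previous frame
        self.tau = nn.Parameter(torch.zeros(1))          # learnable gating threshold (assumption)
        self.topk = k

    def forward(self, z_cur: torch.Tensor, z_prev_aligned: torch.Tensor) -> torch.Tensor:
        b, c, h, w = z_cur.shape
        q = self.to_q(z_cur).flatten(2).transpose(1, 2)            # (B, HW, C)
        k = self.to_k(z_prev_aligned).flatten(2)                   # (B, C, HW)
        v = self.to_v(z_prev_aligned).flatten(2).transpose(1, 2)   # (B, HW, C)

        sim = torch.bmm(q, k) / c ** 0.5                           # (B, HW, HW) similarity
        topv, topi = sim.topk(self.topk, dim=-1)                   # keep top-k per query position
        attn = F.softmax(topv, dim=-1)                             # attention over top-k only
        gathered = torch.gather(
            v.unsqueeze(1).expand(-1, h * w, -1, -1), 2,
            topi.unsqueeze(-1).expand(-1, -1, -1, c))              # (B, HW, k, C)
        retrieved = (attn.unsqueeze(-1) * gathered).sum(2)         # (B, HW, C)

        gate = (topv.max(-1).values > self.tau).float().unsqueeze(-1)  # threshold-based gating
        fused = q + gate * retrieved                               # temporally enriched latent
        return fused.transpose(1, 2).reshape(b, c, h, w)
```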
Dual LoRA Modules:
Consistency-LoRA (C-LoRA): This module, along with the CFR module, is trained during the "temporal consistency stage." It learns robust temporal representations from the fused LQ latent features $\bar{z}_n^{LQ}$.
Detail-LoRA (D-LoRA): This module is trained during the "detail enhancement stage." It focuses on restoring high-frequency spatial details.
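Both modules follow the standard LoRA recipe of adding a trainable low-rank residual on top of frozen SD weights. A generic sketch, with layer placement, rank, and scaling chosen for illustration only:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic low-rank adapter around a frozen base layer (not the paper's exact module placement)."""
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                   # SD weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)     # A: project to low rank
        self.up = nn.Linear(rank, base.out_features, bias=False)      # B: project back
        nn.init.zeros_(self.up.weight)                                # adapter starts as a no-op
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# DLoRAL attaches two such adapters per target layer: C-LoRA (consistency) and
# D-LoRA (detail), each trained in its own stage while the other is frozen.
```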
Dual-Stage Alternating Training:
Temporal Consistency Stage: The CFR and C-LoRA modules are trained while D-LoRA is frozen. The goal is to establish strong temporal coherence. The loss function $\mathcal{L}_{cons}$ includes a pixel-level loss ($\mathcal{L}_{pix}$, using $\ell_2$), an LPIPS loss ($\mathcal{L}_{lpips}$), and an optical flow loss ($\mathcal{L}_{opt}$):
$\mathcal{L}_{cons} = \lambda_{pix}\mathcal{L}_{pix} + \lambda_{lpips}\mathcal{L}_{lpips} + \lambda_{opt}\mathcal{L}_{opt}$
$\mathcal{L}_{opt} = \left\| F(I_n^{HQ}, I_{n+1}^{HQ}) - F(I_n^{GT}, I_{n+1}^{GT}) \right\|_1$
where $F(\cdot, \cdot)$ denotes the optical flow estimated between two frames.
Detail Enhancement Stage: The CFR and C-LoRA modules are frozen, and D-LoRA is trained. The focus is on improving spatial visual quality while maintaining the learned consistency. The loss function $\mathcal{L}_{enh}$ includes the previous losses plus a Classifier Score Distillation (CSD) loss ($\mathcal{L}_{csd}$) to encourage richer details.
These two stages are alternated iteratively. A smooth transition between stages is achieved by interpolating the loss functions over a warm-up period.
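A hedged sketch of how the two objectives and the warm-up interpolation could be organized; the loss weights, the blending rule, and the helper callables (`lpips_fn`, `csd_fn`, `flow_net`) are assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def training_losses(pred, gt, flow_net, lpips_fn, csd_fn, stage, blend,
                    w_pix=1.0, w_lpips=1.0, w_opt=1.0, w_csd=1.0):
    """Compute the stage-dependent loss; `pred` and `gt` are (T, C, H, W) frame stacks."""
    l_pix = F.mse_loss(pred, gt)                                     # pixel-level l2 loss
    l_lpips = lpips_fn(pred, gt)                                     # perceptual LPIPS loss
    # L_opt = || F(I_n^HQ, I_{n+1}^HQ) - F(I_n^GT, I_{n+1}^GT) ||_1
    l_opt = (flow_net(pred[:-1], pred[1:]) -
             flow_net(gt[:-1], gt[1:])).abs().mean()
    l_cons = w_pix * l_pix + w_lpips * l_lpips + w_opt * l_opt       # consistency-stage loss
    if stage == "consistency":
        return l_cons
    l_enh = l_cons + w_csd * csd_fn(pred)                            # detail stage adds CSD loss
    return (1.0 - blend) * l_cons + blend * l_enh                    # warm-up interpolation
```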
Figure 2 of the paper illustrates the overall training pipeline, i.e., the dual-stage training of CFR, C-LoRA, and D-LoRA.
Inference: During inference, both C-LoRA and D-LoRA are merged into the main SD UNet. The model processes the LQ video frames (current frame $I_n^{LQ}$ and preceding frame $I_{n-1}^{LQ}$) with the CFR module and then the enhanced UNet in a single diffusion step to produce the HQ frame $I_n^{HQ}$; the video sequence is processed in a sliding-window manner.
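Merging a LoRA branch into the frozen weights is the standard fold of the low-rank product into the base matrix, so inference pays no extra cost. A generic sketch, with per-layer scales assumed for illustration:

```python
import torch

@torch.no_grad()
def merge_dual_lora(base_weight: torch.Tensor,
                    c_down: torch.Tensor, c_up: torch.Tensor,
                    d_down: torch.Tensor, d_up: torch.Tensor,
                    scale: float = 1.0) -> torch.Tensor:
    """Fold both LoRA branches into a frozen weight matrix:
    W' = W + scale * (B_c @ A_c) + scale * (B_d @ A_d)."""
    return base_weight + scale * (c_up @ c_down) + scale * (d_up @ d_down)
```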
Implementation Details:
Backbone: Pre-trained Stable Diffusion V2.1.
Training: Batch size 16, sequence length 3, resolution 512×512, on 4 NVIDIA A100 GPUs. Adam optimizer with learning rate $5\times10^{-5}$.
Datasets:
Consistency Stage: REDS dataset and curated videos from Pexels (44,162 frames).
Enhancement Stage: LSDIR dataset, with simulated video sequences generated by random pixel-level translations.
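A minimal sketch of how pseudo video clips could be generated from still HQ images via random pixel-level translations; the shift range, clip length, and circular padding are assumptions for illustration:

```python
import random
import torch

def make_pseudo_sequence(img: torch.Tensor, length: int = 3, max_shift: int = 8) -> torch.Tensor:
    """Build a short pseudo video (T, C, H, W) from a single image by small random shifts."""
    frames = []
    for _ in range(length):
        dx = random.randint(-max_shift, max_shift)
        dy = random.randint(-max_shift, max_shift)
        frames.append(torch.roll(img, shifts=(dy, dx), dims=(-2, -1)))  # circular translation
    return torch.stack(frames, dim=0)
```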
Evaluation Metrics: PSNR, SSIM, LPIPS, DISTS, MUSIQ, MANIQA, CLIPIQA, DOVER, and average warping error ($E_{warp}^*$) for temporal consistency.
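For reference, the average warping error measures how well each frame, warped by the estimated optical flow, matches the next frame. A simplified sketch with flow estimation and occlusion masking omitted and normalization conventions assumed:

```python
import torch
import torch.nn.functional as F

def warping_error(frames: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W); flows: (T-1, 2, H, W) forward flows in pixel units."""
    t, c, h, w = frames.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float()                   # (H, W, 2) pixel grid
    grid = base.unsqueeze(0) + flows.permute(0, 2, 3, 1)           # displaced sampling grid
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1                  # normalize x to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1                  # normalize y to [-1, 1]
    warped = F.grid_sample(frames[:-1], grid, align_corners=True)  # warp frame t toward t+1
    return (warped - frames[1:]).abs().mean()                      # average warping residual
```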
Results and Contributions:
Performance: DLoRAL achieves state-of-the-art performance on Real-VSR benchmarks, outperforming existing methods in perceptual quality (LPIPS, DISTS, MUSIQ, CLIPIQA, MANIQA, DOVER) while maintaining good temporal consistency ($E_{warp}^*$).
Efficiency: Thanks to the one-step diffusion and the merged LoRA branches, DLoRAL is significantly faster (e.g., about 10× faster than Upscale-A-Video and MGLD-VSR) and has a parameter count comparable to other efficient methods such as OSEDiff.
Qualitative Results: Visual comparisons show DLoRAL produces sharper details, better facial reconstruction, and more legible textures compared to other methods, while temporal profiles indicate smoother transitions.
User Study: DLoRAL was overwhelmingly preferred by human evaluators (93 out of 120 votes) against three other diffusion-based Real-VSR methods for its balance of perceptual quality and temporal consistency.
Main Contributions:
A Dual LoRA Learning (DLoRAL) paradigm for Real-VSR that decouples temporal consistency and spatial detail learning into two dedicated LoRA modules within a one-step diffusion framework.
A Cross-Frame Retrieval (CFR) module to extract degradation-robust temporal priors from LQ inputs, guiding both C-LoRA and D-LoRA training.
State-of-the-art performance in Real-VSR, achieving both realistic details and temporal stability efficiently.
Limitations:
The 8× downsampling VAE inherited from SD makes it difficult to restore very fine-scale details (e.g., small text).
The VAE's heavy compression might disrupt temporal coherence, making robust consistency prior extraction harder. The authors suggest a VAE specifically designed for Real-VSR could mitigate this.
In essence, DLoRAL provides a practical and effective approach to Real-VSR by cleverly managing the conflicting objectives of detail enhancement and temporal consistency through a dual LoRA architecture and a staged training strategy, all while ensuring efficient inference via a one-step diffusion process.