
One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution (2506.15591v2)

Published 18 Jun 2025 in cs.CV and cs.AI

Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.

Authors (6)
  1. Yujing Sun (21 papers)
  2. Lingchen Sun (10 papers)
  3. Shuaizheng Liu (7 papers)
  4. Rongyuan Wu (11 papers)
  5. Zhengqiang Zhang (19 papers)
  6. Lei Zhang (1689 papers)

Summary

This paper introduces Dual LoRA Learning (DLoRAL), a novel framework for Real-World Video Super-Resolution (Real-VSR) that aims to generate videos with rich spatial details while maintaining temporal consistency. The core challenge in Real-VSR, especially when using pre-trained generative models like Stable Diffusion (SD), is the trade-off between detail enhancement and temporal coherence. Existing methods often sacrifice one for the other. DLoRAL addresses this by decoupling the learning of temporal consistency and spatial details using a one-step diffusion model.

The proposed DLoRAL framework leverages a pre-trained SD model and introduces two specialized Low-Rank Adaptation (LoRA) modules: a Consistency-LoRA (C-LoRA) and a Detail-LoRA (D-LoRA). These modules are trained in an alternating, dual-stage process.

Key Components and Methodology:

  1. One-Step Residual Diffusion: The system builds upon a one-step residual diffusion model. Instead of multiple denoising steps, it refines the low-quality (LQ) latent code $z^{LQ}$ to a high-quality (HQ) latent code $z^{HQ}$ in a single step via $z^{HQ} = z^{LQ} - \epsilon_\theta(z^{LQ})$, where $\epsilon_\theta$ is the noise prediction network. This significantly speeds up inference.
  2. Cross-Frame Retrieval (CFR) Module: To exploit temporal information from degraded LQ inputs, the CFR module aggregates complementary information from adjacent frames. For a current frame $I_n^{LQ}$ and its preceding frame $I_{n-1}^{LQ}$, their latent codes $z_n^{LQ}$ and $z_{n-1}^{LQ}$ are processed. The CFR module first aligns $z_{n-1}^{LQ}$ to the coordinate space of $z_n^{LQ}$ using SpyNet ($F_{wp}$). Then, using $1 \times 1$ convolutions, it projects $z_n^{LQ}$ to query embeddings ($Q_n$) and the aligned $F_{wp}(z_{n-1}^{LQ})$ to key ($K_{n-1}$) and value ($V_{n-1}$) embeddings. The fusion mechanism selectively attends to the top-$k$ most similar positions and uses a learnable threshold $\tau_n[p]$ for gating:

    $$\bar{z}^{LQ}_n[p] = z^{LQ}_n[p] + \sum_{q \in F_{topk}[p]} \phi \left( \frac{\langle Q_n[p], K_{n-1}[q] \rangle}{\sqrt{d} - \tau_n[p]} \right) \cdot V_{n-1}[q]$$

    This produces a temporally enriched LQ latent $\bar{z}^{LQ}_n$. A minimal code sketch of this retrieval-and-fusion step is given after this list.

  3. Dual LoRA Modules:
    • Consistency-LoRA (C-LoRA): This module, along with the CFR module, is trained during the "temporal consistency stage." It learns robust temporal representations from the fused LQ latent features $\bar{z}^{LQ}_n$.
    • Detail-LoRA (D-LoRA): This module is trained during the "detail enhancement stage." It focuses on restoring high-frequency spatial details.
  4. Dual-Stage Alternating Training:
    • Temporal Consistency Stage: The CFR and C-LoRA modules are trained while D-LoRA is frozen. The goal is to establish strong temporal coherence. The loss function $\mathcal{L}_{\text{cons}}$ includes a pixel-level loss ($\mathcal{L}_{\text{pix}}$, using $\ell_2$), an LPIPS loss ($\mathcal{L}_{\text{lpips}}$), and an optical flow loss ($\mathcal{L}_{\text{opt}}$):

      $$\mathcal{L}_{\text{cons}} = \lambda_{\text{pix}} \mathcal{L}_{\text{pix}} + \lambda_{\text{lpips}} \mathcal{L}_{\text{lpips}} + \lambda_{\text{opt}} \mathcal{L}_{\text{opt}}$$

      $$\mathcal{L}_{\text{opt}} = \left\| F(I^{HQ}_n, I^{HQ}_{n+1}) - F(I^{\text{GT}}_n, I^{\text{GT}}_{n+1}) \right\|_1$$

      where $F(\cdot, \cdot)$ denotes the optical flow estimated between a pair of frames.

    • Detail Enhancement Stage: The CFR and C-LoRA modules are frozen, and D-LoRA is trained. The focus is on improving spatial visual quality while maintaining the learned consistency. The loss function $\mathcal{L}_{\text{enh}}$ includes the previous losses plus a Classifier Score Distillation (CSD) loss ($\mathcal{L}_{\text{csd}}$) to encourage richer details:

      $$\mathcal{L}_{\text{enh}} = \lambda_{\text{pix}} \mathcal{L}_{\text{pix}} + \lambda_{\text{lpips}} \mathcal{L}_{\text{lpips}} + \lambda_{\text{opt}} \mathcal{L}_{\text{opt}} + \lambda_{\text{csd}} \mathcal{L}_{\text{csd}}$$

These two stages are alternated iteratively. A smooth transition between stages is achieved by interpolating the loss functions over a warm-up period. A minimal sketch of this alternating schedule follows the figure note below.
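
To make the CFR fusion of item 2 concrete, here is a minimal PyTorch-style sketch. This is not the authors' implementation: the previous-frame latent is assumed to be flow-aligned already, the gating threshold is a single scalar rather than the per-position $\tau_n[p]$, and names such as `CrossFrameRetrieval` are hypothetical.

```python
import torch
import torch.nn as nn

class CrossFrameRetrieval(nn.Module):
    """Sketch of a CFR-style fusion: the current-frame latent queries the
    flow-aligned previous-frame latent and adds back only its top-k matches."""

    def __init__(self, channels: int, topk: int = 4):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 1)  # Q_n from z_n
        self.to_k = nn.Conv2d(channels, channels, 1)  # K_{n-1} from warped z_{n-1}
        self.to_v = nn.Conv2d(channels, channels, 1)  # V_{n-1} from warped z_{n-1}
        self.tau = nn.Parameter(torch.zeros(1))       # learnable gating threshold
        self.topk = topk

    def forward(self, z_n: torch.Tensor, z_prev_aligned: torch.Tensor) -> torch.Tensor:
        b, c, h, w = z_n.shape
        q = self.to_q(z_n).flatten(2).transpose(1, 2)             # (B, HW, C)
        k = self.to_k(z_prev_aligned).flatten(2).transpose(1, 2)  # (B, HW, C)
        v = self.to_v(z_prev_aligned).flatten(2).transpose(1, 2)  # (B, HW, C)

        # Scaled similarity between every current position and every previous position.
        sim = torch.einsum("bpc,bqc->bpq", q, k) / c ** 0.5       # (B, HW, HW)

        # Keep only the top-k most similar previous-frame positions per query.
        top_sim, top_idx = sim.topk(self.topk, dim=-1)            # (B, HW, k)

        # Soft gating over the retained matches; tau shifts the logits here
        # (the paper's threshold is per-position; a scalar keeps the sketch short).
        weights = torch.softmax(top_sim - self.tau, dim=-1)       # (B, HW, k)

        # Gather the matched values and aggregate them.
        batch_idx = torch.arange(b, device=z_n.device)[:, None, None]
        v_sel = v[batch_idx, top_idx]                             # (B, HW, k, C)
        fused = (weights.unsqueeze(-1) * v_sel).sum(dim=2)        # (B, HW, C)

        # Residual update of the current-frame latent.
        return z_n + fused.transpose(1, 2).reshape(b, c, h, w)

# Usage with toy latents; the previous latent is assumed pre-aligned by optical flow.
cfr = CrossFrameRetrieval(channels=4, topk=4)
z_n, z_prev = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
print(cfr(z_n, z_prev).shape)  # torch.Size([1, 4, 64, 64])
```

Restricting the aggregation to the top-$k$ most similar positions keeps the cross-frame contribution focused on reliable matches rather than averaging over the whole previous frame.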

The overall training pipeline is visualized in Figure 2 of the paper, which illustrates the dual-stage training of CFR, C-LoRA, and D-LoRA.
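
As a rough illustration of the alternating schedule (item 4), the sketch below toggles which parameter groups are trainable in each stage and applies the one-step residual refinement from item 1. It is not the authors' code: every module, loss term, and weight is a stand-in placeholder.

```python
import itertools
import torch
import torch.nn as nn

# Stand-ins for the real components; the SD UNet, VAE, SpyNet, LPIPS network,
# and CSD critic are all reduced to tiny convolutions or constants here.
class Stub(nn.Module):
    def __init__(self, channels: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return self.conv(x)

cfr, c_lora, d_lora, eps_theta = Stub(), Stub(), Stub(), Stub()

def set_trainable(modules, flag: bool):
    """Freeze or unfreeze a group of modules."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad_(flag)

set_trainable([eps_theta], False)  # the SD backbone stays frozen throughout

def training_loss(pred, target, with_csd: bool):
    """Stand-ins for L_pix, L_lpips, L_opt and (optionally) L_csd."""
    l_pix = (pred - target).pow(2).mean()    # ell_2 pixel loss
    l_lpips = torch.zeros(())                # placeholder for the LPIPS term
    l_opt = torch.zeros(())                  # placeholder for the optical-flow term
    loss = 1.0 * l_pix + 1.0 * l_lpips + 0.5 * l_opt
    if with_csd:
        loss = loss + 0.1 * torch.zeros(())  # placeholder for the CSD term
    return loss

optimizer = torch.optim.Adam(
    itertools.chain(cfr.parameters(), c_lora.parameters(), d_lora.parameters()),
    lr=5e-5,
)

for step in range(4):
    # Alternate stages: consistency (train CFR + C-LoRA) vs. detail (train D-LoRA).
    consistency_stage = (step // 2) % 2 == 0
    set_trainable([cfr, c_lora], consistency_stage)
    set_trainable([d_lora], not consistency_stage)

    z_lq = torch.randn(1, 4, 64, 64)    # current-frame LQ latent (toy data)
    z_prev = torch.randn(1, 4, 64, 64)  # flow-aligned previous-frame LQ latent
    z_gt = torch.randn(1, 4, 64, 64)    # ground-truth latent (toy target)

    z_bar = z_lq + cfr(z_prev)          # stand-in for the CFR fusion
    # One-step residual refinement z_HQ = z_bar - eps_theta(z_bar); real LoRA
    # branches perturb the UNet weights, but additive stubs keep this short.
    z_hq = z_bar - (eps_theta(z_bar) + c_lora(z_bar) + d_lora(z_bar))

    loss = training_loss(z_hq, z_gt, with_csd=not consistency_stage)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: stage={'consistency' if consistency_stage else 'detail'}, loss={loss.item():.4f}")
```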

  5. Inference: During inference, both C-LoRA and D-LoRA are merged into the main SD UNet. The model processes LQ video frames (current frame $I_n^{LQ}$ and preceding frame $I_{n-1}^{LQ}$) using the CFR module and then the enhanced UNet in a single diffusion step to produce the HQ frame $I_n^{HQ}$. This sliding-window procedure is applied across the video sequence.
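
The two steps that make inference cheap, folding the LoRA updates into the base weights and running a single residual refinement per frame in a sliding window, can be sketched as follows. `merge_lora`, `restore_video`, and the toy fuse/refine functions are hypothetical stand-ins; the VAE encode/decode and SpyNet alignment are omitted.

```python
import torch
import torch.nn as nn

def merge_lora(base: nn.Linear, lora_A: torch.Tensor, lora_B: torch.Tensor, scale: float = 1.0):
    """Fold a low-rank update W + scale * (B @ A) into the base weight so that
    inference costs nothing extra. Shapes: A is (r, in), B is (out, r)."""
    with torch.no_grad():
        base.weight += scale * (lora_B @ lora_A)
    return base

# Toy layer with two LoRA branches (C-LoRA and D-LoRA), both merged before inference.
layer = nn.Linear(320, 320, bias=False)
rank = 4
for _ in ("C-LoRA", "D-LoRA"):
    A = torch.randn(rank, 320) * 0.01
    B = torch.randn(320, rank) * 0.01
    merge_lora(layer, A, B)

def restore_video(frames_lq, refine, fuse):
    """Sliding-window, one-step restoration: each output frame is produced from
    the current LQ latent fused with its (aligned) predecessor."""
    outputs = []
    prev = frames_lq[0]                    # first frame is fused with itself
    for z_lq in frames_lq:
        z_bar = fuse(z_lq, prev)           # CFR-style fusion with previous frame
        z_hq = z_bar - refine(z_bar)       # one-step residual refinement
        outputs.append(z_hq)
        prev = z_lq
    return outputs

# Usage with toy latents and stand-in fuse/refine functions.
frames = [torch.randn(1, 4, 64, 64) for _ in range(3)]
fuse = lambda cur, prev: 0.5 * (cur + prev)   # placeholder for CFR
refine = lambda z: torch.zeros_like(z)        # placeholder for the merged UNet
restored = restore_video(frames, refine, fuse)
print(len(restored), restored[0].shape)       # 3 torch.Size([1, 4, 64, 64])
```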

Implementation Details:

  • Backbone: Pre-trained Stable Diffusion V2.1.
  • Training: Batch size 16, sequence length 3, resolution $512 \times 512$, on 4 NVIDIA A100 GPUs. Adam optimizer with learning rate $5 \times 10^{-5}$.
  • Datasets:
    • Consistency Stage: REDS dataset and curated videos from Pexels (44,162 frames).
    • Enhancement Stage: LSDIR dataset, with simulated video sequences generated by random pixel-level translations.
    • Degradation: RealESRGAN degradation pipeline (blur, noise, downsampling, compression).
  • Testing Datasets: UDM10, SPMCS (synthetic), RealVSR, VideoLQ (real-world).
  • Evaluation Metrics: PSNR, SSIM, LPIPS, DISTS, MUSIQ, MANIQA, CLIPIQA, DOVER, and average warping error ($E^*_{warp}$) for temporal consistency.
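
For reference, a warping-error metric of this kind is commonly computed by flow-warping each next frame back onto the current one and averaging the residual. The exact definition of $E^*_{warp}$ used in the paper (including occlusion masking) may differ from the simplified sketch below, and the helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` (B, C, H, W) with a dense flow field (B, 2, H, W)
    given in pixels, using bilinear grid sampling."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype),
        torch.arange(w, dtype=frame.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)  # (B, H, W, 2)
    new = grid + flow.permute(0, 2, 3, 1)
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    new_x = 2.0 * new[..., 0] / (w - 1) - 1.0
    new_y = 2.0 * new[..., 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((new_x, new_y), dim=-1),
                         mode="bilinear", align_corners=True)

def warping_error(frames, flows):
    """Average per-pixel error between each frame and its flow-warped successor
    (occlusion masking, as used in the full metric, is omitted here)."""
    errs = [
        (frames[t] - warp(frames[t + 1], flows[t])).abs().mean()
        for t in range(len(frames) - 1)
    ]
    return torch.stack(errs).mean()

# Usage with toy data: a static 3-frame clip and zero flow (e.g., from SpyNet or RAFT).
frame = torch.rand(1, 3, 64, 64)
frames = [frame.clone() for _ in range(3)]
flows = [torch.zeros(1, 2, 64, 64) for _ in range(2)]
print(warping_error(frames, flows))  # ~0 for a static clip with zero flow
```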

Results and Contributions:

  • Performance: DLoRAL achieves state-of-the-art performance on Real-VSR benchmarks, outperforming existing methods in perceptual quality (LPIPS, DISTS, MUSIQ, CLIPIQA, MANIQA, DOVER) while maintaining good temporal consistency ($E^*_{warp}$).
  • Efficiency: Due to the one-step diffusion and LoRA integration, DLoRAL is significantly faster (e.g., ~10x faster than Upscale-A-Video and MGLD-VSR) and has a comparable number of parameters to other efficient methods like OSEDiff.
  • Qualitative Results: Visual comparisons show DLoRAL produces sharper details, better facial reconstruction, and more legible textures compared to other methods, while temporal profiles indicate smoother transitions.
  • User Study: DLoRAL was overwhelmingly preferred by human evaluators (93 out of 120 votes) against three other diffusion-based Real-VSR methods for its balance of perceptual quality and temporal consistency.

Main Contributions:

  1. A Dual LoRA Learning (DLoRAL) paradigm for Real-VSR that decouples temporal consistency and spatial detail learning into two dedicated LoRA modules within a one-step diffusion framework.
  2. A Cross-Frame Retrieval (CFR) module to extract degradation-robust temporal priors from LQ inputs, guiding both C-LoRA and D-LoRA training.
  3. State-of-the-art performance in Real-VSR, achieving both realistic details and temporal stability efficiently.

Limitations:

  • The $8\times$ downsampling VAE inherited from SD makes it difficult to restore very fine-scale details (e.g., small text).
  • The VAE's heavy compression might disrupt temporal coherence, making robust consistency prior extraction harder. The authors suggest a VAE specifically designed for Real-VSR could mitigate this.

In essence, DLoRAL provides a practical and effective approach to Real-VSR by cleverly managing the conflicting objectives of detail enhancement and temporal consistency through a dual LoRA architecture and a staged training strategy, all while ensuring efficient inference via a one-step diffusion process.
