
FlashVSR: Real-Time Diffusion VSR

Updated 9 March 2026
  • FlashVSR is a diffusion-based framework designed to deliver real-time video super-resolution by minimizing latency and reducing computational cost.
  • It integrates a distilled diffusion transformer, locality-constrained sparse attention, and a tiny conditional decoder to achieve high-fidelity results at ultra-high resolutions.
  • Trained on the large-scale VSR-120K dataset, FlashVSR attains state-of-the-art PSNR, SSIM, and LPIPS metrics while scaling efficiently on commodity hardware.

FlashVSR is a diffusion-based framework for real-time video super-resolution (VSR) that addresses latency, computational cost, and poor generalization at ultra-high resolutions. It introduces a fast, one-step streaming pipeline utilizing distilled diffusion transformers, a sparse attention mechanism specifically tailored for large spatial domains, and a compact conditional decoder, all trained on a newly constructed large-scale dataset. This architecture achieves state-of-the-art performance metrics and speed, scaling efficiently to ultra-high resolutions on commodity hardware (Zhuang et al., 14 Oct 2025).

1. System Overview and Objectives

FlashVSR is designed to enable diffusion-based VSR in a real-time, streaming setting at resolutions of 1440p and beyond, directly overcoming three critical obstacles:

  • Look-ahead latency from chunk-wise inference,
  • Quadratic scaling of dense 3D attention in space-time,
  • Train–test resolution gaps with conventional positional encoding.

The framework leverages a one-step “distilled” diffusion transformer (DiT) model, which predicts high-frequency latent representations from Gaussian noise and the current low-resolution (LR) frame, maintaining temporal continuity via a causal key-value (KV) cache. FlashVSR incorporates three technical innovations: a three-stage distillation pipeline (Section 3), locality-constrained sparse attention (Section 4), and a tiny conditional decoder (Section 5).

Training is conducted on VSR-120K, a new dataset containing 120,000 video clips and 180,000 high-quality images.

2. One-Step Diffusion-Based Streaming Pipeline

The method reduces conventional multi-step diffusion to a single-step denoising operation for each video frame. The one-step denoiser $G_{\mathrm{one}}$ operates causally, producing a latent $z_t$ for frame $t$ via:

$$z_t = G_{\mathrm{one}}(\mathrm{LR}_t, \epsilon_t; \mathrm{KV}_{<t})$$

where $\mathrm{LR}_t$ is the current LR frame, $\epsilon_t \sim \mathcal{N}(0, I)$ is fresh noise, and $\mathrm{KV}_{<t}$ caches temporal latent context.
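The causal update above can be illustrated with a small streaming loop. This is a minimal sketch, not the FlashVSR implementation: `g_one_stub` is a hypothetical stand-in for the distilled DiT, the cache is a plain Python list, and the mixing weights are arbitrary. It only shows that each latent depends on the current LR frame, fresh noise, and context from earlier frames.

```python
import numpy as np

def g_one_stub(lr_frame, eps, kv_cache):
    """Hypothetical stand-in for G_one: mixes the LR frame, noise, and cached context."""
    context = np.mean(kv_cache, axis=0) if kv_cache else np.zeros_like(lr_frame)
    z_t = lr_frame + 0.1 * eps + 0.5 * context
    return z_t, z_t  # the latent, and the entry cached for later frames

def stream_latents(lr_frames, seed=0):
    """Run z_t = G_one(LR_t, eps_t; KV_{<t}) frame by frame in causal order."""
    rng = np.random.default_rng(seed)
    kv_cache = []   # holds context from frames < t only
    latents = []
    for lr_t in lr_frames:
        eps_t = rng.standard_normal(lr_t.shape)  # fresh noise for every frame
        z_t, kv_t = g_one_stub(lr_t, eps_t, kv_cache)
        latents.append(z_t)
        kv_cache.append(kv_t)  # becomes visible starting with frame t + 1
    return latents

frames = [np.full((4, 4), float(t)) for t in range(3)]
latents = stream_latents(frames)
```

Because the cache only ever contains earlier frames, changing a later frame leaves all earlier latents untouched, which is the property that allows frame-by-frame streaming.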

The composite loss for training the one-step model is:

$$\mathcal{L} = \underbrace{\mathcal{L}_{\mathrm{DMD}}(z_{\mathrm{pred}}, G_{\mathrm{one}}, G_{\mathrm{real}}, G_{\mathrm{fake}})}_{\text{Distribution-matching distillation}} + \underbrace{\mathcal{L}_{\mathrm{FM}}(z_{\mathrm{pred}}, G_{\mathrm{fake}})}_{\text{Flow matching}} + \underbrace{\|x_{\mathrm{pred}} - x_{\mathrm{gt}}\|_2^2 + \lambda \mathcal{L}_{\mathrm{LPIPS}}(x_{\mathrm{pred}}, x_{\mathrm{gt}})}_{\text{Decoder reconstruction}}$$

with $\lambda = 2$. Here, DMD aligns the student DiT latent distribution with the full-attention teacher, FM enforces consistency of score fields, and the decoder loss ensures fidelity to high-resolution ground truth.

Inference is executed in a single pass per frame: $G_{\mathrm{one}} \rightarrow z_t$, then TC Decoder $\rightarrow x_{\mathrm{pred}}$.

3. Three-Stage Distillation Pipeline

The distillation process comprises the following stages:

  • Stage 1: Image–Video Joint Full-Attention Teacher
    • Uses a large pretrained DiT (WAN 2.1-1.3B), jointly trained on K-frame video and single images (as 1-frame clips) with full spatiotemporal attention.
    • Applies a block-diagonal segment mask in cross-attention:

      $$\alpha_{ij} = \frac{\exp(q_i k_j^\top/\sqrt{d})\,\mathbf{1}[\mathrm{seg}(i)=\mathrm{seg}(j)]}{\sum_l \exp(q_i k_l^\top/\sqrt{d})\,\mathbf{1}[\mathrm{seg}(i)=\mathrm{seg}(l)]}$$

    • Loss: flow matching on latents and standard decoder reconstruction.
  • Stage 2: Sparse-Causal Streaming Adaptation
    • Converts the full-attention model to streaming by introducing a causal mask and block-sparse attention, partitioning tokens into (T=2, H=8, W=8) blocks and selecting the top-$k$ attention blocks per query.
    • Attention cost is reduced to 10–20% of the full model.
  • Stage 3: One-Step Distribution-Matching Distillation
    • Distills the sparse-causal model into $G_{\mathrm{one}}$, minimizing both distributional and flow-matching losses, plus final decoder error.
    • Training operates in parallel across frames, closing the train–test gap.
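The block-diagonal segment mask used in Stage 1 can be sketched in NumPy as follows. This is an illustrative toy under stated assumptions: the function name, the single-head layout, and the random inputs are hypothetical, and only the masking rule (token i attends to token j only when both belong to the same segment) is taken from the text.

```python
import numpy as np

def segment_masked_attention(q, k, v, seg):
    """Softmax attention where token i only attends to tokens j with seg[i] == seg[j]."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    same_seg = seg[:, None] == seg[None, :]
    scores = np.where(same_seg, scores, -np.inf)   # masked pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)   # stabilise the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Six tokens: a 3-token clip, a 2-token clip, and a single image token.
rng = np.random.default_rng(0)
seg = np.array([0, 0, 0, 1, 1, 2])
q, k, v = (rng.standard_normal((6, 8)) for _ in range(3))
out, weights = segment_masked_attention(q, k, v, seg)
```

Every token shares a segment with itself, so each row always has at least one allowed key and the normalisation is well defined.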

4. Locality-Constrained Sparse Attention

To address artifacts caused by repeated positional encodings at test-time resolutions that exceed training bounds, locality-constrained sparse attention restricts attention by spatial proximity, in addition to block-level top-$k$ selection. For each query token $p_i$, keys within a fixed window radius $R$ are attended:

$$M_{ij} = \begin{cases} 1, & \text{if } \mathrm{block}(j) \in \mathrm{TopK}(\mathrm{block}(i)) \wedge \|p_i - p_j\| \leq R \\ 0, & \text{otherwise} \end{cases}$$

Normalized attention:

$$\alpha_{ij} = \frac{\exp(q_i k_j^\top/\sqrt{d})\, M_{ij}}{\sum_l \exp(q_i k_l^\top/\sqrt{d})\, M_{il}}$$

Two handling strategies for boundary queries are used: Boundary-Truncated (dropping queries outside the window) and Boundary-Preserved (clamping to bounds). Complexity is $O(N \cdot (b \cdot k + \text{window} \cdot b))$, where $N$ is the token count, $b$ the block size, and $k$ the number of top blocks attended, yielding a 5–10× reduction in 3D attention cost.
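The mask $M_{ij}$ and the normalized attention can be sketched on a toy 2D token grid. Several details here are assumptions made for illustration and are not taken from the source: block pairs are scored by the dot product of mean-pooled queries and keys, distances are Euclidean over grid coordinates, and queries with no admissible key simply receive zero weights. Boundary handling is not modelled.

```python
import numpy as np

def locality_sparse_mask(q, k, pos, block_id, top_k, radius):
    """M[i, j] is True iff block(j) is in TopK(block(i)) and ||p_i - p_j|| <= radius."""
    n_blocks = int(block_id.max()) + 1
    # Assumed block scoring rule: dot product of mean-pooled queries and keys.
    q_blk = np.stack([q[block_id == b].mean(axis=0) for b in range(n_blocks)])
    k_blk = np.stack([k[block_id == b].mean(axis=0) for b in range(n_blocks)])
    top = np.argsort(-(q_blk @ k_blk.T), axis=1)[:, :top_k]
    blk_keep = np.zeros((n_blocks, n_blocks), dtype=bool)
    np.put_along_axis(blk_keep, top, True, axis=1)
    in_topk = blk_keep[block_id[:, None], block_id[None, :]]
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    return in_topk & (dist <= radius)

def masked_attention(q, k, v, mask):
    """Softmax over allowed keys only; rows with no allowed key get zero weights."""
    scores = np.where(mask, q @ k.T / np.sqrt(q.shape[-1]), -np.inf)
    row_max = scores.max(axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    w = np.exp(scores - row_max)
    denom = w.sum(axis=1, keepdims=True)
    w = np.divide(w, denom, out=np.zeros_like(w), where=denom > 0)
    return w @ v, w

# Toy 4x4 token grid split into four 2x2 blocks.
ys, xs = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
block_id = (ys.ravel() // 2) * 2 + (xs.ravel() // 2)
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
mask = locality_sparse_mask(q, k, pos, block_id, top_k=2, radius=1.5)
out, weights = masked_attention(q, k, v, mask)
```

The two conditions compose as a logical AND, so the radius can only remove keys that top-k selection admitted; it never adds keys outside the selected blocks.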

5. Tiny Conditional Decoder (TC Decoder)

After distillation, FlashVSR replaces the original 3D VAE decoder, which accounted for about 70% of runtime, with a lightweight TC Decoder. This module conditions on the latent $z_t$ at 1/8 spatial resolution together with LR frame features, and uses a shallow UNet-style architecture with pixel-shuffle upsampling to reach the target resolution. The resulting FlashVSR-Tiny model has about 1.75B parameters in total (the comparison in Section 7 lists 3.4B for SeedVR2-3B), and decode time drops to roughly 1/7 of the original decoder's. Conditioning on the LR frame yields higher accuracy than a purely latent-conditioned “tiny” decoder.
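Pixel-shuffle upsampling, mentioned above, trades channels for spatial resolution. The following NumPy sketch shows the standard depth-to-space rearrangement for a single feature map; the function name and the toy shapes are illustrative, and the source does not specify how many such stages the TC Decoder uses or with which factors.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r): out[c, h*r+i, w*r+j] = x[c*r*r + i*r + j, h, w]."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # reorder to (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

features = np.arange(4 * 2 * 3).reshape(4, 2, 3)
upsampled = pixel_shuffle(features, r=2)   # 4 channels become 1 channel at 2x resolution
```

No values are created or discarded: the operation only relocates them, which keeps the upsampling step cheap compared with transposed convolutions.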

The loss for TC Decoder training is:

$$\mathcal{L}_{\mathrm{TC}} = \|x_{\mathrm{pred}} - x_{\mathrm{gt}}\|_2^2 + \lambda \mathcal{L}_{\mathrm{LPIPS}}(x_{\mathrm{pred}}, x_{\mathrm{gt}}) + \|x_{\mathrm{pred}} - x_{\mathrm{wan}}\|_2^2 + \lambda \mathcal{L}_{\mathrm{LPIPS}}(x_{\mathrm{pred}}, x_{\mathrm{wan}})$$

with $\lambda = 2$, where $x_{\mathrm{wan}}$ is the output of the original Wan decoder, so the TC Decoder is supervised by both the ground truth and the decoder it replaces.
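The structure of this loss can be written as a small function. This is a sketch under assumptions: `lpips_fn` is a hypothetical callable standing in for a real LPIPS network, the placeholder used in the demo is a mean absolute difference and is not a perceptual metric, and the squared L2 term is implemented as a plain sum of squares, whereas practical training code often averages instead.

```python
import numpy as np

def tc_decoder_loss(x_pred, x_gt, x_wan, lpips_fn, lam=2.0):
    """Squared-L2 plus weighted perceptual terms against both the ground truth and the Wan output."""
    def sq_l2(a, b):
        return float(np.sum((a - b) ** 2))
    to_gt = sq_l2(x_pred, x_gt) + lam * lpips_fn(x_pred, x_gt)
    to_wan = sq_l2(x_pred, x_wan) + lam * lpips_fn(x_pred, x_wan)
    return to_gt + to_wan

def placeholder_perceptual(a, b):
    """Hypothetical stand-in for LPIPS so the sketch runs without a pretrained network."""
    return float(np.mean(np.abs(a - b)))

x_pred = np.zeros((2, 2))
x_gt = np.ones((2, 2))
x_wan = np.zeros((2, 2))
loss = tc_decoder_loss(x_pred, x_gt, x_wan, placeholder_perceptual)
```

In this toy case the prediction matches the Wan target exactly, so only the ground-truth terms contribute to the total.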

6. Dataset Construction and Training Procedure

VSR-120K, a new large-scale training dataset, contains 120K video clips (averaging more than 350 frames each) and 180K images above 1080p, sourced from Videvo, Pexels, and Pixabay. Filtering uses LAION-Aesthetic and MUSIQ metrics for quality, combined with RAFT optical flow to verify motion presence. Data degradation for robust VSR is applied following RealBasicVSR (blur, noise, compression).

Training uses 32 A100-80GB GPUs with AdamW (learning rate $1 \times 10^{-5}$, weight decay 0.01, batch size 32). LoRA adapters (rank 384) are used to fine-tune the WAN 2.1 weights. Stage durations: Stage 1, 2 days; Stage 2, 1 day; Stage 3, 2 days; TC Decoder, 2 days.
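For readers unfamiliar with LoRA, the sketch below shows the general low-rank adaptation idea on a single linear layer. It is not taken from the FlashVSR code: the class name, the layer width of 1536, and the initialisation scale are hypothetical choices, and only the rank of 384 comes from the text.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, w_frozen, rank, alpha=None, seed=0):
        d_out, d_in = w_frozen.shape
        rng = np.random.default_rng(seed)
        self.w = w_frozen                                   # frozen pretrained weight
        self.a = rng.standard_normal((rank, d_in)) * 0.01   # trainable, small random init
        self.b = np.zeros((d_out, rank))                    # trainable, zero init
        self.scale = (alpha if alpha is not None else rank) / rank

    def __call__(self, x):
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T

    def trainable_params(self):
        return self.a.size + self.b.size

d_model = 1536  # hypothetical layer width, chosen only for illustration
w = np.random.default_rng(1).standard_normal((d_model, d_model))
layer = LoRALinear(w, rank=384)
x = np.ones((2, d_model))
y = layer(x)
```

Because `b` starts at zero, the adapted layer initially reproduces the frozen layer exactly. With these illustrative sizes the adapter holds half as many parameters as the full matrix, since 2 × 384 × 1536 is half of 1536 × 1536.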

7. Quantitative Benchmarks and Implementation

FlashVSR achieves state-of-the-art PSNR, SSIM, and LPIPS metrics across standard synthetic (YouHQ40, REDS, SPMCS) and real (VideoLQ, AIGC30) datasets. For instance, on YouHQ40, PSNR = 24.39, SSIM = 0.6651, and LPIPS = 0.3866 for FlashVSR-Tiny.

A comparative summary:

| Method | Peak Mem (GB) | Runtime (s) | FPS | Params (M) |
|---|---|---|---|---|
| Upscale-A-Video | 18.4 | 811.7 | 0.12 | 1087 |
| STAR | 24.9 | 682.5 | 0.15 | 2493 |
| DOVE | 25.4 | 72.8 | 1.39 | 10,549 |
| SeedVR2-3B | 52.9 | 70.6 | 1.43 | 3391 |
| Ours-Full | 18.3 | 15.5 | 6.52 | 1780 |
| Ours-Tiny | 11.1 | 5.97 | 16.92 | 1752 |
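The relative speedups implied by the comparison table can be checked with a few lines of arithmetic. The runtime and FPS values are copied from the table; the helper name and the derived quantities are illustrative, and the implied frame count is an inference from the numbers rather than something the source states.

```python
# (runtime in seconds, FPS) copied from the comparison table above.
benchmarks = {
    "Upscale-A-Video": (811.7, 0.12),
    "STAR": (682.5, 0.15),
    "DOVE": (72.8, 1.39),
    "SeedVR2-3B": (70.6, 1.43),
    "Ours-Full": (15.5, 6.52),
    "Ours-Tiny": (5.97, 16.92),
}

def speedup_over(baseline, method):
    """How many times faster `method` is than `baseline`, by wall-clock runtime."""
    return benchmarks[baseline][0] / benchmarks[method][0]

# Runtime multiplied by FPS gives the number of frames each measurement implies.
implied_frames = {name: rt * fps for name, (rt, fps) in benchmarks.items()}
tiny_vs_seedvr2 = speedup_over("SeedVR2-3B", "Ours-Tiny")
tiny_vs_full = speedup_over("Ours-Full", "Ours-Tiny")
```

By runtime, FlashVSR-Tiny is about 11.8 times faster than SeedVR2-3B and about 2.6 times faster than the full FlashVSR variant. The implied frame counts all fall near 100, which suggests a common clip length across methods, although the source does not state it.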

Look-ahead latency is reduced to 8 frames, compared with 32 frames for STAR. FlashVSR-Tiny attains up to 17 FPS at 768×1408 on a single A100 GPU, with a peak memory of 11.1 GB.

Ablation studies indicate:

  • Sparse attention (13.6% density) keeps PSNR close to dense attention (24.11 vs 24.65 on REDS) with a 3.1× speedup.
  • The TC Decoder offers a 7× speedup over the Wan decoder at a modest quality cost (PSNR 31.08 vs 32.58; LPIPS 0.1014 vs 0.0715).
  • Boundary-Preserved locality-constrained attention yields highest fidelity (PSNR 24.87, SSIM 0.7232, LPIPS 0.3304).

Implementation advisories include matching the local window scale to the training positional range, using a block size of 128 for sparse attention (consistent with the 2 × 8 × 8 = 128 tokens per block in Stage 2), and adopting RealBasicVSR degradation to ensure real-world robustness.

8. Practical Considerations and Reproducibility

Default KV-cache eviction employs a sliding window across the last $L$ frames, as more sophisticated importance-based schemes underperform. Cross-attention with text prompts is disabled via a fixed prompt to avoid computational overhead. All model, code, and dataset resources, as well as detailed hyperparameter settings, are released to support reproducibility. Important practices include careful calibration of local window and batch size, consistent use of LoRA for efficient adaptation, and reliance on real-world degradation pipelines during training to ensure domain generalization.
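Sliding-window eviction of this kind is straightforward to express with a bounded queue. The class below is a hypothetical sketch, not the released implementation: its name, its methods, and the use of strings in place of key and value tensors are all illustrative.

```python
from collections import deque

class SlidingKVCache:
    """Keeps key/value entries for the most recent `max_frames` frames only."""

    def __init__(self, max_frames):
        # A deque with maxlen drops its oldest entry automatically when full.
        self._entries = deque(maxlen=max_frames)

    def append(self, frame_idx, keys, values):
        self._entries.append((frame_idx, keys, values))

    def frame_indices(self):
        return [entry[0] for entry in self._entries]

    def __len__(self):
        return len(self._entries)

cache = SlidingKVCache(max_frames=3)
for t in range(5):
    cache.append(t, f"keys_{t}", f"values_{t}")  # strings stand in for tensors
```

After five frames with a window of three, only frames 2, 3, and 4 remain, so memory stays constant regardless of video length.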

In summary, FlashVSR unifies a one-step distilled diffusion transformer, temporally causal streaming inference, locality-constrained sparse attention, and a highly efficient conditional decoder, delivering SOTA VSR quality at near real-time frame rates on single-GPU hardware and robust scalability to ultra-high resolutions (Zhuang et al., 14 Oct 2025).
