FlashVSR: Real-Time Diffusion VSR

Updated 9 March 2026

FlashVSR is a diffusion-based framework designed to deliver real-time video super-resolution by minimizing latency and reducing computational cost.
It integrates a distilled diffusion transformer, locality-constrained sparse attention, and a tiny conditional decoder to achieve high-fidelity results at ultra-high resolutions.
Trained on the large-scale VSR-120K dataset, FlashVSR attains state-of-the-art PSNR, SSIM, and LPIPS metrics while scaling efficiently on commodity hardware.

FlashVSR is a diffusion-based framework for real-time video super-resolution (VSR) that addresses latency, computational cost, and poor generalization at ultra-high resolutions. It introduces a fast, one-step streaming pipeline utilizing distilled diffusion transformers, a sparse attention mechanism specifically tailored for large spatial domains, and a compact conditional decoder, all trained on a newly constructed large-scale dataset. This architecture achieves state-of-the-art performance metrics and speed, scaling efficiently to ultra-high resolutions on commodity hardware (Zhuang et al., 14 Oct 2025).

1. System Overview and Objectives

FlashVSR is designed to run diffusion-based VSR in a real-time, streaming setting at resolutions of 1440p and above. It targets three obstacles:

- Look-ahead latency from chunk-wise inference.
- Quadratic scaling of dense 3D spatiotemporal attention.
- Train–test resolution gaps that arise with conventional positional encodings.

The framework leverages a one-step “distilled” diffusion transformer (DiT) model, which predicts high-frequency latent representations from Gaussian noise and the current low-resolution (LR) frame, maintaining temporal continuity via a causal key-value (KV) cache. FlashVSR incorporates three technical innovations:

- A three-stage distillation pipeline that enables one-step, high-fidelity streaming super-resolution.
- Locality-constrained sparse attention for efficient, scalable memory and compute.
- A tiny conditional decoder (TC Decoder) for efficient reconstruction.

Training is conducted on VSR-120K, a new dataset containing 120,000 video clips and 180,000 high-quality images.

2. One-Step Diffusion-Based Streaming Pipeline

The method reduces conventional multi-step diffusion to a single-step denoising operation for each video frame. The one-step denoiser $G_{\mathrm{one}}$ operates causally, producing a latent $z_t$ for frame $t$ via:

$z_t = G_{\mathrm{one}}(\mathrm{LR}_t, \epsilon_t; \mathrm{KV}_{<t})$

where $\mathrm{LR}_t$ is the current LR frame, $\epsilon_t \sim \mathcal{N}(0, I)$ is fresh noise, and $\mathrm{KV}_{<t}$ caches temporal latent context.

The composite loss for training the one-step model is:

$\mathcal{L} = \underbrace{\mathcal{L}_{\mathrm{DMD}}(z_{\mathrm{pred}}, G_{\mathrm{one}}, G_{\mathrm{real}}, G_{\mathrm{fake}})}_{\text{Distribution-matching distillation}} + \underbrace{\mathcal{L}_{\mathrm{FM}}(z_{\mathrm{pred}}, G_{\mathrm{fake}})}_{\text{Flow matching}} + \underbrace{\|x_{\mathrm{pred}} - x_{\mathrm{gt}}\|_2^2 + \lambda \mathcal{L}_{\mathrm{LPIPS}}(x_{\mathrm{pred}}, x_{\mathrm{gt}})}_{\text{Decoder reconstruction}}$

with $\lambda = 2$. Here, DMD aligns the student DiT latent distribution with the full-attention teacher, FM enforces consistency of score fields, and the decoder loss ensures fidelity to high-resolution ground truth.

Inference is executed in a single pass per frame: $G_{\mathrm{one}} \rightarrow z_t$, then TC Decoder $\rightarrow x_{\mathrm{pred}}$.
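The per-frame data flow can be sketched as a simple causal loop. This is a minimal illustration rather than the released implementation: `stream_super_resolve`, `toy_denoiser`, and `toy_decoder` are hypothetical names, frames are plain lists of floats, and the cache stores previous latents as a stand-in for the real key-value tensors.

```python
import random
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for an image or latent tensor


def stream_super_resolve(
    lr_frames: Sequence[Frame],
    denoiser: Callable[[Frame, Frame, List[Frame]], Frame],
    decoder: Callable[[Frame, Frame], Frame],
    seed: int = 0,
) -> List[Frame]:
    """Run one denoising pass and one decode per frame, in temporal order."""
    rng = random.Random(seed)
    kv_cache: List[Frame] = []  # context from frames < t only (causal)
    outputs: List[Frame] = []
    for lr in lr_frames:
        noise = [rng.gauss(0.0, 1.0) for _ in lr]  # epsilon_t ~ N(0, I)
        z_t = denoiser(lr, noise, kv_cache)        # z_t = G_one(LR_t, eps_t; KV_<t)
        outputs.append(decoder(z_t, lr))           # TC Decoder -> x_pred
        kv_cache.append(z_t)                       # becomes context for frame t + 1
    return outputs


def toy_denoiser(lr: Frame, noise: Frame, cache: List[Frame]) -> Frame:
    # Hypothetical stand-in: blends the LR frame with the previous latent.
    # It ignores the noise so that the example is deterministic.
    prev = cache[-1] if cache else [0.0] * len(lr)
    return [0.5 * a + 0.5 * p for a, p in zip(lr, prev)]


def toy_decoder(z: Frame, lr: Frame) -> Frame:
    # Hypothetical stand-in: conditions on both the latent and the LR frame.
    return [zi + li for zi, li in zip(z, lr)]


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = stream_super_resolve(frames, toy_denoiser, toy_decoder)
print(result)
```

Each output depends only on the current frame and on cached context from earlier frames, which is what allows frames to be emitted as they arrive.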

3. Three-Stage Distillation Pipeline

The distillation process comprises the following stages:

Stage 1: Image–Video Joint Full-Attention Teacher
- Uses a large pretrained DiT (WAN 2.1-1.3B), jointly trained on K-frame video and single images (as 1-frame clips) with full spatiotemporal attention.
- Applies a block-diagonal segment mask in cross-attention:
$\alpha_{ij} = \frac{\exp(q_i k_j^\top/\sqrt{d})\,\mathbf{1}[\mathrm{seg}(i)=\mathrm{seg}(j)]}{\sum_l \exp(q_i k_l^\top/\sqrt{d})\,\mathbf{1}[\mathrm{seg}(i)=\mathrm{seg}(l)]}$
- Loss: flow matching on latents and standard decoder reconstruction.
Stage 2: Sparse-Causal Streaming Adaptation
- Converts the full-attention model to streaming by introducing a causal mask and block-sparse attention: tokens are partitioned into blocks of size (T, H, W) = (2, 8, 8), i.e. 128 tokens each, and the top-$k$ attention blocks are selected per query.
- Attention cost is reduced to 10–20% of that of the full-attention model.
Stage 3: One-Step Distribution-Matching Distillation
- Distills the sparse-causal model into $G_{\mathrm{one}}$, minimizing both distributional and flow-matching losses, plus final decoder error.
- Training operates in parallel across frames, closing the train–test gap.
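The segment-masked softmax from Stage 1 can be illustrated with a small pure-Python sketch. The function name `segment_masked_attention` and the toy inputs are hypothetical; the sketch assumes every query has at least one key in its own segment, and it computes only the attention weights, not the value aggregation.

```python
import math
from typing import List, Sequence


def segment_masked_attention(
    q: Sequence[Sequence[float]],
    k: Sequence[Sequence[float]],
    seg_q: Sequence[int],
    seg_k: Sequence[int],
) -> List[List[float]]:
    """Softmax attention weights where query i only sees keys in its own segment."""
    d = len(q[0])
    weights: List[List[float]] = []
    for qi, si in zip(q, seg_q):
        # Scaled dot-product logits; None marks keys from a different segment.
        logits = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) if si == sj else None
            for kj, sj in zip(k, seg_k)
        ]
        valid = [x for x in logits if x is not None]
        if not valid:
            raise ValueError("query has no key in its own segment")
        m = max(valid)  # subtract the max for numerical stability
        exps = [math.exp(x - m) if x is not None else 0.0 for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Tokens 0 and 1 belong to one clip (segment 0); token 2 is a separate image.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
segments = [0, 0, 1]
alpha = segment_masked_attention(tokens, tokens, segments, segments)
for row in alpha:
    print([round(w, 3) for w in row])
```

Masked entries receive exactly zero weight, so clips and single images packed into the same batch do not attend to one another.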

4. Locality-Constrained Sparse Attention

To address artifacts caused by repeated positional encodings at test-time resolutions that exceed the training range, locality-constrained sparse attention restricts attention by spatial proximity in addition to block-level top-$k$ selection. For each query token at position $p_i$, only keys at positions $p_j$ within a fixed window radius $R$ are attended:

$M_{ij} = \begin{cases} 1, & \text{if } \mathrm{block}(j) \in \mathrm{TopK}(\mathrm{block}(i)) \wedge \|p_i - p_j\| \leq R \\ 0, & \text{otherwise} \end{cases}$

Normalized attention:

$\alpha_{ij} = \frac{\exp(q_i k_j^\top/\sqrt{d}) M_{ij}}{\sum_l \exp(q_i k_l^\top/\sqrt{d}) M_{il}}$
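As a rough illustration of how the mask $M_{ij}$ combines the two conditions, the sketch below builds it for a handful of tokens. The function name `locality_mask`, the use of Euclidean distance for $\|p_i - p_j\|$, and the precomputed `topk` table are assumptions made for the example; the source does not specify the norm or how the table is stored.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def locality_mask(
    positions: Sequence[Tuple[float, float]],
    block_of: Sequence[int],
    topk: Dict[int, Set[int]],
    radius: float,
) -> List[List[int]]:
    """M[i][j] = 1 iff key j is in a top-k block for query i AND within the radius."""
    n = len(positions)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        allowed_blocks = topk[block_of[i]]
        for j in range(n):
            in_topk = block_of[j] in allowed_blocks
            near = math.dist(positions[i], positions[j]) <= radius
            mask[i][j] = 1 if (in_topk and near) else 0
    return mask


# Four tokens on a line; tokens 0-1 form block 0 and tokens 2-3 form block 1.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
block_of = [0, 0, 1, 1]
topk = {0: {0, 1}, 1: {1}}  # hypothetical result of block-level top-k selection
M = locality_mask(positions, block_of, topk, radius=1.5)
for row in M:
    print(row)
```

Token 1 may attend to token 2 because block 1 is selected and the two are adjacent, but not to token 3, which lies outside the radius even though its block is selected.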

Two handling strategies for boundary queries are used: Boundary-Truncated (dropping queries outside the window) and Boundary-Preserved (clamping to bounds). Complexity is $O(N \cdot (b \cdot k + \text{window} \cdot b))$, where $N$ is the token count, $b$ the block size, $k$ the number of top blocks attended per query, and "window" the extent of the local window, yielding a 5–10$\times$ reduction in 3D attention cost.

5. Tiny Conditional Decoder (TC Decoder)

After distillation, FlashVSR replaces the original 3D VAE decoder, which accounted for roughly 70% of runtime, with a lightweight TC Decoder. This module conditions on the latent $z_t$ at 1/8 spatial resolution and on LR frame features, using a shallow UNet-style architecture with pixel-shuffle upsampling to reach the target resolution. The total parameter count is about 1.75B, compared with about 3.4B for SeedVR2-3B, and decode time drops to roughly one-seventh of the original. Conditioning on the LR frame yields higher accuracy than a purely latent-conditioned “tiny” decoder.
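Pixel-shuffle upsampling trades channels for spatial resolution: a tensor with $C r^2$ channels at size $H \times W$ is rearranged into $C$ channels at size $rH \times rW$. The sketch below shows the rearrangement on nested lists, using the channel ordering of PyTorch's `PixelShuffle`. The function name and the factor `r = 2` are illustrative; the source does not state the factor used at each decoder stage.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # indexed as [channel][row][column]


def pixel_shuffle(x: Tensor3D, r: int) -> Tensor3D:
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):        # vertical offset inside each r x r output cell
            for j in range(r):    # horizontal offset inside each r x r output cell
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 channels become one 2x2 channel.
x = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
y = pixel_shuffle(x, r=2)
print(y)
```

The operation has no learned parameters, which is part of why it suits a lightweight decoder: the preceding convolution produces the extra channels and the shuffle merely rearranges them.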

The loss for TC Decoder training is:

$\mathcal{L}_{\mathrm{TC}} = \|x_{\mathrm{pred}} - x_{\mathrm{gt}}\|_2^2 + \lambda \mathcal{L}_{\mathrm{LPIPS}}(x_{\mathrm{pred}}, x_{\mathrm{gt}}) + \|x_{\mathrm{pred}} - x_{\mathrm{wan}}\|_2^2 + \lambda \mathcal{L}_{\mathrm{LPIPS}}(x_{\mathrm{pred}}, x_{\mathrm{wan}})$

with $\lambda = 2$, where $x_{\mathrm{wan}}$ denotes the reconstruction produced by the original Wan decoder.
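The structure of this loss, two reconstruction targets each with a squared-error term and a weighted perceptual term, can be sketched as follows. The names `tc_loss` and `l1_distance` are hypothetical, and the L1 distance is only a placeholder for LPIPS, which in practice requires a pretrained network.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared L2 norm of the difference, matching ||a - b||_2^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l1_distance(a: Vector, b: Vector) -> float:
    """Placeholder for LPIPS; a real setup would call a learned perceptual metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


def tc_loss(
    x_pred: Vector,
    x_gt: Vector,
    x_wan: Vector,
    perceptual: Callable[[Vector, Vector], float],
    lam: float = 2.0,
) -> float:
    """Sum of (squared error + lam * perceptual) against ground truth and Wan output."""
    gt_term = squared_l2(x_pred, x_gt) + lam * perceptual(x_pred, x_gt)
    wan_term = squared_l2(x_pred, x_wan) + lam * perceptual(x_pred, x_wan)
    return gt_term + wan_term


loss = tc_loss([1.0, 2.0], [1.0, 3.0], [0.0, 2.0], l1_distance)
print(loss)
```

Including the Wan decoder output as a second target pulls the small decoder toward the behavior of the decoder it replaces, in addition to the ground truth.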

6. Dataset Construction and Training Procedure

VSR-120K, a new large-scale training dataset, contains 120K video clips (averaging more than 350 frames each) and 180K images above 1080p, sourced from Videvo, Pexels, and Pixabay. Quality filtering uses LAION-Aesthetic and MUSIQ scores, combined with RAFT optical flow to verify that clips contain motion. Training degradations follow the RealBasicVSR pipeline (blur, noise, compression).
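A filtering pass of this kind can be sketched as a simple conjunction of thresholds over precomputed scores. Everything specific here is assumed for illustration: the `ClipScores` record, the threshold values, and the idea that scores are computed ahead of time by the respective models. The source does not give the cut-offs that were actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipScores:
    name: str
    aesthetic: float  # e.g. a LAION-Aesthetic predictor score
    musiq: float      # e.g. a MUSIQ image-quality score
    mean_flow: float  # mean optical-flow magnitude, e.g. from RAFT


def keep_clip(
    clip: ClipScores,
    min_aesthetic: float = 5.0,  # hypothetical threshold
    min_musiq: float = 60.0,     # hypothetical threshold
    min_flow: float = 1.0,       # hypothetical threshold
) -> bool:
    """Keep a clip only if it looks good AND actually contains motion."""
    return (
        clip.aesthetic >= min_aesthetic
        and clip.musiq >= min_musiq
        and clip.mean_flow >= min_flow
    )


candidates: List[ClipScores] = [
    ClipScores("sharp_moving", aesthetic=6.1, musiq=72.0, mean_flow=3.4),
    ClipScores("sharp_static", aesthetic=6.3, musiq=75.0, mean_flow=0.1),
    ClipScores("blurry_moving", aesthetic=4.2, musiq=41.0, mean_flow=2.8),
]
kept = [c.name for c in candidates if keep_clip(c)]
print(kept)
```

The motion check matters for video data specifically: a visually clean but static clip carries little temporal information for a VSR model to learn from.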

Training uses 32 A100-80GB GPUs with AdamW (learning rate $1 \times 10^{-5}$, weight decay 0.01, batch size 32). LoRA adapters (rank 384) are used to fine-tune the WAN 2.1 weights. Stage durations: Stage 1, 2 days; Stage 2, 1 day; Stage 3, 2 days; TC Decoder, 2 days.
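To give a feel for what rank 384 means, the arithmetic below compares the trainable parameters of one LoRA adapter (two low-rank factors) with those of the dense weight matrix it adapts. The hidden width of 1536 is an assumed value chosen for illustration; the source does not state the layer dimensions.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the full weight matrix that the adapter modifies."""
    return d_in * d_out


HIDDEN = 1536  # hypothetical layer width, not taken from the source
RANK = 384     # LoRA rank reported in the text

adapter = lora_param_count(HIDDEN, HIDDEN, RANK)
dense = dense_param_count(HIDDEN, HIDDEN)
fraction = adapter / dense
print(adapter, dense, fraction)
```

Under this assumed width, a rank-384 adapter trains half as many parameters as the dense matrix, which is far larger than the single-digit ranks common in lightweight fine-tuning.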

7. Quantitative Benchmarks and Implementation

FlashVSR achieves state-of-the-art PSNR, SSIM, and LPIPS metrics across standard synthetic (YouHQ40, REDS, SPMCS) and real (VideoLQ, AIGC30) datasets. For instance, on YouHQ40, PSNR = 24.39, SSIM = 0.6651, and LPIPS = 0.3866 for FlashVSR-Tiny.

A comparative summary:

Method	Peak Mem (GB)	Runtime (s) / FPS	Params (M)
Upscale-A-Video	18.4	811.7 / 0.12	1087
STAR	24.9	682.5 / 0.15	2493
DOVE	25.4	72.8 / 1.39	10549
SeedVR2-3B	52.9	70.6 / 1.43	3391
Ours-Full	18.3	15.5 / 6.52	1780
Ours-Tiny	11.1	5.97 / 16.92	1752

Look-ahead latency is reduced to 8 frames, compared with 32 frames for STAR. FlashVSR-Tiny reaches about 17 FPS at $768 \times 1408$ on a single A100 GPU, with a peak memory of 11.1 GB.
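The runtime and FPS columns can be cross-checked with a little arithmetic. The sketch below multiplies the two to estimate the length of the benchmark clip, and derives a relative speedup and the time equivalent of the 8-frame look-ahead. The clip length is an inference from the table rather than a figure stated in the text, and the look-ahead conversion assumes frames are processed at the FlashVSR-Tiny rate; a deployment would use the source frame rate instead.

```python
# (runtime in seconds, frames per second), copied from the table above
BENCHMARK = {
    "Upscale-A-Video": (811.7, 0.12),
    "STAR": (682.5, 0.15),
    "DOVE": (72.8, 1.39),
    "SeedVR2-3B": (70.6, 1.43),
    "Ours-Full": (15.5, 6.52),
    "Ours-Tiny": (5.97, 16.92),
}
LOOKAHEAD_FRAMES = 8

# runtime * FPS approximates the number of frames processed in the benchmark
implied_frames = {name: rt * fps for name, (rt, fps) in BENCHMARK.items()}

tiny_runtime, tiny_fps = BENCHMARK["Ours-Tiny"]
speedup_vs_seedvr2 = BENCHMARK["SeedVR2-3B"][0] / tiny_runtime
lookahead_seconds = LOOKAHEAD_FRAMES / tiny_fps

for name, frames in implied_frames.items():
    print(f"{name:16s} ~{frames:6.1f} frames")
print(f"Ours-Tiny vs SeedVR2-3B: {speedup_vs_seedvr2:.1f}x faster")
print(f"8-frame look-ahead at {tiny_fps} FPS: {lookahead_seconds:.2f} s")
```

All rows imply a clip of roughly 100 frames, with the spread attributable to rounding in the FPS column, which suggests the runtimes were measured on the same input.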

Ablation studies indicate:

- Sparse attention (13.6% density) keeps PSNR close to that of dense attention (24.11 vs 24.65 on REDS) with a 3.1$\times$ speedup.
- The TC Decoder offers a 7$\times$ speedup over the Wan decoder at some cost in fidelity (PSNR 31.08 vs 32.58; LPIPS 0.1014 vs 0.0715).
- Boundary-Preserved locality-constrained attention yields the highest fidelity (PSNR 24.87, SSIM 0.7232, LPIPS 0.3304).

Implementation advisories include matching local window scale to training positional range, using block size 128 for sparse attention, and adopting RealBasicVSR degradation to ensure real-world robustness.

8. Practical Considerations and Reproducibility

Default KV-cache eviction employs a sliding window across the last $L$ frames, as more sophisticated importance-based schemes underperform. Cross-attention with text prompts is disabled via a fixed prompt to avoid computational overhead. All model, code, and dataset resources, as well as detailed hyperparameter settings, are released to support reproducibility. Important practices include careful calibration of local window and batch size, consistent use of LoRA for efficient adaptation, and reliance on real-world degradation pipelines during training to ensure domain generalization.
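Sliding-window eviction over the last $L$ frames maps naturally onto a bounded queue. The class below is a minimal sketch with a hypothetical name and interface; the cached entries are opaque placeholders for the per-frame key and value tensors.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class SlidingWindowKVCache:
    """Keeps key-value entries for the most recent `window` frames only."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # A deque with maxlen drops the oldest entry automatically on append.
        self._entries: Deque[Tuple[int, Any]] = deque(maxlen=window)

    def append(self, frame_index: int, kv: Any) -> None:
        self._entries.append((frame_index, kv))

    def frame_indices(self) -> List[int]:
        return [index for index, _ in self._entries]

    def context(self) -> List[Any]:
        """Entries visible to the next frame, oldest first."""
        return [kv for _, kv in self._entries]


cache = SlidingWindowKVCache(window=3)
for t in range(5):
    cache.append(t, f"kv{t}")
print(cache.frame_indices())
```

Because eviction depends only on frame age, memory use stays constant for arbitrarily long videos and no per-entry importance score has to be computed.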

In summary, FlashVSR unifies a one-step distilled diffusion transformer, temporally causal streaming inference, locality-constrained sparse attention, and a highly efficient conditional decoder, delivering SOTA VSR quality at near real-time frame rates on single-GPU hardware and robust scalability to ultra-high resolutions (Zhuang et al., 14 Oct 2025).

References (1)

FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution (2025)