
Stream-DiffVSR: Streaming Video Super-Resolution

Updated 30 December 2025
  • The paper presents Stream-DiffVSR, a diffusion-based video super-resolution framework that reduces multi-step denoising to a single-step process for real-time streaming.
  • It integrates causal LR-projection, block-sparse attention, and a lightweight decoder to achieve significant speedups and competitive perceptual quality.
  • The approach leverages auto-regressive distillation and locality-constrained sparse attention to efficiently handle temporal context and mitigate latency in high-resolution video restoration.

Stream-DiffVSR is a streaming, low-latency, diffusion-based video super-resolution (VSR) framework designed for real-time online deployment, overcoming the substantial inference delay and lookahead requirements typical of diffusion models in video restoration. By strictly conditioning on past frames and integrating innovations such as auto-regressive distillation, block-sparse attention, and an efficient lightweight decoder, Stream-DiffVSR achieves superior perceptual quality and inference speed, making diffusion-based VSR practical for latency-sensitive settings (Shiu et al., 29 Dec 2025, Zhuang et al., 14 Oct 2025).

1. System Architecture and Streaming Pipeline

The Stream-DiffVSR framework is composed of three principal modules, forming a causal streaming pipeline:

  • Causal LR-Projection-In: Ingests each low-resolution (LR) frame independently in real time. Employs causal pixel-shuffle and 3D convolutions to generate a compact feature representation.
  • Block-Sparse Diffusion Transformer (DiT) Backbone: Utilizes sliding-window causal attention on past latent representations, enabling efficient context modeling over substantial temporal ranges. A distillation pipeline compresses inference to a single step.
  • Tiny Conditional (TC) Decoder: Reconstructs each high-resolution (HR) frame from the denoised latent and the corresponding LR input, enabling rapid generation with minimal loss in perceptual fidelity.

At inference time, each new frame $x_{\text{LR},t}$ is combined with fresh Gaussian noise, processed through the single-step denoiser, and then decoded to output $x_{\text{SR},t}$ with minimal delay:

\begin{array}{ccccc}
\cdots & x_{\text{LR},t-1} & x_{\text{LR},t} & x_{\text{LR},t+1} & \cdots \\
 & \downarrow & \downarrow & \downarrow & \\
\cdots & z_{t-1} & z_t & z_{t+1} & \cdots \\
 & \searrow & \searrow & \searrow & \\
\cdots & x_{\text{SR},t-1} & x_{\text{SR},t} & x_{\text{SR},t+1} & \cdots
\end{array}

A KV-cache of past latents and a sliding context window (default 85 frames) are maintained for efficient temporal modeling. The method's end-to-end streaming design enables continuous, frame-aligned output without future-frame dependency (Zhuang et al., 14 Oct 2025).
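To make the Causal LR-Projection-In stage more concrete, below is a minimal PyTorch sketch of a causal per-frame projection built from pixel-unshuffle and a temporally causal 3D convolution. The module name, channel widths, and kernel sizes are illustrative assumptions and are not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalLRProjectionIn(nn.Module):
    """Sketch of a causal LR-projection stage (hypothetical sizes).

    Pixel-unshuffles each LR frame into channel space, then applies a 3D
    convolution that is zero-padded only on the past side of the temporal
    axis, so the output at time t never depends on frames t+1, t+2, ...
    """

    def __init__(self, in_ch=3, feat_ch=64, downscale=4, t_kernel=3):
        super().__init__()
        self.downscale = downscale
        self.t_kernel = t_kernel
        # After pixel-unshuffle, channel count grows by downscale**2.
        self.proj = nn.Conv3d(
            in_ch * downscale ** 2, feat_ch,
            kernel_size=(t_kernel, 3, 3),
            padding=(0, 1, 1),  # temporal padding is handled causally below
        )

    def forward(self, x):
        # x: (B, C, T, H, W) low-resolution frames
        b, c, t, h, w = x.shape
        # Spatial pixel-unshuffle per frame: fold space into channels.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = F.pixel_unshuffle(x, self.downscale)
        x = x.reshape(b, t, -1, h // self.downscale, w // self.downscale)
        x = x.permute(0, 2, 1, 3, 4)  # (B, C', T, h', w')
        # Causal temporal padding: zero-pad only on the past side.
        x = F.pad(x, (0, 0, 0, 0, self.t_kernel - 1, 0))
        return self.proj(x)

In a streaming deployment, the temporal padding would instead be filled from a small cache of the most recently projected frames, mirroring the short temporal cache the paper describes for LR-Proj-In.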

2. Distillation Pipeline

A progressive, three-stage pipeline distills a large multi-step diffusion model to an efficient one-step denoiser suitable for streaming:

  • Stage 1: Video–Image Joint SR Teacher: A full-attention DiT is trained on paired LR-HR video clips (up to 89 frames at $768\times1280$), including single frames as 1-frame "videos." The loss is flow matching (FM) [1], formulated as:

\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\, z\sim\mathcal{N}(0,I)}\left\| \frac{1}{\sqrt{\alpha_t}} \left(z - \sqrt{\bar{\alpha}_t}\, z_{\text{HR}}\right) - G_{\text{full}}(\cdot) \right\|_2^2

$G_{\text{full}}$ denotes the teacher, using block-diagonal temporal attention.

  • Stage 2: Block-Sparse Causal Adaptation: The teacher is adapted with causal masking and block-sparse 3D self-attention, further trained using FM on video sequences; the LR-Proj-In module maintains a short temporal cache for causality.
  • Stage 3: One-Step Student Distillation: The final student, $G_{\text{one}}$, matches the teacher architecture but is trained to denoise in a single step, guided by a distribution-matching distillation (DMD) strategy [2]. The composite loss is:

\mathcal{L} = \mathcal{L}_{\mathrm{DMD}} + \mathcal{L}_{\mathrm{FM}} + \|x_{\mathrm{pred}} - x_{\mathrm{gt}}\|_2^2 + \lambda\, \mathcal{L}_{\mathrm{LPIPS}}(x_{\mathrm{pred}}, x_{\mathrm{gt}})

with $\lambda = 2$ and additional data augmentation from the RealBasicVSR pipeline.

This pipeline enables the compression of expensive multi-step denoising into a single, causal, and computationally efficient sampling operation (Zhuang et al., 14 Oct 2025).
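As a rough illustration of how the Stage 3 composite objective could be assembled, the sketch below combines the reconstruction and LPIPS terms with the DMD and FM terms passed in as precomputed values. The helper arguments `dmd_term` and `fm_term` are hypothetical stand-ins, the VGG backbone for LPIPS is an assumption, and the weighting follows the λ = 2 reported above; this is not the authors' implementation.

import torch
import lpips  # pip install lpips; perceptual-similarity package by Zhang et al.

# Perceptual metric for the LPIPS terms (VGG backbone is an assumption).
lpips_fn = lpips.LPIPS(net='vgg')

def student_distillation_loss(x_pred, x_gt, dmd_term, fm_term, lam=2.0):
    """Composite one-step distillation loss (sketch, image-shaped tensors).

    x_pred, x_gt : (B, 3, H, W) tensors scaled to [-1, 1], as LPIPS expects.
    dmd_term     : precomputed distribution-matching distillation loss
                   (its computation involves the frozen teacher and a
                   separately trained score network, not shown here).
    fm_term      : precomputed flow-matching loss on the student outputs.
    """
    recon = torch.mean((x_pred - x_gt) ** 2)       # pixel L2 term
    perceptual = lpips_fn(x_pred, x_gt).mean()     # LPIPS term
    return dmd_term + fm_term + recon + lam * perceptual

The DMD term is kept abstract because distribution matching requires maintaining additional networks beyond the student; only the reconstruction and perceptual terms are spelled out here.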

3. Locality-Constrained Sparse Attention

Full spatiotemporal attention in 3D ($O((THW)^2)$ complexity) is computationally prohibitive for high-resolution streaming. Stream-DiffVSR introduces block-sparse attention with spatial locality masks:

  • Tokens are partitioned into $(t_b, h_b, w_b) = (2, 8, 8)$ blocks. Block-level attention is computed via pooled queries and keys.
  • For each query block, the top-$k$ neighbor blocks are selected, and fine attention is computed among their $\sim 128$ tokens.
  • A locality mask enforces a spatial radius $r$, restricting attention for a query token at $(t, i, j)$ to positions $(t', i', j')$ with $|i' - i| \le r$, $|j' - j| \le r$, and $t' \le t$ (for causality):

M_{(t,i,j),(t',i',j')} = \mathbf{1}\left[\,|i'-i|\le r \ \wedge\ |j'-j|\le r \ \wedge\ t'\le t\,\right]

  • Computational complexity is reduced to $O(THW\, r^2 d)$.

This attention regime enables scalability to ultra-high resolutions (up to $1536 \times 2688$) while preventing aliasing artifacts from out-of-range rotary positional encoding (Zhuang et al., 14 Oct 2025).
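To make the masking rule concrete, here is a small sketch that materializes the causal locality mask for a toy token grid. A real block-sparse kernel would never build this dense mask explicitly; the grid sizes and radius below are illustrative assumptions.

import torch

def causal_locality_mask(T, H, W, r):
    """Dense boolean mask M[(t,i,j),(t',i',j')] for illustration only.

    True where the key token (t', i', j') lies within spatial radius r of the
    query token (t, i, j) and does not come from a future frame (t' <= t).
    """
    t_idx, i_idx, j_idx = torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    )
    t_q, i_q, j_q = (x.reshape(-1, 1) for x in (t_idx, i_idx, j_idx))
    t_k, i_k, j_k = (x.reshape(1, -1) for x in (t_idx, i_idx, j_idx))
    spatial = (i_k - i_q).abs().le(r) & (j_k - j_q).abs().le(r)
    causal = t_k <= t_q
    return spatial & causal  # shape: (T*H*W, T*H*W)

# Toy example: 3 frames of an 8x8 latent grid with radius r = 2.
mask = causal_locality_mask(T=3, H=8, W=8, r=2)
print(mask.shape, mask.float().mean())  # fraction of allowed attention pairs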

4. Efficient Decoder Design

The TC decoder replaces the original high-capacity 3D VAE decoder, achieving a significant speedup:

  • Architecture: Accepts the one-step latent $z_t$ and a downsampled, pixel-shuffled LR frame $x_{\text{LR},t}$; the concatenated inputs pass through four Conv3D layers (channel-reducing, SiLU activations, group normalization), followed by a pixel-shuffle for $4\times$ upscaling.
  • Parameter Count: ~2M versus ~10M for the original design.
  • Speed: Decoding time for 101-frame $768\times1408$ clips is reduced from 1.6 s to 0.23 s, a $7\times$ improvement, with a PSNR drop below 0.5 dB and negligible perceptual loss.
  • Loss Function:

\mathcal{L}_{\mathrm{dec}} = \|x_{\text{pred}} - x_{\text{gt}}\|_2^2 + \lambda\, \mathcal{L}_{\mathrm{LPIPS}}(x_{\text{pred}}, x_{\text{gt}}) + \|x_{\text{pred}} - x_{\text{wan}}\|_2^2 + \lambda\, \mathcal{L}_{\mathrm{LPIPS}}(x_{\text{pred}}, x_{\text{wan}})

with $\lambda = 2$.

Conditioning on the LR input allows channel and depth reduction, focusing decoder capacity on rendering high-frequency details (Zhuang et al., 14 Oct 2025).
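The following is a minimal PyTorch sketch of a tiny conditional decoder in the spirit described above. The exact channel widths, normalization groups, and the way the LR frame is aligned with the latent are assumptions for illustration, not the paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConditionalDecoder(nn.Module):
    """Sketch of a TC-style decoder: latent + pixel-unshuffled LR frame in,
    HR frame out via four channel-reducing Conv3D stages and a 4x pixel-shuffle."""

    def __init__(self, z_ch=16, lr_ch=3, scale=4, widths=(64, 48, 32, 24)):
        super().__init__()
        in_ch = z_ch + lr_ch * scale ** 2  # latent + unshuffled LR condition
        chans = (in_ch,) + widths
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv3d(chans[i], chans[i + 1], kernel_size=3, padding=1),
                nn.GroupNorm(8, chans[i + 1]),
                nn.SiLU(),
            )
            for i in range(4)
        )
        # Produce 3 * scale^2 channels so a spatial pixel-shuffle yields RGB.
        self.to_rgb = nn.Conv3d(widths[-1], 3 * scale ** 2, kernel_size=1)
        self.scale = scale

    def forward(self, z, x_lr):
        # z:    (B, z_ch, T, h, w)  one-step denoised latent
        # x_lr: (B, 3, T, H, W) with H = scale*h, W = scale*w (assumed)
        b, _, t, H, W = x_lr.shape
        lr = x_lr.permute(0, 2, 1, 3, 4).reshape(b * t, 3, H, W)
        lr = F.pixel_unshuffle(lr, self.scale)
        lr = lr.reshape(b, t, -1, H // self.scale, W // self.scale)
        lr = lr.permute(0, 2, 1, 3, 4)
        feat = torch.cat([z, lr], dim=1)
        for blk in self.blocks:
            feat = blk(feat)
        feat = self.to_rgb(feat)  # (B, 3*scale^2, T, h, w)
        feat = feat.permute(0, 2, 1, 3, 4).reshape(
            b * t, -1, H // self.scale, W // self.scale)
        out = F.pixel_shuffle(feat, self.scale)  # (B*T, 3, H, W)
        return out.reshape(b, t, 3, H, W).permute(0, 2, 1, 3, 4)

Conditioning on the pixel-unshuffled LR frame at full latent resolution is what allows such a small channel budget: the decoder only has to synthesize residual high-frequency detail rather than the full image.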

5. Inference Protocol and Real-Time Deployment

Streaming inference is implemented as a loop over arriving LR frames:

Initialize KV_cache ← empty list
for t = 1, 2, … do
  x_LR ← get_next_LR_frame()
  ε ← sample_normal(size=z_shape)
  # 1-step denoising
  z_pred, KV_cache ← DiT_one_step(
      input_noise=ε,
      cond_LR=x_LR,
      cache=KV_cache,
      θ=θ_one)
  # Decode
  x_SR ← TC_decoder(z_pred, x_LR; θ_dec)
  emit_frame(x_SR)
  # Cache management
  if len(KV_cache) > W: pop_oldest(KV_cache)
end for

Deployment optimization includes mixed-precision computation, fused kernel operations, CUDA stream separation, small-batch parallelism, use of FlashAttention/Triton kernels, and NVLink memory sharing. This approach enables throughput of 17 FPS at $768\times1408$ resolution per A100 GPU (Zhuang et al., 14 Oct 2025).
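As one example of the kind of deployment optimization listed above, the per-frame step can be wrapped in automatic mixed precision with a sliding-window cache eviction. The callables `dit_one_step` and `tc_decoder` are hypothetical stand-ins, not an actual Stream-DiffVSR API.

import torch

@torch.inference_mode()
def sr_step(dit_one_step, tc_decoder, x_lr, kv_cache, z_shape, window=85):
    """Run one streaming SR step under mixed precision (sketch).

    dit_one_step / tc_decoder are assumed callables; kv_cache is a list of
    cached key/value tensors from past frames.
    """
    noise = torch.randn(z_shape, device=x_lr.device)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        z_pred, kv_cache = dit_one_step(noise, x_lr, kv_cache)
        x_sr = tc_decoder(z_pred, x_lr)
    # Sliding-window cache eviction: keep only the most recent `window` entries.
    if len(kv_cache) > window:
        kv_cache = kv_cache[-window:]
    return x_sr.float(), kv_cache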

6. Quantitative Performance and Comparison

Stream-DiffVSR demonstrates state-of-the-art latency and competitive perceptual quality compared to prior diffusion-based VSR systems. The table below summarizes performance reported for a 101-frame $768\times1408$ video:

| Method | FPS | Peak Mem (GB) | Params (M) | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| Upscale-A-Video (30 steps) | 0.12 | 18.4 | 1087 | 23.19 | 0.6075 | 0.4585 |
| STAR (15 steps) | 0.15 | 24.9 | 2493 | 23.19 | 0.6388 | 0.4705 |
| DOVE (1 step) | 1.39 | 25.4 | 10549 | 24.39 | 0.6651 | 0.4011 |
| SeedVR2-3B (1 step) | 1.43 | 52.9 | 3391 | 23.05 | 0.6248 | 0.3876 |
| Stream-DiffVSR | 16.92 | 11.1 | 1752 | 23.31 | 0.6110 | 0.3866 |

Stream-DiffVSR achieves a $\sim 12\times$ speedup over SeedVR2-3B with lower memory consumption and on-par or improved perceptual quality (lower LPIPS). End-to-end latency for 720p frames is reduced from over 4600 seconds for multi-step methods to 0.328 seconds on an RTX 4090 (Shiu et al., 29 Dec 2025, Zhuang et al., 14 Oct 2025).

7. Limitations and Prospects

Key limitations include scalability to true 4K+ resolutions (which would require further tiling or distributed strategies), the unsophisticated sliding-window KV-cache eviction (which learned, importance-based pruning could improve), and fixed prompt conditioning. Combining 2-3 denoising steps with block-sparse attention may offer improved quality-latency tradeoffs. Incorporating learned motion priors, such as lightweight optical-flow modules, is likely to enhance performance in scenes with extreme motion (Zhuang et al., 14 Oct 2025).


References:

  [1] Lipman et al., "Flow Matching for Generative Modeling," 2022.
  [2] Yin et al., "One-Step Diffusion with Distribution Matching Distillation," CVPR 2024.
