
Stream-DiffVSR: Streaming Video Super-Resolution

Updated 30 December 2025
  • The paper presents Stream-DiffVSR, a diffusion-based video super-resolution framework that reduces multi-step denoising to a single-step process for real-time streaming.
  • It integrates causal LR-projection, block-sparse attention, and a lightweight decoder to achieve significant speedups and competitive perceptual quality.
  • The approach leverages auto-regressive distillation and locality-constrained sparse attention to efficiently handle temporal context and mitigate latency in high-resolution video restoration.

Stream-DiffVSR is a streaming, low-latency, diffusion-based video super-resolution (VSR) framework designed for real-time online deployment, overcoming the substantial inference delay and lookahead requirements typical of diffusion models in video restoration. By strictly conditioning on past frames and integrating innovations such as auto-regressive distillation, block-sparse attention, and an efficient lightweight decoder, Stream-DiffVSR achieves superior perceptual quality and inference speed, making diffusion-based VSR practical for latency-sensitive settings (Shiu et al., 29 Dec 2025, Zhuang et al., 14 Oct 2025).

1. System Architecture and Streaming Pipeline

The Stream-DiffVSR framework is composed of three principal modules, forming a causal streaming pipeline:

  • Causal LR-Projection-In: Ingests each low-resolution (LR) frame independently in real time. Employs causal pixel-shuffle and 3D convolutions to generate a compact feature representation.
  • Block-Sparse Diffusion Transformer (DiT) Backbone: Utilizes sliding-window causal attention on past latent representations, enabling efficient context modeling over substantial temporal ranges. A distillation pipeline compresses inference to a single step.
  • Tiny Conditional (TC) Decoder: Reconstructs each high-resolution (HR) frame from the denoised latent and the corresponding LR input, enabling rapid generation with minimal loss in perceptual fidelity.

At inference time, each new frame $x_{\text{LR},t}$ is combined with fresh Gaussian noise, processed through the single-step denoiser, and then decoded to output $x_{\text{SR},t}$ with minimal delay:

\begin{array}{ccccc}
\cdots & x_{\text{LR},t-1} & x_{\text{LR},t} & x_{\text{LR},t+1} & \cdots \\
 & \downarrow & \downarrow & \downarrow & \\
\cdots & z_{t-1} & z_t & z_{t+1} & \cdots \\
 & \searrow & \searrow & \searrow & \\
\cdots & x_{\text{SR},t-1} & x_{\text{SR},t} & x_{\text{SR},t+1} & \cdots
\end{array}

A KV-cache of past latents and a sliding context window (default 85 frames) are maintained for efficient temporal modeling. The method's end-to-end streaming design enables continuous, frame-aligned output without future-frame dependency (Zhuang et al., 14 Oct 2025).
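To make the Causal LR-Projection-In stage more concrete, below is a minimal PyTorch sketch of a causal per-frame projection built from pixel-unshuffle and a temporally causal 3D convolution. The module name, channel widths, and kernel sizes are illustrative assumptions and are not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalLRProjectionIn(nn.Module):
    """Sketch of a causal LR-projection stage (hypothetical sizes).

    Pixel-unshuffles each LR frame into channel space, then applies a 3D
    convolution that is zero-padded only on the past side of the temporal
    axis, so the output at time t never depends on frames t+1, t+2, ...
    """

    def __init__(self, in_ch=3, feat_ch=64, downscale=4, t_kernel=3):
        super().__init__()
        self.downscale = downscale
        self.t_kernel = t_kernel
        # After pixel-unshuffle, channel count grows by downscale**2.
        self.proj = nn.Conv3d(
            in_ch * downscale ** 2, feat_ch,
            kernel_size=(t_kernel, 3, 3),
            padding=(0, 1, 1),  # temporal padding is handled causally below
        )

    def forward(self, x):
        # x: (B, C, T, H, W) low-resolution frames
        b, c, t, h, w = x.shape
        # Spatial pixel-unshuffle per frame: fold space into channels.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = F.pixel_unshuffle(x, self.downscale)
        x = x.reshape(b, t, -1, h // self.downscale, w // self.downscale)
        x = x.permute(0, 2, 1, 3, 4)  # (B, C', T, h', w')
        # Causal temporal padding: zero-pad only on the past side.
        x = F.pad(x, (0, 0, 0, 0, self.t_kernel - 1, 0))
        return self.proj(x)

In a streaming deployment, the temporal padding would instead be filled from a small cache of the most recently projected frames, mirroring the short temporal cache the paper describes for LR-Proj-In.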

2. Distillation Pipeline

A progressive, three-stage pipeline distills a large multi-step diffusion model to an efficient one-step denoiser suitable for streaming:

  • Stage 1: Video–Image Joint SR Teacher: A full-attention DiT is trained on paired LR-HR video clips (up to 89 frames at $768\times1280$), including single frames as 1-frame "videos." The loss is flow matching (FM) [1], formulated as:

\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\, z\sim\mathcal{N}(0,I)}\left\| \frac{1}{\sqrt{\alpha_t}} \left(z - \sqrt{\bar{\alpha}_t}\, z_{\text{HR}}\right) - G_{\text{full}}(\cdot) \right\|_2^2

$G_{\text{full}}$ denotes the teacher, using block-diagonal temporal attention.

  • Stage 2: Block-Sparse Causal Adaptation: The teacher is adapted with causal masking and block-sparse 3D self-attention, further trained using FM on video sequences; the LR-Proj-In module maintains a short temporal cache for causality.
  • Stage 3: One-Step Student Distillation: The final student, $G_{\text{one}}$, matches the teacher architecture but is trained to denoise in a single step, guided by a distribution-matching distillation (DMD) strategy [2]. The composite loss is:

\mathcal{L} = \mathcal{L}_{\mathrm{DMD}} + \mathcal{L}_{\mathrm{FM}} + \|x_{\mathrm{pred}} - x_{\mathrm{gt}}\|_2^2 + \lambda\, \mathcal{L}_{\mathrm{LPIPS}}(x_{\mathrm{pred}}, x_{\mathrm{gt}})

with $\lambda = 2$ and additional data augmentation from the RealBasicVSR pipeline.

This pipeline enables the compression of expensive multi-step denoising into a single, causal, and computationally efficient sampling operation (Zhuang et al., 14 Oct 2025).
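As a rough illustration of how the Stage 3 composite objective could be assembled, the sketch below combines the reconstruction and LPIPS terms with the DMD and FM terms passed in as precomputed values. The helper arguments `dmd_term` and `fm_term` are hypothetical stand-ins, the VGG backbone for LPIPS is an assumption, and the weighting follows the λ = 2 reported above; this is not the authors' implementation.

import torch
import lpips  # pip install lpips; perceptual-similarity package by Zhang et al.

# Perceptual metric for the LPIPS terms (VGG backbone is an assumption).
lpips_fn = lpips.LPIPS(net='vgg')

def student_distillation_loss(x_pred, x_gt, dmd_term, fm_term, lam=2.0):
    """Composite one-step distillation loss (sketch, image-shaped tensors).

    x_pred, x_gt : (B, 3, H, W) tensors scaled to [-1, 1], as LPIPS expects.
    dmd_term     : precomputed distribution-matching distillation loss
                   (its computation involves the frozen teacher and a
                   separately trained score network, not shown here).
    fm_term      : precomputed flow-matching loss on the student outputs.
    """
    recon = torch.mean((x_pred - x_gt) ** 2)       # pixel L2 term
    perceptual = lpips_fn(x_pred, x_gt).mean()     # LPIPS term
    return dmd_term + fm_term + recon + lam * perceptual

The DMD term is kept abstract because distribution matching requires maintaining additional networks beyond the student; only the reconstruction and perceptual terms are spelled out here.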

3. Locality-Constrained Sparse Attention

Full spatiotemporal attention in 3D ($O((THW)^2)$ complexity) is computationally prohibitive for high-resolution streaming. Stream-DiffVSR introduces block-sparse attention with spatial locality masks:

  • Tokens are partitioned into $(t_b, h_b, w_b) = (2, 8, 8)$ blocks. Block-level attention is computed via pooled queries and keys.
  • For each query block, the top-$k$ neighbor blocks are selected, and fine attention is computed among their $\sim 128$ tokens.
  • A locality mask enforces a spatial radius $r$, restricting attention for a query token at $(t, i, j)$ to positions $(t', i', j')$ with $|i' - i| \le r$, $|j' - j| \le r$, and $t' \le t$ (for causality):

M_{(t,i,j),(t',i',j')} = \mathbf{1}\left[\,|i'-i|\le r \ \wedge\ |j'-j|\le r \ \wedge\ t'\le t\,\right]

  • Computational complexity is reduced to $O(THW\, r^2 d)$.

This attention regime enables scalability to ultra-high resolutions (up to $1536 \times 2688$) while preventing aliasing artifacts from out-of-range rotary positional encoding (Zhuang et al., 14 Oct 2025).
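To make the masking rule concrete, here is a small sketch that materializes the causal locality mask for a toy token grid. A real block-sparse kernel would never build this dense mask explicitly; the grid sizes and radius below are illustrative assumptions.

import torch

def causal_locality_mask(T, H, W, r):
    """Dense boolean mask M[(t,i,j),(t',i',j')] for illustration only.

    True where the key token (t', i', j') lies within spatial radius r of the
    query token (t, i, j) and does not come from a future frame (t' <= t).
    """
    t_idx, i_idx, j_idx = torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    )
    t_q, i_q, j_q = (x.reshape(-1, 1) for x in (t_idx, i_idx, j_idx))
    t_k, i_k, j_k = (x.reshape(1, -1) for x in (t_idx, i_idx, j_idx))
    spatial = (i_k - i_q).abs().le(r) & (j_k - j_q).abs().le(r)
    causal = t_k <= t_q
    return spatial & causal  # shape: (T*H*W, T*H*W)

# Toy example: 3 frames of an 8x8 latent grid with radius r = 2.
mask = causal_locality_mask(T=3, H=8, W=8, r=2)
print(mask.shape, mask.float().mean())  # fraction of allowed attention pairs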

4. Efficient Decoder Design

The TC decoder replaces the original high-capacity 3D VAE decoder, achieving a significant speedup:

  • Architecture: Accepts the one-step latent $z_t$ and a downsampled, pixel-shuffled LR frame $x_{\text{LR},t}$; the concatenated inputs pass through four Conv3D layers (channel-reducing, SiLU activations, group normalization), followed by a pixel-shuffle for $4\times$ upscaling.
  • Parameter Count: ~2M versus ~10M for the original design.
  • Speed: Decoding time for 101-frame $768\times1408$ clips is reduced from 1.6 s to 0.23 s, a $7\times$ improvement, with a PSNR drop below 0.5 dB and negligible perceptual loss.
  • Loss Function:

\mathcal{L}_{\mathrm{dec}} = \|x_{\text{pred}} - x_{\text{gt}}\|_2^2 + \lambda\, \mathcal{L}_{\mathrm{LPIPS}}(x_{\text{pred}}, x_{\text{gt}}) + \|x_{\text{pred}} - x_{\text{wan}}\|_2^2 + \lambda\, \mathcal{L}_{\mathrm{LPIPS}}(x_{\text{pred}}, x_{\text{wan}})

with $\lambda = 2$.

Conditioning on the LR input allows channel and depth reduction, focusing decoder capacity on rendering high-frequency details (Zhuang et al., 14 Oct 2025).
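The following is a minimal PyTorch sketch of a tiny conditional decoder in the spirit described above. The exact channel widths, normalization groups, and the way the LR frame is aligned with the latent are assumptions for illustration, not the paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConditionalDecoder(nn.Module):
    """Sketch of a TC-style decoder: latent + pixel-unshuffled LR frame in,
    HR frame out via four channel-reducing Conv3D stages and a 4x pixel-shuffle."""

    def __init__(self, z_ch=16, lr_ch=3, scale=4, widths=(64, 48, 32, 24)):
        super().__init__()
        in_ch = z_ch + lr_ch * scale ** 2  # latent + unshuffled LR condition
        chans = (in_ch,) + widths
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv3d(chans[i], chans[i + 1], kernel_size=3, padding=1),
                nn.GroupNorm(8, chans[i + 1]),
                nn.SiLU(),
            )
            for i in range(4)
        )
        # Produce 3 * scale^2 channels so a spatial pixel-shuffle yields RGB.
        self.to_rgb = nn.Conv3d(widths[-1], 3 * scale ** 2, kernel_size=1)
        self.scale = scale

    def forward(self, z, x_lr):
        # z:    (B, z_ch, T, h, w)  one-step denoised latent
        # x_lr: (B, 3, T, H, W) with H = scale*h, W = scale*w (assumed)
        b, _, t, H, W = x_lr.shape
        lr = x_lr.permute(0, 2, 1, 3, 4).reshape(b * t, 3, H, W)
        lr = F.pixel_unshuffle(lr, self.scale)
        lr = lr.reshape(b, t, -1, H // self.scale, W // self.scale)
        lr = lr.permute(0, 2, 1, 3, 4)
        feat = torch.cat([z, lr], dim=1)
        for blk in self.blocks:
            feat = blk(feat)
        feat = self.to_rgb(feat)  # (B, 3*scale^2, T, h, w)
        feat = feat.permute(0, 2, 1, 3, 4).reshape(
            b * t, -1, H // self.scale, W // self.scale)
        out = F.pixel_shuffle(feat, self.scale)  # (B*T, 3, H, W)
        return out.reshape(b, t, 3, H, W).permute(0, 2, 1, 3, 4)

Conditioning on the pixel-unshuffled LR frame at full latent resolution is what allows such a small channel budget: the decoder only has to synthesize residual high-frequency detail rather than the full image.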

5. Inference Protocol and Real-Time Deployment

Streaming inference is implemented as a loop over arriving LR frames:

Initialize KV_cache ← empty list
for t = 1, 2, … do
  x_LR ← get_next_LR_frame()
  ε ← sample_normal(size=z_shape)
  # 1-step denoising
  z_pred, KV_cache ← DiT_one_step(
      input_noise=ε,
      cond_LR=x_LR,
      cache=KV_cache,
      θ=θ_one)
  # Decode
  x_SR ← TC_decoder(z_pred, x_LR; θ_dec)
  emit_frame(x_SR)
  # Cache management
  if len(KV_cache) > W: pop_oldest(KV_cache)
end for

Deployment optimization includes mixed-precision computation, fused kernel operations, CUDA stream separation, small-batch parallelism, use of FlashAttention/Triton kernels, and NVLink memory sharing. This approach enables throughput of 17 FPS at $768\times1408$ resolution per A100 GPU (Zhuang et al., 14 Oct 2025).
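As one example of the kind of deployment optimization listed above, the per-frame step can be wrapped in automatic mixed precision with a sliding-window cache eviction. The callables `dit_one_step` and `tc_decoder` are hypothetical stand-ins, not an actual Stream-DiffVSR API.

import torch

@torch.inference_mode()
def sr_step(dit_one_step, tc_decoder, x_lr, kv_cache, z_shape, window=85):
    """Run one streaming SR step under mixed precision (sketch).

    dit_one_step / tc_decoder are assumed callables; kv_cache is a list of
    cached key/value tensors from past frames.
    """
    noise = torch.randn(z_shape, device=x_lr.device)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        z_pred, kv_cache = dit_one_step(noise, x_lr, kv_cache)
        x_sr = tc_decoder(z_pred, x_lr)
    # Sliding-window cache eviction: keep only the most recent `window` entries.
    if len(kv_cache) > window:
        kv_cache = kv_cache[-window:]
    return x_sr.float(), kv_cache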

6. Quantitative Performance and Comparison

Stream-DiffVSR demonstrates state-of-the-art latency and competitive perceptual quality compared to prior diffusion-based VSR systems. The table below summarizes performance reported for a 101-frame $768\times1408$ video:

| Method | FPS | Peak Mem (GB) | Params (M) | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| Upscale-A-Video (30 steps) | 0.12 | 18.4 | 1087 | 23.19 | 0.6075 | 0.4585 |
| STAR (15 steps) | 0.15 | 24.9 | 2493 | 23.19 | 0.6388 | 0.4705 |
| DOVE (1 step) | 1.39 | 25.4 | 10549 | 24.39 | 0.6651 | 0.4011 |
| SeedVR2-3B (1 step) | 1.43 | 52.9 | 3391 | 23.05 | 0.6248 | 0.3876 |
| Stream-DiffVSR | 16.92 | 11.1 | 1752 | 23.31 | 0.6110 | 0.3866 |

Stream-DiffVSR achieves a $\sim 12\times$ speedup over SeedVR2-3B with lower memory consumption and on-par or improved perceptual quality (lower LPIPS). End-to-end latency for 720p frames is reduced from over 4600 seconds for multi-step methods to 0.328 seconds on an RTX 4090 (Shiu et al., 29 Dec 2025, Zhuang et al., 14 Oct 2025).

7. Limitations and Prospects

Key limitations include scalability to true 4K+ resolutions (which would require further tiling or distributed strategies), the unsophisticated sliding-window KV-cache eviction (which learned, importance-based pruning could improve), and fixed prompt conditioning. Combining 2-3 denoising steps with block-sparse attention may offer improved quality-latency tradeoffs. Incorporating learned motion priors, such as lightweight optical-flow modules, is likely to enhance performance in scenes with extreme motion (Zhuang et al., 14 Oct 2025).


References:

  [1] Lipman et al., "Flow Matching for Generative Modeling," 2022.
  [2] Yin et al., "One-Step Diffusion with Distribution Matching Distillation," CVPR 2024.
