InfVSR: Streaming Diffusion VSR

Updated 5 October 2025
  • InfVSR is a video super-resolution framework that reformulates VSR as an autoregressive one-step diffusion paradigm, ensuring both local smoothness and global semantic coherence.
  • It leverages causal adaptation of pre-trained diffusion transformers with a rolling KV-cache and joint visual guidance to maintain temporal context and semantic consistency.
  • Innovative supervision techniques with patch-wise pixel loss and cross-chunk distribution matching enable single-step per-chunk inference, achieving up to 58× speed-up on long-form videos.

InfVSR designates a video super-resolution (VSR) framework built to overcome key challenges in scaling high-quality super-resolution methods to arbitrarily long video sequences. It does so by reformulating VSR as an autoregressive one-step diffusion paradigm, leveraging pre-trained video diffusion priors with causal adaptation for efficient streaming inference. InfVSR integrates architectural modifications, process distillation, and new evaluation methodologies to deliver stronger semantic-level temporal consistency and dramatically improved computational efficiency on long-form videos.

1. Reformulation as Autoregressive-One-Step Diffusion

InfVSR introduces an autoregressive one-step diffusion (AR-OSD) paradigm for VSR. Instead of processing complete sequences in batch, videos are partitioned into non-overlapping chunks that are processed sequentially. Within each chunk, a one-step diffusion process built on a pre-trained text-to-video (T2V) diffusion prior recovers the high-resolution frames. Temporal consistency across chunks is enforced by propagating a compact autoregressive temporal context, denoted $\mathcal{P}_k$, yielding the factorized joint conditional distribution

$$p(\mathbf{y}_{1:K} \mid \mathbf{x}_{1:K}) = \prod_{k=1}^K p(\mathbf{y}_k \mid \mathbf{x}_k, \mathcal{P}_k), \qquad \mathbf{y}_k = G_\theta(\mathbf{x}_k, \mathcal{P}_k).$$

This approach maintains both local smoothness within chunks and global semantic coherence across arbitrarily long sequences. Streaming inference with constant per-chunk memory usage allows videos with thousands of frames to be processed, overcoming the inefficiency and poor scalability of multi-step denoising approaches and the discontinuities and artifacts of temporal decomposition methods (Zhang et al., 1 Oct 2025).
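The factorization above can be read as a simple streaming loop: the video is consumed chunk by chunk, each chunk is restored in a single generator call, and a compact context carries information forward. The sketch below illustrates this control flow only; the `generator` callable, the chunk size, and the way context is summarized are hypothetical placeholders, not the paper's implementation.

```python
import torch

def streaming_vsr(lr_frames: torch.Tensor, generator, chunk_size: int = 8):
    """Restore an arbitrarily long LR video chunk by chunk.

    lr_frames: (T, C, H, W) low-resolution frames.
    generator: callable mapping (chunk, context) -> (sr_chunk, new_context);
               a stand-in for the one-step diffusion generator G_theta.
    """
    context = None          # propagated temporal context P_k (None for the first chunk)
    outputs = []
    for start in range(0, lr_frames.shape[0], chunk_size):
        chunk = lr_frames[start:start + chunk_size]          # x_k
        with torch.no_grad():
            sr_chunk, context = generator(chunk, context)    # y_k = G_theta(x_k, P_k)
        outputs.append(sr_chunk.cpu())                       # offload to keep GPU memory constant
    return torch.cat(outputs, dim=0)
```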

2. Causal Adaptation of Pre-trained Diffusion Transformers

To enable autoregressive streaming and long-term semantic consistency, InfVSR adapts a pre-trained 3D Diffusion Transformer (DiT) model for causal inference through two key mechanisms:

Rolling KV-cache for Local Consistency:

Self-attention layers maintain a rolling key–value cache, concatenating positional embeddings of current and previous chunks to build local temporal context. This structure allows retrieval of a finite set of past representations, enabling temporal smoothness without unbounded memory requirements.
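A bounded key–value cache of this kind can be sketched as follows; the window length, tensor shapes, and the policy of evicting the oldest chunk's entries are illustrative assumptions rather than the exact cache design used in the paper. The current chunk's keys/values are added with `extend` before `attend` is called.

```python
from collections import deque

import torch
import torch.nn.functional as F

class RollingKVCache:
    """Keeps keys/values from the most recent chunks for causal self-attention."""

    def __init__(self, max_chunks: int = 2):
        self.keys = deque(maxlen=max_chunks)    # oldest chunk entries are evicted automatically
        self.values = deque(maxlen=max_chunks)

    def extend(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: (B, heads, tokens_per_chunk, dim) for the current chunk
        self.keys.append(k.detach())
        self.values.append(v.detach())

    def attend(self, q: torch.Tensor) -> torch.Tensor:
        # Concatenate cached keys/values along the token axis so queries of the
        # current chunk attend to a finite window of past representations.
        k = torch.cat(list(self.keys), dim=2)
        v = torch.cat(list(self.values), dim=2)
        return F.scaled_dot_product_attention(q, k, v)
```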

Joint Visual Guidance for Global Consistency:

Low-resolution input frames, which preserve essential semantic structure, are processed by a visual encoder (e.g., DAPE) to produce global visual prompts. These prompts are injected via cross-attention into the DiT backbone, providing strong semantic anchors and enhancing cross-chunk coherence. Together, these mechanisms convert the DiT from a full-attention generator into a causal, autoregressive video super-resolution model that balances short-term smoothness with long-term semantic consistency (Zhang et al., 1 Oct 2025).
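The prompt injection can be pictured as a standard cross-attention block in which the DiT tokens form the queries and the encoder's prompt tokens form the keys and values. The module below is a generic sketch with assumed dimensions, not the specific conditioning layer of InfVSR.

```python
import torch
import torch.nn as nn

class VisualPromptCrossAttention(nn.Module):
    """Cross-attention from DiT tokens (queries) to global visual prompts (keys/values)."""

    def __init__(self, dim: int = 1024, prompt_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads,
                                          kdim=prompt_dim, vdim=prompt_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # tokens:  (B, N, dim)        latent video tokens inside the DiT block
        # prompts: (B, M, prompt_dim) global visual prompts from the LR encoder
        attended, _ = self.attn(self.norm(tokens), prompts, prompts)
        return tokens + attended      # residual injection of semantic anchors
```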

3. Efficient Diffusion Distillation via Patch-wise and Distribution Matching Supervision

InfVSR eliminates the slow, multi-step denoising process typical of diffusion-based VSR by distilling the generative process into a single-step mapping per chunk. The distillation employs two critical forms of supervision:

Patch-wise Pixel Supervision:

In memory-constrained settings, spatial patches are randomly cropped from the latent space (operating within a 3D VAE). Let $\hat{\mathbf{z}} \in \mathbb{R}^{B\times C\times F\times H\times W}$ denote the latent video and $D(\cdot)$ the VAE decoder. Aligned patch-extraction operators $\mathcal{C}_{\text{lat}}(\cdot)$ (latent space) and $\mathcal{C}_{\text{pix}}(\cdot)$ (pixel space) produce matching crops:

$$\hat{\mathbf{x}}_{\text{sr}} = D(\mathcal{C}_{\text{lat}}(\hat{\mathbf{z}})), \qquad \hat{\mathbf{x}}_{\text{gt}} = \mathcal{C}_{\text{pix}}(\mathbf{x}_{\text{gt}})$$

The overall pixel-level loss consists of fidelity and temporal terms:

$$\mathcal{L}_{\text{pix}} = \lambda_{\text{mse}}\mathcal{L}_{\text{mse}} + \lambda_{\text{dists}}\mathcal{L}_{\text{dists}} + \lambda_{\text{temp}}\mathcal{L}_{\text{temp}}$$

where

$$\mathcal{L}_{\text{temp}} = \bigl\lVert (\hat{\mathbf{x}}_{\text{gt}}^{t+1}-\hat{\mathbf{x}}_{\text{gt}}^{t}) - (\hat{\mathbf{x}}_{\text{sr}}^{t+1}-\hat{\mathbf{x}}_{\text{sr}}^{t}) \bigr\rVert^2$$
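The key detail is that the latent crop and the ground-truth pixel crop must cover the same spatial region, so the crop coordinates are shared and scaled by the VAE's spatial downsampling factor. The sketch below illustrates that alignment and the temporal-difference term; the `decode` interface, the downsample factor of 8, and the assumption that decoding preserves the frame count are all placeholders, not the paper's exact implementation (the DISTS term is omitted).

```python
import torch
import torch.nn.functional as F

def patchwise_pixel_loss(z_hat, x_gt, decode, patch_latent=32, downsample=8,
                         lam_mse=1.0, lam_temp=1.0):
    """z_hat: (B, C, F, h, w) latent video; x_gt: (B, 3, F, H, W) ground-truth pixels.
    decode: VAE decoder mapping latents to pixels (hypothetical interface,
    assumed here to preserve the frame count)."""
    B, _, _, h, w = z_hat.shape
    # Shared random crop coordinates, defined in latent units.
    top = torch.randint(0, h - patch_latent + 1, (1,)).item()
    left = torch.randint(0, w - patch_latent + 1, (1,)).item()

    z_patch = z_hat[..., top:top + patch_latent, left:left + patch_latent]      # C_lat(z_hat)
    x_sr = decode(z_patch)                                                      # D(C_lat(z_hat))
    x_gt_patch = x_gt[..., top * downsample:(top + patch_latent) * downsample,
                      left * downsample:(left + patch_latent) * downsample]     # C_pix(x_gt)

    mse = F.mse_loss(x_sr, x_gt_patch)
    # Temporal term: match frame-to-frame differences between SR and GT patches.
    d_sr = x_sr[:, :, 1:] - x_sr[:, :, :-1]
    d_gt = x_gt_patch[:, :, 1:] - x_gt_patch[:, :, :-1]
    temp = (d_gt - d_sr).pow(2).mean()
    return lam_mse * mse + lam_temp * temp
```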

Cross-Chunk Distribution Matching (DMD Loss):

To preserve semantic consistency over long sequences, features concatenated from adjacent chunks are matched to a pre-trained teacher video model's feature distribution by minimizing a KL divergence:

$$\mathcal{L}_{\text{DMD}} = \mathbb{E}_t\left[ \mathrm{KL}\left(p_{\text{gen}} \,\|\, p_{\text{data}}\right) \right]$$

This cross-chunk, distribution-level supervision counteracts semantic drift and ensures that the distilled diffusion model maintains content coherence over extended temporal windows (Zhang et al., 1 Oct 2025).
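In distribution-matching distillation more generally, the KL gradient with respect to the generator output is commonly approximated by the difference between two score estimates, one from a frozen teacher ("real" score) and one from an auxiliary model tracking the generator ("fake" score). The surrogate below reproduces that gradient via a stop-gradient trick; the score callables and the cross-chunk concatenation are schematic assumptions, not InfVSR's exact objective.

```python
import torch

def dmd_loss(y_prev, y_cur, real_score, fake_score, t):
    """Cross-chunk distribution-matching surrogate.

    y_prev, y_cur: generator outputs (latents) for adjacent chunks.
    real_score, fake_score: callables estimating scores of the data and
    generator distributions at noise level t (hypothetical interfaces).
    """
    # Concatenate adjacent chunks along the frame axis so the supervision
    # spans a chunk boundary rather than a single chunk.
    y = torch.cat([y_prev, y_cur], dim=2)
    with torch.no_grad():
        grad = fake_score(y, t) - real_score(y, t)   # approx. d KL(p_gen || p_data) / d y
    # Surrogate whose gradient w.r.t. y is proportional to `grad`.
    return 0.5 * (y - (y - grad).detach()).pow(2).mean()
```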

4. Benchmarking for Long-form Video and Semantic Consistency Metrics

InfVSR's authors propose the MovieLQ benchmark to address the gap in evaluation for long-form video super-resolution. This dataset comprises real-world videos of up to 1000 frames, sourced under Creative Commons licensing and exhibiting authentic real-world degradations rather than synthetic corruption.

Evaluation Metrics Used:

  • Full-reference: PSNR, SSIM, LPIPS, DISTS
  • No-reference: MUSIQ, CLIP-IQA, DOVER (video quality metric)
  • Temporal Consistency: pixel-level flow-warping error $E^*_{\text{warp}}$ (a sketch of this computation appears after the list)
  • Semantic Consistency (VBench): background consistency (BC), subject consistency (SC), motion smoothness (MS)
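As an illustration of the pixel-level temporal metric referenced above, the warping error compares each frame against its neighbour warped by optical flow. The flow estimator is left as a placeholder and the bilinear warping is a generic implementation, not the benchmark's exact protocol.

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` (B, C, H, W) with optical flow (B, 2, H, W) in pixels."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).to(frame)                        # (2, H, W), match dtype/device
    coords = grid.unsqueeze(0) + flow                                    # sampling positions
    # Normalize to [-1, 1] for grid_sample (x first, then y).
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)                # (B, H, W, 2)
    return F.grid_sample(frame, grid_norm, align_corners=True)

def warping_error(frames: torch.Tensor, estimate_flow) -> torch.Tensor:
    """Mean error between each frame and its flow-warped predecessor.

    frames: (T, C, H, W); estimate_flow(a, b) -> flow mapping pixels of a to b (placeholder)."""
    errs = []
    for t in range(1, frames.shape[0]):
        cur, prev = frames[t:t + 1], frames[t - 1:t]
        flow = estimate_flow(cur, prev)            # flow from the current frame back to the previous one
        errs.append((cur - warp(prev, flow)).abs().mean())
    return torch.stack(errs).mean()
```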

This comprehensive framework is designed to judge InfVSR not only on frame fidelity and perceptual metrics but also on the preservation of semantic and temporal coherence over long sequences (Zhang et al., 1 Oct 2025).

5. Results and Comparative Performance

InfVSR attains state-of-the-art performance across standard VSR benchmarks, including UDM10, SPMCS, MVSR4x, VideoLQ, and MovieLQ.

  • InfVSR consistently ranks as best or second-best in PSNR, SSIM, LPIPS, and DISTS.
  • The autoregressive paradigm achieves improvements in background/subject consistency (semantic VBench metrics) and reduces pixel-level temporal warping errors.
  • Experiments show up to 58× speed-up over multi-step diffusion baselines such as MGLD-VSR.
  • Memory usage is constant per chunk (e.g., processing a 33-frame 720p video chunk requires ~20.4 GB and ~6.82 s for inference). Inference time grows linearly with video length, preserving scalability.

A plausible implication is that InfVSR is suitable for both offline and streaming processing of long-form videos, such as films or broadcast content, without degradation in output sharpness or consistency (Zhang et al., 1 Oct 2025).

6. Impact and Implications

InfVSR’s paradigm shift toward AR-OSD diffusion, causal adaptation of DiT, and scalable supervision strategies directly addresses long-standing limitations of VSR on extended video sequences. Its streaming inference, semantic consistency, and efficiency gains open the possibility of applying super-resolution to large-scale archives, real-time streaming, and other scenarios where memory and latency are tightly constrained.

The architectural innovations and semantic-level metrics introduced by InfVSR provide a methodological foundation for subsequent research in VSR and related video enhancement fields. A plausible implication is the broader adoption of chunkwise autoregressive diffusion modeling, rolling context caches, and visual prompt sharing for scalable video processing tasks beyond super-resolution (Zhang et al., 1 Oct 2025).
