
SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Published 5 Jun 2025 in cs.CV (arXiv:2506.05301v1)

Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding the window inconsistency observed under high-resolution VR when using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss, without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.

Summary

  • The paper introduces a one-step video restoration model that converts a multi-step diffusion transformer into an efficient generator via adversarial post-training.
  • It leverages adaptive window attention and progressive distillation to mitigate artifacts and preserve quality on high-resolution videos.
  • Benchmarking shows that SeedVR2 achieves over four times faster inference than traditional methods while excelling on perceptual metrics.

SeedVR2 is a novel video restoration (VR) model designed for one-step inference, addressing the significant computational cost of multi-step diffusion models in high-resolution VR. It builds upon the concept of Diffusion Adversarial Post-Training (APT) (Lin et al., 14 Jan 2025), where a pre-trained multi-step diffusion model is used as initialization and then the entire network is fine-tuned using an adversarial training objective against real data.

The core idea is to convert a powerful multi-step diffusion transformer, such as SeedVR, into a one-step generator while maintaining or improving its restoration quality. This is achieved through a combination of architectural enhancements and training procedure improvements tailored for video restoration.
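In schematic terms, the difference between the multi-step sampler and the one-step generator looks like this (a minimal sketch; `denoise`, `generator`, and the step count are placeholders, not the paper's API):

```python
def multi_step_restore(denoise, x_lq, num_steps=50):
    # Conventional diffusion VR: iteratively refine the estimate,
    # calling the network once per sampling step.
    x = x_lq
    for t in reversed(range(num_steps)):
        x = denoise(x, t)
    return x

def one_step_restore(generator, x_lq):
    # SeedVR2-style VR: a single forward pass of the generator,
    # obtained from the multi-step model via progressive distillation
    # and adversarial post-training.
    return generator(x_lq)
```

The cost asymmetry is the whole point: fifty network evaluations versus one, with the adversarial objective compensating for the lost refinement steps.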

Key Implementation Details and Enhancements:

  1. Adaptive Window Attention:
    • Problem: Standard window attention mechanisms with predefined window sizes can cause visible boundary artifacts when applied to high-resolution videos, especially when the test resolution differs significantly from the training resolution or has a different aspect ratio. This is partly due to insufficient training on diverse window sizes and inconsistent handling of variable-sized windows near boundaries.
    • Solution: SeedVR2 introduces an adaptive window attention mechanism. During training, the window size $(p_t, p_h, p_w)$ is dynamically calculated based on the input feature dimensions $(d_t, d_h, d_w)$ and predefined window numbers $(n_t, n_h, n_w)$ using:

      $p_t = \left\lceil \frac{\min(d_t, 30)}{n_t} \right\rceil, \quad p_h = \left\lceil \frac{d_h}{n_h} \right\rceil, \quad p_w = \left\lceil \frac{d_w}{n_w} \right\rceil$

      This ensures varied window sizes during training, improving generalization.

    • Test-time Robustness: For inference on arbitrary high resolutions $(\hat{d}_t, \hat{d}_h, \hat{d}_w)$, a spatial proxy resolution $(\tilde{d}_h, \tilde{d}_w)$ consistent with the training aspect ratio is derived:

      $\tilde{d}_h = \sqrt{d_h \times d_w \times \frac{\hat{d}_h}{\hat{d}_w}}, \quad \tilde{d}_w = \sqrt{d_h \times d_w \times \frac{\hat{d}_w}{\hat{d}_h}}$

      where $(d_h, d_w)$ is the training spatial resolution. The final test-time window size is then calculated using $(\hat{d}_t, \tilde{d}_h, \tilde{d}_w)$ in the training-time formula. This consistent windowing strategy substantially reduces boundary artifacts in high-resolution outputs (e.g., 1080p), as shown in ablation studies.

  2. Training Procedures:
    • Progressive Distillation: To prevent a large performance drop when converting a multi-step model directly to a one-step generator via adversarial training, SeedVR2 employs a progressive distillation stage before adversarial training. The model is distilled from 64 sampling steps down to 1 step with a stride of 2, using a simple mean squared error loss. This step helps bridge the gap between the initial multi-step model and the one-step target, maintaining restoration capabilities, especially for heavy degradations. Training data temporal length is also progressively increased.
    • Loss Improvements for Stability: Large-scale adversarial training can be unstable. SeedVR2 incorporates several loss enhancements:

      • RpGAN and Approximated R2 Regularization: The non-saturating GAN loss used in prior APT work is replaced with RpGAN loss [Jolicoeur-Martineau 2019] to mitigate potential mode dropping. An approximated R2 regularization (Eq. 3) is added to penalize the gradient norm of the discriminator on fake data, further improving stability:

        $\mathcal{L}_{aR2} = \| D(\hat{x}, c) - D(\mathcal{N}(\hat{x}, \sigma\mathbf{I}), c) \|^2_2$

        where $\hat{x}$ is the generated sample, $c$ is the condition, and $\sigma$ is the noise variance. These additions lead to more stable training over thousands of iterations.

      • Feature Matching Loss: Computing LPIPS loss [zhang2018unreasonable] for high-resolution video is computationally prohibitive as it requires decoding to pixel space. SeedVR2 proposes an efficient feature matching loss (Eq. 4) that extracts features directly from intermediate layers of the discriminator (specifically, before the cross-attention blocks used for APT logits) and measures the L1 distance between features of predictions and ground truths:

        $\mathcal{L}_{F} = \frac{1}{3} \sum_{i=16, 26, 36} \|D^F_i(\hat{x}, c) - D^F_i(x, c)\|_1$

        This loss serves as an alternative to LPIPS, guiding the generator to produce outputs whose perceptual features match those of the ground truth. It adds minimal computational overhead, as the discriminator features are already computed for the GAN loss. Default loss weights are 1.0 each for the L1, feature matching, and GAN losses on the generator side; for the discriminator, the GAN loss is weighted 1.0 and the R1 and R2 penalties 1000.
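The two auxiliary losses above can be sketched with NumPy stand-ins (the discriminator here is a toy callable; the layer indices, `sigma`, and all names are illustrative assumptions, not the released implementation):

```python
import numpy as np

def feature_matching_loss(fake_feats, real_feats):
    # Eq. (4): average L1 distance between discriminator features of the
    # prediction and the ground truth, taken from a few intermediate
    # layers (16, 26, 36 in the paper). Reuses features already computed
    # for the GAN loss, so it adds little overhead compared to LPIPS.
    assert len(fake_feats) == len(real_feats)
    dists = [np.abs(f - r).sum() for f, r in zip(fake_feats, real_feats)]
    return sum(dists) / len(dists)

def approx_r2_penalty(disc, x_fake, cond, sigma=0.01, seed=0):
    # Eq. (3): approximated R2 regularization. Instead of differentiating
    # through the discriminator, penalize the squared change of its output
    # under a small Gaussian perturbation of the fake sample -- a
    # finite-difference proxy for the gradient-norm penalty.
    rng = np.random.default_rng(seed)
    x_noisy = x_fake + sigma * rng.standard_normal(x_fake.shape)
    diff = disc(x_fake, cond) - disc(x_noisy, cond)
    return float(np.sum(diff ** 2))
```

Both terms vanish when the prediction matches the ground truth (feature matching) or when the discriminator is locally flat around the fake sample (approximated R2), which is the stabilizing behavior the training relies on.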

Practical Considerations and Performance:

  • Computational Resources: Training is resource-intensive, requiring 72 NVIDIA H100-80G GPUs with sequence and data parallelism. Each training stage takes about one day.
  • Model Size vs. Inference Speed: SeedVR2 employs a large model, with the generator having ~8.2B parameters (7B model) or ~3.4B parameters (3B distilled model). The total system, including the discriminator, is noted to be around 16B parameters for the 7B version. Despite this size, the one-step inference dramatically reduces computation compared to multi-step diffusion models. As shown in Table B.1, SeedVR2 (7B, 1 step) takes ~270-300 seconds for a 100-frame 720p video, which is over 4 times faster than multi-step diffusion methods (e.g., SeedVR, UAV [zhou2024upscaleavideo], MGLD-VSR [yang2023mgldvsr], STAR [xie2025star]) running for 50 steps (~1200-2300 seconds).
  • Performance: Experiments on synthetic, real-world (VideoLQ [chan2022investigating]), and AIGC datasets show SeedVR2 achieves comparable or superior performance to existing methods, particularly excelling in perceptual metrics like LPIPS, DISTS, NIQE, MUSIQ, and DOVER. A user study confirms SeedVR2's strong visual quality preference over baselines.
  • Trade-offs: There is a perception-distortion trade-off governed by the L1 and feature matching loss weights. Higher weights improve fidelity but can lead to over-smoothing, while lower weights enable better visual quality guided by the GAN objective.
  • VAE Bottleneck: A significant practical limitation is the efficiency of the causal video VAE used for encoding/decoding. While the diffusion inference is one-step, the VAE takes over 95% of the total inference time for typical videos, highlighting it as a critical area for future optimization.
  • Robustness: SeedVR2 is not always robust to very heavy degradations, large motions, or inputs with very light degradations (sometimes causing oversharpening).
  • Deployment: Deploying such a large model requires substantial computational resources, although the one-step inference makes it more feasible for latency-sensitive applications compared to multi-step methods.

Real-World Applications:

SeedVR2 is designed for real-world video restoration tasks like super-resolution, denoising, and deblurring, aiming to upscale low-quality videos to high resolution with enhanced detail and realism. The one-step inference makes it potentially applicable in scenarios requiring faster processing, such as video editing pipelines, streaming services, or potentially even real-time enhancement (though the VAE bottleneck currently limits this). Its training on a mix of synthetic and real-world/AIGC data aims to improve generalization to diverse real-world conditions.

In summary, SeedVR2 represents a significant step towards efficient, high-quality video restoration by successfully translating a multi-step diffusion transformer into a one-step GAN-like model through careful architectural design (adaptive window attention) and robust adversarial training strategies (progressive distillation, improved losses). While the video VAE efficiency and robustness to extreme degradations remain challenges, SeedVR2 demonstrates the potential of one-step diffusion models for practical video restoration applications.
