- The paper introduces a one-step video restoration model that converts a multi-step diffusion transformer into an efficient generator via adversarial post-training.
- It leverages adaptive window attention and progressive distillation to mitigate artifacts and preserve quality on high-resolution videos.
- Benchmarking shows that SeedVR2 achieves over four times faster inference than traditional methods while excelling on perceptual metrics.
SeedVR2 is a novel video restoration (VR) model designed for one-step inference, addressing the significant computational cost of multi-step diffusion models in high-resolution VR. It builds upon the concept of Diffusion Adversarial Post-Training (APT) (Lin et al., 14 Jan 2025), where a pre-trained multi-step diffusion model is used as initialization and then the entire network is fine-tuned using an adversarial training objective against real data.
The core idea is to convert a powerful multi-step diffusion transformer, such as SeedVR [(2025.05301)v1], into a one-step generator while maintaining or improving its restoration quality. This is achieved through a combination of architectural enhancements and training procedure improvements tailored for video restoration.
Key Implementation Details and Enhancements:
- Adaptive Window Attention:
- Problem: Standard window attention mechanisms with predefined window sizes can cause visible boundary artifacts when applied to high-resolution videos, especially when the test resolution differs significantly from the training resolution or has a different aspect ratio. This is partly due to insufficient training on diverse window sizes and inconsistent handling of variable-sized windows near boundaries.
- Solution: SeedVR2 introduces an adaptive window attention mechanism. During training, the window size $(p_t, p_h, p_w)$ is dynamically calculated from the input feature dimensions $(d_t, d_h, d_w)$ and predefined window numbers $(n_t, n_h, n_w)$ using:
$p_t = \left\lceil \frac{\min(d_t, 30)}{n_t} \right\rceil, \quad p_h = \left\lceil \frac{d_h}{n_h} \right\rceil, \quad p_w = \left\lceil \frac{d_w}{n_w} \right\rceil$
This ensures varied window sizes during training, improving generalization.
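The training-time window sizing above is simple enough to sketch directly; the following Python snippet is an illustrative reimplementation of the formula (the function name is ours, and only the $\min(d_t, 30)$ temporal cap and the ceiling divisions come from the source):

```python
import math

def adaptive_window_size(d_t, d_h, d_w, n_t, n_h, n_w):
    """Illustrative sketch of adaptive window sizing.

    Window sizes (p_t, p_h, p_w) are derived from the feature
    dimensions (d_t, d_h, d_w) and fixed window *numbers*
    (n_t, n_h, n_w), so the sizes vary with the input:
        p_t = ceil(min(d_t, 30) / n_t)
        p_h = ceil(d_h / n_h)
        p_w = ceil(d_w / n_w)
    """
    p_t = math.ceil(min(d_t, 30) / n_t)  # temporal dim capped at 30
    p_h = math.ceil(d_h / n_h)
    p_w = math.ceil(d_w / n_w)
    return p_t, p_h, p_w

# Example: a 40x64x112 feature volume with window numbers (5, 8, 8)
print(adaptive_window_size(40, 64, 112, 5, 8, 8))  # → (6, 8, 14)
```

Because the window count is fixed while the input size varies, different training samples naturally exercise different window sizes.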
- Test-time Robustness: For inference on arbitrary high resolutions $(\hat{d}_t, \hat{d}_h, \hat{d}_w)$, a spatial proxy resolution $(\tilde{d}_h, \tilde{d}_w)$ is derived that preserves the training-resolution area while matching the test aspect ratio:
$\tilde{d}_h = \sqrt{d_h \times d_w \times \frac{\hat{d}_h}{\hat{d}_w}}, \quad \tilde{d}_w = \sqrt{d_h \times d_w \times \frac{\hat{d}_w}{\hat{d}_h}}$
where $(d_h, d_w)$ is the training spatial resolution. The final test-time window size is then calculated by applying the training-time formula to $(\hat{d}_t, \tilde{d}_h, \tilde{d}_w)$. This consistent windowing strategy substantially reduces boundary artifacts in high-resolution outputs (e.g., 1080p), as shown in ablation studies.
- Training Procedures: The one-step generator is initialized from the multi-step diffusion model and fine-tuned with adversarial post-training, combining progressive distillation with improved loss formulations (adversarial, L1, and feature matching) to stabilize training and balance fidelity against perceptual quality.
Practical Considerations and Performance:
- Computational Resources: Training is resource-intensive, requiring 72 NVIDIA H100-80G GPUs with sequence and data parallelism. Each training stage takes about one day.
- Model Size vs. Inference Speed: SeedVR2 employs a large model, with the generator having ~8.2B parameters (7B model) or ~3.4B parameters (3B distilled model). The total system, including the discriminator, is noted to be around 16B parameters for the 7B version. Despite this size, the one-step inference dramatically reduces computation compared to multi-step diffusion models. As shown in Table B.1, SeedVR2 (7B, 1 step) takes ~270-300 seconds for a 100-frame 720p video, which is over 4 times faster than multi-step diffusion methods (e.g., SeedVR [(2025.05301)v1], UAV [zhou2024upscaleavideo], MGLD-VSR [yang2023mgldvsr], STAR [xie2025star]) running for 50 steps (~1200-2300 seconds).
- Performance: Experiments on synthetic, real-world (VideoLQ [chan2022investigating]), and AIGC datasets show SeedVR2 achieves comparable or superior performance to existing methods, particularly excelling in perceptual metrics like LPIPS, DISTS, NIQE, MUSIQ, and DOVER. A user study confirms SeedVR2's strong visual quality preference over baselines.
- Trade-offs: There is a perception-distortion trade-off governed by the L1 and feature matching loss weights. Higher weights improve fidelity but can lead to over-smoothing, while lower weights allow better visual quality driven by the GAN objective.
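To make the trade-off concrete, a generator objective of this shape could weight the three terms as below. This is a hedged sketch, not the paper's implementation: the loss form (non-saturating GAN loss plus L1 plus feature matching) and the default weights are assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(restored, target, fake_logits,
                   fake_feats, real_feats,
                   lambda_l1=1.0, lambda_fm=1.0):
    """Sketch of a weighted generator objective (weights are illustrative).

    Raising lambda_l1 / lambda_fm pushes the output toward pixel/feature
    fidelity (risking over-smoothing); lowering them lets the adversarial
    term dominate, favoring perceptual quality.
    """
    # Non-saturating GAN loss on the discriminator's fake logits.
    adv = F.softplus(-fake_logits).mean()
    # Pixel-wise L1 fidelity term.
    l1 = F.l1_loss(restored, target)
    # Feature matching: match discriminator features on real vs. fake.
    fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))
    return adv + lambda_l1 * l1 + lambda_fm * fm
```

Sweeping `lambda_l1` and `lambda_fm` up or down is exactly the knob that moves the model along the perception-distortion curve described above.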
- VAE Bottleneck: A significant practical limitation is the efficiency of the causal video VAE used for encoding/decoding. While the diffusion inference is one-step, the VAE takes over 95% of the total inference time for typical videos, highlighting it as a critical area for future optimization.
- Robustness: SeedVR2 is not always robust to very heavy degradations, large motions, or inputs with very light degradations (sometimes causing oversharpening).
- Deployment: Deploying such a large model requires substantial computational resources, although the one-step inference makes it more feasible for latency-sensitive applications compared to multi-step methods.
Real-World Applications:
SeedVR2 is designed for real-world video restoration tasks like super-resolution, denoising, and deblurring, aiming to upscale low-quality videos to high resolution with enhanced detail and realism. The one-step inference makes it potentially applicable in scenarios requiring faster processing, such as video editing pipelines, streaming services, or potentially even real-time enhancement (though the VAE bottleneck currently limits this). Its training on a mix of synthetic and real-world/AIGC data aims to improve generalization to diverse real-world conditions.
In summary, SeedVR2 represents a significant step towards efficient, high-quality video restoration by successfully translating a multi-step diffusion transformer into a one-step GAN-like model through careful architectural design (adaptive window attention) and robust adversarial training strategies (progressive distillation, improved losses). While the video VAE efficiency and robustness to extreme degradations remain challenges, SeedVR2 demonstrates the potential of one-step diffusion models for practical video restoration applications.