VSR-120K Dataset for Video Super-Resolution
- VSR-120K is a large-scale dataset comprising 120K videos (average length >350 frames, resolution above 1080p) and 180K high-quality images, curated with strict quality controls.
- It employs advanced filtering methods including LAION-Aesthetic scoring, MUSIQ ratings, and RAFT-based motion analysis to ensure high spatial and temporal fidelity.
- Its dual setup of videos and images facilitates joint training for improved reconstruction accuracy, enabling efficient real-time streaming video super-resolution.
The VSR-120K dataset is a large-scale, rigorously filtered corpus designed to advance research and model development in video super-resolution (VSR) under practical, real-world constraints. Introduced in the context of the FlashVSR framework (Zhuang et al., 14 Oct 2025), VSR-120K provides both the scale and quality necessary for joint training regimes that exploit spatial and temporal cues, supporting efficient, high-fidelity reconstruction with diffusion-based and streaming VSR architectures.
1. Scale, Structure, and Data Sources
VSR-120K consists of 120,000 videos and 180,000 high-quality images. Videos average over 350 frames in length, with resolutions strictly above 1080p; images have a minimum side exceeding 1024 pixels. Raw content is sourced from open repositories such as Videvo, Pexels, and Pixabay. This curation supplies extensive diversity across scene types, motion dynamics, and textural variety—attributes essential for robust VSR learning.
Quality control is central to the dataset’s design. LAION-Aesthetic scores screen for visual appeal and overall image quality, while MUSIQ supplies frame-level measures of global and local quality, so that compressed or artifact-laden samples are discarded. RAFT-based motion filtering ensures that retained videos exhibit sufficient motion, which is critical for learning temporally coherent representations.
| Component | Quantity | Quality Criterion |
|---|---|---|
| Videos | 120,000 | Avg. length >350 frames; >1080p; sufficient motion (RAFT) |
| Images | 180,000 | Min side >1024px; LAION-Aesthetic & MUSIQ filtered |
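
The snippet below sketches the shape of such a resolution- and quality-based gate. The thresholds and the scorer inputs are illustrative assumptions, not the values used to build VSR-120K, and the scores are assumed to come from external LAION-Aesthetic and MUSIQ predictors.

```python
# Minimal sketch of a spatial-quality gate of the kind described above.
# Thresholds are hypothetical; aesthetic/MUSIQ scores are assumed to be
# produced upstream by the respective predictors.
from dataclasses import dataclass


@dataclass
class QualityThresholds:
    min_short_side: int = 1024   # images: shorter side must exceed 1024 px
    min_aesthetic: float = 5.0   # hypothetical LAION-Aesthetic cutoff
    min_musiq: float = 60.0      # hypothetical MUSIQ cutoff


def keep_image(width: int, height: int,
               aesthetic: float, musiq: float,
               t: QualityThresholds = QualityThresholds()) -> bool:
    """Return True if a candidate image passes the spatial-quality gate."""
    if min(width, height) <= t.min_short_side:
        return False
    return aesthetic >= t.min_aesthetic and musiq >= t.min_musiq
```
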
2. Design Rationale and Intended Usage
The dataset is constructed as a backbone for joint image-video training in video super-resolution, particularly for diffusion and large-scale neural architectures. Single-frame images provide rich spatial supervision that mitigates compression artifacts and supplies fine texture detail, while video sequences offer temporally contiguous data required for exploiting motion cues and temporal consistency.
This dual setup enables models to learn low-resolution (LR) to high-resolution (HR) mappings with improved generalization. By strengthening both spatial and temporal learning, VSR-120K addresses the “train–inference gap” commonly observed in ultra-high-resolution VSR, where training on static images alone or on low-resolution videos can lead to generalization failures on real-world high-resolution streams.
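
A minimal sketch of this joint-supervision idea follows, assuming a PyTorch setup in which single images are treated as one-frame clips so the same LR→HR model consumes both sources. The bicubic downsampling and the 4× scale are placeholders for illustration, not the degradation model used in FlashVSR.

```python
# Joint image-video supervision sketch: images become one-frame clips so a
# single VSR model can be trained on both. Shapes and the degradation
# (bicubic downsampling) are illustrative assumptions.
import torch


def to_clip(frames: torch.Tensor) -> torch.Tensor:
    """Normalize a sample to shape (T, C, H, W); an image (C, H, W) becomes T=1."""
    return frames.unsqueeze(0) if frames.dim() == 3 else frames


def joint_batch(image_hr: torch.Tensor, video_hr: torch.Tensor, scale: int = 4):
    """Build (LR, HR) pairs from one HR image and one HR video clip."""
    pairs = []
    for hr in (to_clip(image_hr), to_clip(video_hr)):
        # Simple bicubic downsampling stands in for the real degradation model.
        lr = torch.nn.functional.interpolate(
            hr, scale_factor=1.0 / scale, mode="bicubic", align_corners=False)
        pairs.append((lr, hr))
    return pairs
```
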
A plausible implication is that this dataset configuration directly facilitates advances in streaming VSR frameworks, as joint training yields models resilient to both spatial and temporal domain shift.
3. Innovations in Curation and Filtering
VSR-120K introduces a robust multi-stage filtering protocol not seen in prior VSR corpora. LAION-Aesthetic scoring and MUSIQ ratings ensure high-quality supervision for spatial detail, while RAFT-based analysis retains only videos with sufficient mid- and high-magnitude motion. This multi-factor filtering yields a pool of samples with reliable ground truth for both fine-texture and motion-aware reconstruction.
For comparison, previous VSR datasets typically contain only a few thousand videos, impose less stringent resolution requirements, and do not employ motion-based quality filtering.
| Process | Goal | Method |
|---|---|---|
| LAION-Aesthetic / MUSIQ | High spatial quality; remove compression artifacts | Automated per-frame score prediction with thresholds |
| RAFT motion | Guarantee sufficient temporal dynamics | Per-video optical-flow magnitude analysis |
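
The sketch below illustrates the motion gate under stated assumptions: `estimate_flow` is a placeholder for a RAFT forward pass (for example torchvision's `raft_large`), frames are (C, H, W) tensors, and the motion threshold is made up for illustration rather than taken from the VSR-120K pipeline.

```python
# Motion-gate sketch: keep a video only if optical-flow magnitude, averaged
# over consecutive frame pairs, exceeds a threshold. estimate_flow() is an
# assumed callable wrapping a RAFT model; the threshold is illustrative.
import torch


def mean_flow_magnitude(flow: torch.Tensor) -> float:
    """flow: (2, H, W) displacement field -> mean per-pixel magnitude."""
    return flow.norm(dim=0).mean().item()


def keep_video(frames: list, estimate_flow, min_motion: float = 1.5) -> bool:
    """Average flow magnitude over consecutive frame pairs and threshold it."""
    if len(frames) < 2:
        return False
    mags = [mean_flow_magnitude(estimate_flow(a, b))
            for a, b in zip(frames[:-1], frames[1:])]
    return sum(mags) / len(mags) >= min_motion
```
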
This curation strategy results in a uniquely diverse dataset suitable for complex VSR training scenarios—particularly those involving large diffusion models and streaming frameworks.
4. Role in FlashVSR and Model Efficiency
In FlashVSR (Zhuang et al., 14 Oct 2025), VSR-120K enables a three-stage distillation pipeline and streaming one-step reconstruction. Training with VSR-120K allows FlashVSR to scale to ultra-high resolutions (768×1408, up to 1440p), maintain fidelity, and achieve real-time inference speeds (17 FPS on a single A100 GPU). Models trained on this dataset demonstrate up to a ~12× speedup over previous state-of-the-art diffusion-based VSR approaches.
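
A quick back-of-the-envelope calculation, using only the throughput and speedup figures quoted above, makes the practical impact concrete; the implied prior-model rate is a derived estimate, not a reported number.

```python
# Sanity check on the stated figures (derived estimates, not reported values).
fps_flashvsr = 17.0                         # reported real-time rate on one A100
speedup = 12.0                              # reported ~12x speedup over prior diffusion VSR
fps_prior = fps_flashvsr / speedup          # implied prior throughput, roughly 1.4 FPS
clip_frames = 350                           # average VSR-120K clip length
seconds_per_clip = clip_frames / fps_flashvsr   # roughly 20.6 s per average clip
print(f"Prior ~{fps_prior:.1f} FPS; one average clip ~{seconds_per_clip:.1f} s")
```
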
Specifically, the distribution and quality of VSR-120K’s content help overcome spatial blurring, temporal inconsistency, and artifacts that typically arise when training on smaller or less rigorously curated datasets. Joint image–video training improves reconstruction accuracy for static textures and dynamic content alike.
Empirical results indicate FlashVSR outperforms previous models in both quantitative accuracy and practical deployment throughput, largely due to the rich, diverse supervision VSR-120K provides.
5. Release and Reproducibility
The VSR-120K dataset, along with the FlashVSR codebase and pretrained weights, is designated for public release to foster future research in efficient, scalable, and high-performance VSR. The project page https://zhuang2002.github.io/FlashVSR serves as the central access point for researchers.
The dataset composition is explicitly stated in the source: “The final dataset consists of 120K videos (average length >350 frames) and 180K high-quality images.” This assures reproducibility of results and continuity across implementations targeting practical VSR.
6. Context and Implications
VSR-120K sets a new standard for empirical scale and rigor in video super-resolution benchmarks. By combining high diversity, strict filtering, and dual supervision (static images plus video), it advances the field both for traditional frame-based SR techniques and novel diffusion-based streaming approaches. This suggests that future VSR models will increasingly rely on such large, multimodal datasets for unlocking efficiency and generalizability at high resolutions.
A plausible implication is that other domains requiring fine-grained spatial and temporal learning—such as frame interpolation, motion deblurring, and automated video editing—will benefit from similar dataset design and curation methodologies.
In conclusion, VSR-120K’s unprecedented scale, stringent quality assurance, and joint training paradigm contribute critically to the development and deployment of next-generation streaming video super-resolution systems. Its release is anticipated to support ongoing innovation in efficient, scalable vision models.