Generated Video Dataset Overview
- Generated Video Datasets (GVDs) are curated collections of AI-synthesized video samples used to benchmark authenticity detection and quality evaluation.
- They employ diverse generation pipelines such as prompt engineering, multi-model synthesis, human curation, and post-processing to ensure content diversity, balanced labels, and consistent formatting.
- Applications span video forensics, anomaly detection, and dataset distillation, while challenges include temporal limitations and annotation variability.
A Generated Video Dataset (GVD) is a curated or constructed collection of video samples synthesized by one or more AI-driven generative models, distinct from naturally captured (real-world) footage. GVDs play a pivotal role in research on generative modeling, video forensics, anomaly detection, dataset distillation, and quality evaluation. These resources vary in scope from compact, domain-specific corpora to million-scale benchmarks spanning diverse content, modalities, and artifact annotations.
1. Taxonomy and Main Types of Generated Video Datasets
Generated Video Datasets can be categorized by their primary purpose and construction methodology:
- Forensics and Detection Benchmarks: Datasets such as GenVideo (Chen et al., 30 May 2024), GenBuster-200K (Wen et al., 19 May 2025), and GenVidBench (Ni et al., 20 Jan 2025) are balanced collections of real and AI-generated videos optimized for training, testing, and benchmarking video authenticity detection models. These benchmarks emphasize diversity in content and generator architectures, cross-source splits, and rigorous evaluation protocols.
- Artifact and Quality Annotation Corpora: GeneVA (Kang et al., 10 Sep 2025) and BrokenVideos (Lin et al., 25 Jun 2025) provide pixel-level or spatio-temporal annotation of visual artifacts. They enable training and evaluation of fine-grained artifact detectors and the study of generative model failure modes.
- Generated Data for Data Augmentation and Domain Coverage: Synthetic datasets constructed for task-specific augmentation (e.g., the GV-VAD GVD for video anomaly detection (Cai et al., 1 Aug 2025)) address rare-event scarcity and controlled domain balancing.
- Scenario and Simulation Synthesis: Domain-specific GVDs such as GenDDS (Fu et al., 28 Aug 2024) are constructed by sampling from a parameterized generative pipeline (weather, road type, traffic) to simulate rare or safety-critical contexts, commonly in robotics and autonomous driving (a scenario-sampling sketch follows this list).
- Dataset Distillation and Data Compression: Methods such as diffusion-based video dataset condensation (Li et al., 10 May 2025) generate compact synthetic datasets to replace or supplement real data with high information density.
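To make the parameterized-pipeline idea concrete, the sketch below samples scenario configurations over weather, road type, and traffic and renders each into a text-to-video prompt. The vocabularies, template wording, and function names are illustrative assumptions, not GenDDS's actual implementation.

```python
import itertools
import random

# Hypothetical scenario vocabularies: GenDDS parameterizes analogous axes
# (weather, road type, traffic), but these exact values are illustrative.
WEATHER = ["clear", "rain", "fog", "snow"]
ROAD_TYPE = ["highway", "urban street", "rural road"]
TRAFFIC = ["light", "moderate", "dense"]

def sample_scenarios(n: int, seed: int = 0) -> list[dict]:
    """Draw n scenario configurations uniformly from the parameter grid."""
    rng = random.Random(seed)
    grid = list(itertools.product(WEATHER, ROAD_TYPE, TRAFFIC))
    return [dict(zip(("weather", "road", "traffic"), rng.choice(grid)))
            for _ in range(n)]

def scenario_to_prompt(s: dict) -> str:
    """Render a fixed-template sentence for a text-to-video generator."""
    return (f"Dashcam footage on a {s['road']} in {s['weather']} weather "
            f"with {s['traffic']} traffic.")

for s in sample_scenarios(3):
    print(scenario_to_prompt(s))
```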
2. Dataset Construction Methodologies and Generation Pipelines
GVD construction follows diverse protocols depending on intended use:
- Prompt Engineering and Content Control: Most GVDs leverage structured or natural-language prompts. For example, GV-VAD defines “anomaly description elements” (viewpoint, location, subject, event) to systematically generate paired normal/anomalous videos (Cai et al., 1 Aug 2025); a prompt-templating sketch follows this list. GenDDS builds prompts as fixed-template sentences populated by semantic auto-tags from real datasets (Fu et al., 28 Aug 2024).
- Generator Diversity and Model Coverage: Large GVDs (e.g., GenVideo, GenBuster-200K, GenVidBench) aggregate outputs from numerous state-of-the-art models—covering text-to-video (T2V), image-to-video (I2V), GANs, diffusion architectures, and transformers—to maximize content and artifact diversity (Chen et al., 30 May 2024, Wen et al., 19 May 2025, Ni et al., 20 Jan 2025). Real vs. fake balancing, cross-generator train-test splits, and inclusion of commercial (“black-box”) generators enhance robustness.
- Post-processing and Standardization: Resizing (e.g., to 224×224 or 1024×1024), length clipping (e.g., to exactly 5 s in GenBuster-200K), frame-rate normalization (8–30 FPS), encoding (HEVC, MP4), and artifact equalization (JPEG compression, watermark cropping) are typical, with implementation choices made to harmonize heterogeneous generator outputs (Wen et al., 19 May 2025, Bai et al., 25 Mar 2024); a minimal standardization pipeline is sketched after this list.
- Human-in-the-Loop Curation: Quality control may involve direct expert or crowdsourced review. In GenBuster-200K, real and fake samples undergo human verification in closed benchmarks (Wen et al., 19 May 2025). For artifact annotation, protocols in GeneVA and BrokenVideos include bounding-box or pixel-level mask annotation and multi-pass validation (Kang et al., 10 Sep 2025, Lin et al., 25 Jun 2025).
- Synthetic Dataset Selection and Condensation: Compact and information-dense GVDs are created using latent video diffusion models coupled with diversity-promoting selection (e.g., VST-UNet, TAC-DT; (Li et al., 10 May 2025)). These pipelines start with large pools of generated videos and select class-representative, diverse subsets by optimizing entropy or representativeness in the learned embedding space (a subset-selection sketch follows this list).
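The following is a minimal sketch of GV-VAD-style prompt templating: paired normal/anomalous prompts that share every description element except the event, so the anomaly is the only controlled difference. The specific element vocabularies, event phrases, and template wording are assumptions for illustration.

```python
import random

# Illustrative "anomaly description elements": GV-VAD composes prompts from
# analogous axes (viewpoint, location, subject, event), but these specific
# values and the template wording are assumptions.
ELEMENTS = {
    "viewpoint": ["CCTV camera", "handheld camera"],
    "location": ["subway platform", "parking lot", "shopping mall"],
    "subject": ["a pedestrian", "a group of people"],
}
NORMAL_EVENTS = ["walking calmly", "waiting in line"]
ANOMALOUS_EVENTS = ["suddenly fighting", "collapsing to the ground"]

def build_pair(rng: random.Random) -> tuple[str, str]:
    """Return a (normal, anomalous) prompt pair that shares every element
    except the event."""
    ctx = {axis: rng.choice(values) for axis, values in ELEMENTS.items()}
    template = "{viewpoint} view of {subject} {event} at a {location}.".format
    return (template(event=rng.choice(NORMAL_EVENTS), **ctx),
            template(event=rng.choice(ANOMALOUS_EVENTS), **ctx))

normal, anomalous = build_pair(random.Random(0))
print(normal)
print(anomalous)
```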
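Standardization of heterogeneous generator outputs is commonly scripted around a transcoder such as ffmpeg. The sketch below shows one plausible pipeline using standard ffmpeg flags; the 1024×1024 resolution, 5 s duration, and HEVC codec mirror the GenBuster-200K settings reported above, while the 24 FPS target and directory layout are assumptions.

```python
import subprocess
from pathlib import Path

def standardize(src: Path, dst: Path,
                size: int = 1024, seconds: float = 5.0, fps: int = 24) -> None:
    """Re-encode one clip to a fixed resolution, duration, frame rate, and
    codec so heterogeneous generator outputs become directly comparable."""
    cmd = [
        "ffmpeg", "-y", "-i", str(src),
        "-t", str(seconds),             # clip length to a fixed duration
        "-vf", f"scale={size}:{size}",  # resize (square; ignores aspect ratio)
        "-r", str(fps),                 # normalize frame rate
        "-c:v", "libx265",              # HEVC encoding
        "-an",                          # drop any audio track
        str(dst),
    ]
    subprocess.run(cmd, check=True, capture_output=True)

out_dir = Path("standardized")          # hypothetical directory layout
out_dir.mkdir(exist_ok=True)
for clip in Path("raw_clips").glob("*.mp4"):
    standardize(clip, out_dir / clip.name)
```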
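The diversity-promoting selection step can be illustrated with greedy farthest-point sampling in an embedding space: repeatedly pick the candidate farthest from everything already selected. This is a generic stand-in for diversity-based selection, not the VST-UNet/TAC-DT procedure itself, and the embedding dimensions below are arbitrary.

```python
import numpy as np

def greedy_diverse_subset(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy farthest-point selection: repeatedly pick the candidate whose
    minimum distance to the already-selected set is largest, so the subset
    spreads out over the embedding space."""
    selected = [0]  # seed with an arbitrary first element
    min_dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Toy usage: 500 synthetic clip embeddings, keep the 20 most spread-out.
emb = np.random.default_rng(0).normal(size=(500, 128))
subset = greedy_diverse_subset(emb, k=20)
print(sorted(subset))
```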
3. Core Properties and Statistical Summaries
Key dataset attributes include scale, resolution, duration, model origin diversity, labeling, splits, and statistical balancing.
| Dataset | Scale (clips) | Generators | Annotation | Resolution/Format | Noteworthy Splits |
|---|---|---|---|---|---|
| GenVideo | 2.26M train | 10+ open, 10+ test | real/fake | 224–2048 px; mp4 | cross-generator, degraded (Chen et al., 30 May 2024) |
| GenBuster-200K | 200k | 5 open + 8 commercial | real/fake | 1024×1024 HEVC | train / in-domain / closed benchmark (Wen et al., 19 May 2025) |
| GenVidBench | 143k | 8, cross-source | real/fake + semantic tags | 224–1280 px, 8–30 FPS | train (cohort 1), test (cohort 2) (Ni et al., 20 Jan 2025) |
| BrokenVideos | 3.2k AI-gen | ~10 SOTA | pixel masks | 416×624–1760×1152 | train/val/balanced-clean (Lin et al., 25 Jun 2025) |
| GeneVA | 16k AI-gen | Sora, Pika, VC2 | artifacts (boxes) | mixed, ~5s, mp4 | random, human split (Kang et al., 10 Sep 2025) |
| GV-VAD GVD | 600 AI-gen | CogVideoX | normal/anomalous | 256–512×256–512, 16 frames | ~50% class balance (Cai et al., 1 Aug 2025) |
Notable properties include:
- Real vs. fake balancing: Most detection GVDs enforce near 1:1 ratios for training; some test sets skew toward more challenging fake-only or out-of-domain splits (Wen et al., 19 May 2025, Chen et al., 30 May 2024). A balancing sketch follows this list.
- Semantic tag diversity: GenVidBench annotates three axes (objects/actions/locations, ≤10 classes each) and ensures uniform population of each semantic cell (Ni et al., 20 Jan 2025).
- Artifact annotation density: BrokenVideos ensures 1–2 annotated artifact regions for 80% of clips, while GeneVA allows multi-label bounding boxes (mean ~1–1.5/video) (Lin et al., 25 Jun 2025, Kang et al., 10 Sep 2025).
- Scene and modality coverage: Datasets span human activity, animals, nature, urban environments, vehicles, and non-visual modalities where relevant. Prompts or tags are designed to stratify content and avoid dominance by any single scene type (Ma et al., 3 Feb 2024, Ni et al., 20 Jan 2025).
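A minimal sketch of the balancing logic described above: bucket clips into semantic cells keyed by the label and tag axes, then downsample every cell to a common size so cells are uniformly populated and the real/fake ratio lands at 1:1. The metadata field names and tag values are hypothetical.

```python
import random
from collections import Counter, defaultdict

def balance_cells(clips: list[dict], axes: tuple[str, ...], seed: int = 0):
    """Bucket clips by their semantic cell (label plus tag axes), then
    downsample every cell to the smallest cell's size, so cells are
    uniformly populated and the real/fake ratio stays at 1:1."""
    cells = defaultdict(list)
    for c in clips:
        cells[tuple(c[a] for a in axes)].append(c)
    cell_size = min(len(bucket) for bucket in cells.values())
    rng = random.Random(seed)
    return [c for bucket in cells.values()
            for c in rng.sample(bucket, cell_size)]

# Toy usage with hypothetical metadata fields.
rng = random.Random(0)
clips = [{"label": rng.choice(["real", "fake"]),
          "object": rng.choice(["person", "animal", "vehicle"]),
          "location": rng.choice(["indoor", "urban", "nature"])}
         for _ in range(10_000)]
balanced = balance_cells(clips, axes=("label", "object", "location"))
print(Counter(c["label"] for c in balanced))  # equal real/fake counts
```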
4. Labeling, Annotation, and Evaluation Tasks
GVDs serve as ground truth for multiple distinct tasks:
- Authenticity Detection: Binary real/fake labels for supervised learning, often with generator IDs for cross-generator robustness (Chen et al., 30 May 2024, Ni et al., 20 Jan 2025).
- Artifact Localization: BrokenVideos provides per-frame, per-region pixel masks highlighting generated artifacts. GeneVA assigns bounding boxes and textual descriptions to up to five artifact instances per video, using a five-category taxonomy (shape, motion, physics, visual, other) (Lin et al., 25 Jun 2025, Kang et al., 10 Sep 2025).
- Semantic Tagging: In GenVidBench, expert-vetted multi-axis semantic labels enable fine-grained per-category analysis of detection performance; GVF includes multi-hot spatial and temporal control attributes (Ni et al., 20 Jan 2025, Ma et al., 3 Feb 2024).
- Quality and Alignment Ratings: GeneVA collects 7-point Likert quality/alignment scores for each synthetically generated clip (Kang et al., 10 Sep 2025).
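The annotation records described above can be pictured with a small schema. The sketch below encodes GeneVA's five-category taxonomy, its cap of five artifact instances per video, and its 7-point quality/alignment ratings; the class and field names are illustrative assumptions, not the released file format.

```python
from dataclasses import dataclass, field

# The five-category artifact taxonomy comes from GeneVA; the record layout
# and field names here are illustrative assumptions.
ARTIFACT_TYPES = {"shape", "motion", "physics", "visual", "other"}

@dataclass
class ArtifactBox:
    frame_start: int            # first frame where the artifact is visible
    frame_end: int              # last frame (inclusive)
    xyxy: tuple[float, float, float, float]  # box in pixel coordinates
    category: str               # one of ARTIFACT_TYPES
    description: str            # free-text explanation of the failure

@dataclass
class VideoAnnotation:
    video_id: str
    generator: str              # e.g. "Sora", "Pika"
    quality_score: int          # 7-point Likert quality rating
    alignment_score: int        # 7-point prompt-alignment rating
    artifacts: list[ArtifactBox] = field(default_factory=list)

    def __post_init__(self):
        assert len(self.artifacts) <= 5, "GeneVA caps instances at 5/video"
        assert all(a.category in ARTIFACT_TYPES for a in self.artifacts)

ann = VideoAnnotation(
    video_id="clip_0001", generator="Pika",
    quality_score=4, alignment_score=5,
    artifacts=[ArtifactBox(0, 40, (12.0, 8.0, 96.0, 80.0), "motion",
                           "limb flickers between frames")])
```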
5. Evaluation Protocols, Metrics, and Baseline Results
Standard GVD evaluation protocols include:
- Cross-Generator and Cross-Source Testing: GenVideo and GenVidBench construct splits where training and test sets are disjoint at the generator or source level to prevent overfitting to a single model's artifacts (Chen et al., 30 May 2024, Ni et al., 20 Jan 2025); a split-construction sketch follows this list.
- Quality Metrics: Human-assessed ratings (e.g., quality Likert scores Q and alignment scores A (Kang et al., 10 Sep 2025)), automated average precision (AP) at various thresholds, and generative realism measures such as Fréchet Video Distance (FVD) and Inception Score (IS) are common (Fu et al., 28 Aug 2024, Kang et al., 10 Sep 2025).
- Fine-Grained Artifact Localization: Region similarity (J, intersection over union), boundary F-measure (F), and mean J&F are used for pixel-mask benchmarks (Lin et al., 25 Jun 2025); a metric sketch follows this list.
- Robustness Analysis: Degraded video classification applies real-world corruptions (compression, watermark, crops) and reports robustness drops (ΔACC and ΔAP) (Chen et al., 30 May 2024).
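A cross-generator split (the first bullet above) reduces to partitioning records so that the train and test generator pools are disjoint. A minimal sketch, assuming each record carries a generator field:

```python
def cross_generator_split(records: list[dict],
                          test_generators: set[str]) -> tuple[list, list]:
    """Partition records so train and test draw from disjoint generator
    pools; a detector then cannot pass the benchmark by memorizing one
    model's characteristic artifacts."""
    train = [r for r in records if r["generator"] not in test_generators]
    test = [r for r in records if r["generator"] in test_generators]
    return train, test

# Toy usage with hypothetical generator names.
records = [{"id": i, "generator": g}
           for i, g in enumerate(["ModelA", "ModelB", "ModelC", "ModelA"])]
train, test = cross_generator_split(records, test_generators={"ModelC"})
print(len(train), len(test))  # 3 1
```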
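For the localization metrics in the third bullet, a simplified computation is sketched below: J is mask intersection-over-union, and F is approximated by matching boundary pixels within a small pixel tolerance via morphological dilation (the standard protocol uses a comparable bipartite boundary matching). Mask shapes and the tolerance value are arbitrary.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary_f_measure(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Simplified F: precision/recall of boundary pixels matched within
    `tol` pixels (via dilation), approximating the standard protocol."""
    def boundary(mask):
        return mask & ~binary_erosion(mask)
    bp, bg = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    if not bp.any() and not bg.any():
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = (bp & binary_dilation(bg, struct)).sum() / max(bp.sum(), 1)
    recall = (bg & binary_dilation(bp, struct)).sum() / max(bg.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy masks offset by two pixels.
pred = np.zeros((64, 64), bool); pred[10:30, 10:30] = True
gt = np.zeros((64, 64), bool); gt[12:32, 12:32] = True
j, f = region_similarity(pred, gt), boundary_f_measure(pred, gt)
print(f"J={j:.3f}  F={f:.3f}  J&F={(j + f) / 2:.3f}")
```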
Baseline detector performance—using SOTA video backbones and MLLMs—shows a marked generalization gap from in-generator to cross-generator contexts (e.g., MViT V2 cross-generator top-1 ACC: 79.9% (Ni et al., 20 Jan 2025); DeMamba improves robustness and generalizability by ~9–14 points (Chen et al., 30 May 2024)).
6. Applications and Limitations
GVDs underpin objectives across video authenticity, anomaly detection, generative model improvement, and compact data distillation:
- Forensics Tools: Benchmarks drive the development and evaluation of detectors (AIGVDet, DeMamba, BusterX), focusing on both generalization and explainability (Bai et al., 25 Mar 2024, Chen et al., 30 May 2024, Wen et al., 19 May 2025).
- Annotation-Guided Model Correction: Pixel-level artifacts from BrokenVideos or bounding-boxes in GeneVA can become inpainting priors for model fine-tuning (Lin et al., 25 Jun 2025, Kang et al., 10 Sep 2025).
- Synthetic Data Bootstrapping: Augmenting rare event scenarios, as in GV-VAD, yields significant gains in data-scarce settings compared to real-only training (Cai et al., 1 Aug 2025).
- Dataset Economy: Condensed or distilled GVDs (via VST-UNet, TAC-DT) retain performance while drastically reducing storage/training overhead (Li et al., 10 May 2025).
Limitations:
- Temporal Limits: Most generated video clips are short (≤5–8 seconds) due to computational constraints; scaling to minute-long or longer videos remains unresolved (Fu et al., 28 Aug 2024, Lin et al., 25 Jun 2025).
- Subjective Annotation Variability: Current datasets rely on single-instance or single-annotator labels; future expansions may adopt repeated or probabilistic annotations to quantify inter-annotator reliability (Kang et al., 10 Sep 2025).
- Model and Content Drift: Rapid innovation in generative methods and emerging modalities (e.g., audio, multimodal) necessitate ongoing updates for comprehensive coverage (Ni et al., 20 Jan 2025).
- Scene Labeling Granularity: Some resources release limited semantic tags or no per-scene breakdowns due to annotation complexities or privacy (Wen et al., 19 May 2025).
7. Access, Licensing, and Community Resources
- Public Repositories and Download: Most major GVDs (GenVideo, GenVidBench, BrokenVideos, GVF) are released with download scripts, metadata, and (where relevant) code for dataset parsing and baseline evaluation (Chen et al., 30 May 2024, Ni et al., 20 Jan 2025, Lin et al., 25 Jun 2025, Ma et al., 3 Feb 2024).
- Licensing: Licensing varies. BrokenVideos is released under the non-commercial CC BY-NC 4.0 license, with contact required for commercial use (Lin et al., 25 Jun 2025); licensing for GenBuster-200K, GenVideo, and some forensics datasets is to be announced at release (Wen et al., 19 May 2025).
- Tooling: Some datasets/benchmarks provide APIs or codebases to facilitate integration into research workflows; e.g., AIGVDet code and list files (Bai et al., 25 Mar 2024).
- Quality Assurance: Validation protocols include manual review and spot-checking, with error rates typically maintained under 1% (Ni et al., 20 Jan 2025, Wen et al., 19 May 2025).
Generated Video Datasets are foundational for advancing the evaluation, detection, and improvement of generative video modeling. The field is characterized by rapid expansion in both the scale and compositional complexity of these resources, driven by progress in synthetic video generation, a demand for robust forensics, and increasing emphasis on fine-grained quality assessment and artifact localization.