Video Preference Dataset Overview
- Video preference datasets are curated collections of videos, prompts, and pairwise annotations that capture human judgments on alignment criteria.
- They support reward modeling and reinforcement learning by providing fine-grained labels for safety, aesthetics, and semantic accuracy.
- Applications include video generation alignment, quality evaluation, and content moderation, with reported gains of roughly 2–17.6% in strict preference accuracy on downstream tasks.
A video preference dataset is a curated collection of videos, prompts, and pairwise annotations that encode human or automatically synthesized judgments about which generated or retrieved video segments are preferred according to specific alignment criteria. These datasets are essential for training, evaluating, and aligning video generation models and video large language models (Video-LLMs) to reflect complex human values, semantic accuracy, safety, aesthetic preferences, temporal and spatial grounding, and other fine-grained video understanding dimensions.
1. Conceptual Framework and Motivation
Video preference datasets operationalize the notion of pairwise alignment between videos and textual prompts, enabling robust reward modeling and fine-tuning procedures. The motivating factors include:
- Explicit measurement of human judgment, encompassing both positive preferences (which video better fulfills the prompt) and negative avoidance (which video best avoids harmful or undesirable content, such as violence or discrimination).
- Decoupling of critical dimensions such as “helpfulness” and “harmlessness,” as exemplified by SafeSora’s dual schema for text-to-video safety alignment (Dai et al., 2024).
- Automated or human-in-the-loop construction for domains where manual annotation is scarce, unreliable, or prohibitively expensive.
These datasets underpin contemporary research in reward modeling, direct preference optimization (DPO), reinforcement learning from human feedback (RLHF/RLAIF), and video moderation.
2. Data Composition and Annotation Schemas
Video preference datasets vary in scale, annotation protocol, modality coverage, and granularity. The following table summarizes major recent datasets:
| Dataset | Pair Count | Annotation Source | Dimensions / Labels |
|---|---|---|---|
| SafeSora | 51,691 | Human (crowdworker) | Helpfulness (4 subdims), Harmlessness (12 tags) |
| VideoDPO | 10,000 | OmniScore (auto) | Visual/semantic weighted score, best-vs-worst |
| HuViDPO | ≤25,000 | Human (small-scale) | Pairwise aesthetic-quality judgments |
| MJ-BENCH-VIDEO | 5,421 | Human + GPT-4 | 28 fine-grained criteria (5 aspects) |
| VideoPASTA | 7,020 | Synthetic/auto | Spatial, Temporal, Cross-frame adversaries |
| VistaDPO-7k | 7,200 | Manual + GPT-4 | Multi-level: instance, temporal, perceptive |
SafeSora’s schema explicitly structures annotations into primary dimensions and sub-categories, with crowdworkers labeling video-pair comparisons for both requested helpfulness components (instruction following, correctness, informativeness, aesthetics) and for 12 multi-hot harm tags (adult, violence, drugs, discrimination, etc.) (Dai et al., 2024). Similarly, MJ-BENCH-VIDEO covers 28 detailed criteria across alignment, safety, fineness, coherence/consistency, and bias/fairness, rated per pair by multiple raters (Tong et al., 3 Feb 2025). Synthetic datasets such as VideoPASTA and VistaDPO supplement these frameworks by constructing adversarial pairs and hierarchical groundings without human annotation (Kulkarni et al., 18 Apr 2025, Huang et al., 17 Apr 2025).
3. Automated Preference Scoring and Synthetic Data Generation
Recent research exploits automatic preference scoring systems as alternatives to manual annotation. VideoDPO introduces OmniScore, a scalar fused evaluation over seven axes including intra-frame visual quality, temporal smoothness, subject consistency, and semantic alignment, leveraging pre-trained model outputs and weighted aggregation (Liu et al., 2024). Preference pairs are generated by uniformly sampling prompts, generating N videos per prompt, scoring each video via OmniScore, and forming "best-versus-worst" comparisons. The re-weighting strategy further emphasizes rare, sharply distinctive pairs through empirical score histograms.
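The best-versus-worst sampling and rarity-based re-weighting described above can be sketched as follows. This is a schematic, not VideoDPO's actual implementation: `generate` and `score` stand in for a text-to-video model and an OmniScore-like scalar scorer, and the histogram weighting is a simplified illustration of the idea.

```python
import random
from collections import Counter

def make_preference_pairs(prompts, generate, score, n_videos=8):
    """Schematic best-vs-worst pair construction (VideoDPO-style).

    `generate(prompt)` and `score(video)` are placeholders for a
    text-to-video model and an OmniScore-like scalar scorer.
    """
    pairs = []
    for prompt in prompts:
        # Sample N candidate videos per prompt, score each one.
        videos = [generate(prompt) for _ in range(n_videos)]
        ranked = sorted(videos, key=score)
        best, worst = ranked[-1], ranked[0]
        pairs.append({
            "prompt": prompt,
            "win_path": best,
            "lose_path": worst,
            "score_win": score(best),
            "score_lose": score(worst),
        })
    # Re-weighting sketch: bin the score gaps and weight each pair
    # inversely to its bin frequency, emphasizing rare, sharply
    # distinctive comparisons.
    gaps = [round(p["score_win"] - p["score_lose"], 1) for p in pairs]
    freq = Counter(gaps)
    for p, g in zip(pairs, gaps):
        p["weight"] = len(pairs) / (len(freq) * freq[g])
    return pairs
```

In practice the scorer would aggregate the seven OmniScore axes (visual quality, temporal smoothness, subject consistency, semantic alignment, etc.) into the scalar used for ranking.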
Synthetic data pipelines, demonstrated in VideoPASTA, TimeWarp, and Temporal Preference Optimization (TPO), produce challenging adversarial pairs by manipulating video sampling rates, permuting scene order, or masking critical content. Temporal grounding datasets incorporate preference judgments for localized and comprehensive event reasoning, bootstrapped via LLM-based captioning, question generation, and model-in-context negative sampling (Li et al., 23 Jan 2025, Vani et al., 4 Oct 2025).
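The three adversarial manipulations mentioned above (resampling rates, permuting scene order, masking content) can be illustrated on a list of frame identifiers. This is a minimal sketch of the idea; the actual VideoPASTA/TPO pipelines operate on decoded video tensors, and the function and parameter names here are hypothetical.

```python
import random

def temporal_adversary(frames, mode="shuffle", rate=2, mask_frac=0.3):
    """Build an adversarial 'rejected' clip from a 'chosen' clip, in the
    spirit of synthetic preference pipelines. `frames` is a list of
    frame identifiers."""
    if mode == "shuffle":  # permute scene/segment order
        segments = [frames[i:i + 4] for i in range(0, len(frames), 4)]
        random.shuffle(segments)
        return [f for seg in segments for f in seg]
    if mode == "resample":  # distort the temporal sampling rate
        return frames[::rate]
    if mode == "mask":  # drop a fraction of (potentially critical) content
        k = int(len(frames) * mask_frac)
        dropped = set(random.sample(range(len(frames)), k))
        return [f for i, f in enumerate(frames) if i not in dropped]
    raise ValueError(f"unknown mode: {mode}")
```

Each manipulated clip is paired with its source clip as a (chosen, rejected) preference example, giving the model targeted negatives for spatial, temporal, and cross-frame reasoning.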
4. Preference Labeling, Data Formats, and Quality Control
Most video preference datasets employ binary (winner/loser) or multi-label schemas, encoded in tabular or JSON formats for downstream DPO or RLHF use. SafeSora’s pairwise labeling combines categorical and multi-hot harm tags, structured as follows:
```json
{
  "prompt": "A cat jumps across a river.",
  "video_A": "cat_jump_A.mp4",
  "video_B": "cat_jump_B.mp4",
  "helpfulness": {"A": {…}, "B": {…}},   // subdim scores
  "harmlessness": {"A": [tags…], "B": [tags…]},
  "preference": "A"
}
```
Automatic datasets record chosen and rejected video paths, OmniScores, or failure mode types; VideoDPO stores preferences as:
```json
{
  "prompt": "...",
  "win_path": "path/to/best.mp4",
  "lose_path": "path/to/worst.mp4",
  "score_win": 0.85,
  "score_lose": 0.60,
  "weight": 1.25
}
```
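A minimal loader for records of this shape might validate each pair before training. The field names follow the example above; actual dataset releases may use different schemas, so this is an illustrative sketch rather than an official loader.

```python
import json

REQUIRED = {"prompt", "win_path", "lose_path", "score_win", "score_lose"}

def load_pairs(path):
    """Load a JSONL preference file and sanity-check each record:
    all required fields present, and the winner's score is not
    below the loser's."""
    pairs = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            rec = json.loads(line)
            missing = REQUIRED - rec.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
            if rec["score_win"] < rec["score_lose"]:
                raise ValueError(f"line {lineno}: scores contradict preference")
            rec.setdefault("weight", 1.0)  # default weight when unweighted
            pairs.append(rec)
    return pairs
```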
Quality assurance includes inter-annotator agreement (e.g., Fleiss' κ ≈ 0.92 for VistaDPO-7k (Huang et al., 17 Apr 2025)), filtering of ambiguous or low-signal pairs (as in MMAIP-V (Yi et al., 2024)), and post-filtering by LLM referees.
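The Fleiss' κ statistic used to quantify inter-annotator agreement can be computed directly from a per-item category count matrix; a compact implementation:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-annotator agreement.

    `ratings` is an items x categories count matrix:
    ratings[i][j] = number of raters assigning item i to category j.
    Every item must be rated by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Per-item observed agreement P_i
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement P_e from category marginals
    n_cats = len(ratings[0])
    totals = [sum(row[j] for row in ratings) for j in range(n_cats)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

Values near 1 indicate near-perfect agreement (as with VistaDPO-7k's reported κ ≈ 0.92), while values near 0 indicate agreement no better than chance.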
5. Model Alignment and Optimization Protocols
Video preference datasets provide the backbone for preference-based model alignment. Direct Preference Optimization (DPO) is the prevailing objective, defined as:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,v_w,\,v_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(v_w \mid x)}{\pi_{\mathrm{ref}}(v_w \mid x)} - \beta \log \frac{\pi_\theta(v_l \mid x)}{\pi_{\mathrm{ref}}(v_l \mid x)}\right)\right]$$

where $x$ is the prompt, $v_w$ and $v_l$ are the preferred and rejected videos, $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic sigmoid, and $\beta$ controls deviation from the reference.
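The per-pair DPO loss can be sketched in a few lines of plain Python; the `weight` argument illustrates how VideoDPO-style pair re-weighting slots into the objective (names here are illustrative, and real implementations operate on token- or latent-level log-likelihood tensors):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, weight=1.0):
    """Per-pair DPO loss: -weight * log sigmoid(beta * implicit reward margin).

    logp_w / logp_l are log-likelihoods of the winning / losing video
    under the policy; ref_logp_* are the same under the frozen
    reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -weight * math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; the loss shrinks as the policy assigns relatively more likelihood to the preferred video.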
Hierarchical approaches (VistaDPO (Huang et al., 17 Apr 2025)), multi-task losses (MJ-VIDEO (Tong et al., 3 Feb 2025)), and extrapolated iterative refinement (Iter-W2S-RLAIF (Yi et al., 2024)) advance the objective’s scope, often integrating criteria-level regression, aspect preference routing, or token-level KL divergences. Preference data enable model adaptation for video generation and video QA, improving human-aligned accuracy and semantic fidelity (e.g., MJ-VIDEO's +17.6% strict preference score over InternVL2-2B (Tong et al., 3 Feb 2025)).
6. Applications and Empirical Impact
Video preference datasets serve several key functions:
- Reward model training for fine-grained video generation evaluation and filtering (Tong et al., 3 Feb 2025).
- Alignment of diffusion models and LVMs for instruction-following and harmlessness (Dai et al., 2024, Liu et al., 2024).
- Robust video QA and captioning, mitigating hallucination and enforcing spatial/temporal grounding (Huang et al., 17 Apr 2025, Kulkarni et al., 18 Apr 2025).
- Personalized highlight detection and moment retrieval, as in HIPPO-Video’s simulated watch histories (Lee et al., 22 Jul 2025).
- Safety moderation, bias audit, and equity-driven content validation (e.g., SafeSora, MJ-BENCH-VIDEO).
- Benchmarking algorithms against multi-dimensional human-centric standards.
Empirical results quantify substantial model gains: SafeSora, MJ-VIDEO, and VideoPASTA each report preference learning improvements of 2–17.6% strict accuracy across multiple video understanding tasks (Dai et al., 2024, Tong et al., 3 Feb 2025, Kulkarni et al., 18 Apr 2025). Synthetic preference signals consistently bridge the gap between model-centric and human-centric video reasoning.
7. Limitations, Trends, and Future Directions
SafeSora, MJ-BENCH-VIDEO, and HIPPO-Video note several unresolved limitations:
- Human annotation bottleneck, especially for subjective criteria such as aesthetics or personalized preferences.
- Modal bias in sampling (over-emphasis on visual quality versus semantics, context, or safety).
- Underrepresentation of long-form temporal dynamics, multi-session behaviors, and multimodal (audio, engagement metrics) interactions.
- Potential simulation artifacts or divergence from true human attention, particularly in LLM-simulated datasets such as HIPPO-Video (Lee et al., 22 Jul 2025).
Current trends include the integration of multi-modal, multi-task, and hierarchical grounding, adversarial failure mode coverage, refinement of reward functions, and unbiased referee-driven evaluation. Prospective enhancements involve privacy-safe real user logging, reinforcement learning-based simulator realism, richer semantic and temporal coverage, and the publication of standardized quality metrics.
A plausible implication is that as preference datasets become more comprehensive and multifaceted, the alignment of generative models to nuanced human values and video semantics will progressively improve, offsetting limitations inherent in scale, simulation, and annotation noise.