VBench++ Benchmark Suite

Updated 25 June 2026
  • VBench++ is a comprehensive benchmark suite that systematically evaluates video generative models using 16 disentangled quality dimensions and tailored prompts.
  • It decomposes video quality into intrinsic factors (temporal and frame-wise) and video–condition consistency (semantics and style), leveraging advanced metrics like DINO, CLIP, RAFT, and ViCLIP.
  • The suite also assesses trustworthiness by evaluating culture fairness, gender and skin tone bias, and safety, with open-source code and an evolving community leaderboard.

VBench++ is a comprehensive, open-source, human-aligned benchmark suite for systematic evaluation of video generative models. It extends the original VBench framework to provide multi-dimensional, disentangled, and hierarchical evaluation across both text-to-video (T2V) and image-to-video (I2V) synthesis tasks, complemented by trustworthiness analysis. VBench++ combines a large suite of tailored prompts, a battery of automated metrics with formal definitions, human-labeled pairwise preference data, and versatile leaderboard protocols to holistically assess technical performance and social reliability of generative systems (Huang et al., 2024).

1. Decomposition of Video Generation Quality

VBench++ hierarchically decomposes "video generation quality" into 16 formally defined, disentangled dimensions, each with a dedicated evaluation protocol and prompt suite. The top-level split distinguishes between intrinsic Video Quality (independent of prompt) and Video–Condition Consistency (alignment to the conditioning input).

Video Quality is further divided:

  • Temporal Quality (5 dimensions): subject consistency, background consistency, temporal flickering, motion smoothness, dynamic degree.
  • Frame-Wise Quality (2 dimensions): aesthetic quality, imaging quality.

Video–Condition Consistency comprises:

  • Semantics (6 dimensions): object class, multiple objects, human action, color, spatial relationship, scene.
  • Style (2 dimensions): appearance style, temporal style.
  • An additional Overall Consistency dimension measures combined semantics and style prompt alignment.
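
As a quick sanity check on the taxonomy above, the hierarchy can be written down as a nested structure and its leaves counted. This is only an illustrative sketch: the dictionary layout and the names `HIERARCHY`, `flatten`, and `ALL_DIMENSIONS` are assumptions made for this example and are not taken from the VBench++ code base.

```python
# The 16 dimensions as listed above, grouped by the two top-level branches.
# The nested-dict layout and all identifiers are illustrative assumptions.
HIERARCHY = {
    "Video Quality": {
        "Temporal Quality": [
            "subject consistency",
            "background consistency",
            "temporal flickering",
            "motion smoothness",
            "dynamic degree",
        ],
        "Frame-Wise Quality": ["aesthetic quality", "imaging quality"],
    },
    "Video-Condition Consistency": {
        "Semantics": [
            "object class",
            "multiple objects",
            "human action",
            "color",
            "spatial relationship",
            "scene",
        ],
        "Style": ["appearance style", "temporal style"],
        "Overall": ["overall consistency"],
    },
}


def flatten(hierarchy):
    """Return (top_level, group, dimension) triples for every leaf dimension."""
    return [
        (top, group, dim)
        for top, groups in hierarchy.items()
        for group, dims in groups.items()
        for dim in dims
    ]


ALL_DIMENSIONS = flatten(HIERARCHY)
```

Counting the leaves (5 + 2 + 6 + 2 + 1) recovers the 16 dimensions stated above.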

Each dimension possesses: (i) a specific definition, (ii) an automated metric (often leveraging DINO, CLIP, RAFT, ViCLIP, among others), and (iii) a suite of ~100 hand-crafted prompts that probe the relevant property in isolation, ensuring minimal metric confounding (Huang et al., 2024).

Video Quality Dimensions

| Name | Definition | Metric/Feature Used |
| --- | --- | --- |
| Subject Consistency | Stability of subject across frames | DINO ViT cosine |
| Background Consistency | Background appearance invariance | CLIP image embedding |
| Temporal Flickering | High-frequency frame drift | Mean L₁ diff, norm. |
| Motion Smoothness | Plausibility of trajectory | AMT interp error |
| Dynamic Degree | Magnitude of movement | RAFT optical flow |
| Aesthetic Quality | Photographic appeal of frames | LAION-predictor |
| Imaging Quality | Technical defects in frames | MUSIQ score |

Video–Condition Consistency Dimensions

| Name | Definition | Metric/Feature Used |
| --- | --- | --- |
| Object Class | Correct class generation | GRiT detection |
| Multiple Objects | Co-occurrence of all objects | GRiT detection |
| Human Action | Prompt-action execution | UMT classifier |
| Color | Color attribute fidelity | GRiT caption matching |
| Spatial Relationship | Object layout accuracy | Rule-based box coordinates |
| Scene | Global scene category fidelity | Tag2Text caption match |
| Appearance Style | Visual/art style correspondence | CLIP text-image cosine |
| Temporal Style | Camera motion style | ViCLIP video-text cosine |
| Overall Consistency | Combined semantics + style | ViCLIP prompt similarity |

For I2V, three additional dimensions are evaluated: per-frame consistency and scene consistency with the input image (using DreamSim), and camera motion alignment (using CoTracker).

2. Prompt Suite and Category Structure

For each of the 16 main dimensions, VBench++ employs approximately 100 specialized, handcrafted prompts designed to isolate that aspect of video synthesis without confounding factors. Example prompts include "a bear running across a field" for subject consistency, "a red ball to the left of a blue cube" for spatial relationship, and "a sunflower in Van Gogh style" for style.

Prompts are also organized into eight content categories—Animal, Architecture, Food, Human, Lifestyle, Plant, Scenery, Vehicles—to support per-category analysis. For I2V, the suite includes 1,000+ curated high-resolution images from Pexels/Pixabay and an adaptive cropping pipeline supporting 1:1 and 16:9 formats, ensuring fair aspect ratio coverage (Huang et al., 2024).
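
To make the aspect-ratio handling concrete, the sketch below computes the largest centered crop with a given target ratio. It is a deliberately simplified stand-in: the source describes the pipeline as adaptive but does not specify how the crop window is chosen, so the fixed centering and the helper name `center_crop_box` are assumptions for illustration only.

```python
def center_crop_box(width, height, ratio_w, ratio_h):
    """Largest centered crop with aspect ratio ratio_w:ratio_h.

    Returns (left, top, right, bottom) in pixel coordinates.
    Assumption: the crop is always centered; the real pipeline is
    described as adaptive and may place the window differently.
    """
    # Integer cross-multiplication avoids floating-point ratio comparisons.
    if width * ratio_h > height * ratio_w:
        # Image is wider than the target ratio: keep full height, trim width.
        crop_h = height
        crop_w = height * ratio_w // ratio_h
    else:
        # Image is taller than (or equal to) the target: keep full width.
        crop_w = width
        crop_h = width * ratio_h // ratio_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h
```

For example, `center_crop_box(1920, 1080, 1, 1)` keeps the full height and trims the sides equally, while a 16:9 request on an image that is already 16:9 returns the whole frame.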

3. Automated Metric Definitions and Evaluation Protocol

Each dimension's metric is precisely defined. Notable examples:

  • Subject Consistency:

$$\mathrm{SubjCons} = \frac{1}{T-1}\sum_{t=1}^{T-1} \cos(f_t, f_{t+1})$$

where $f_t$ is the DINO feature of frame $t$.

  • Temporal Flickering:

$$\mathrm{Flicker} = 1 - \frac{1}{T-1}\sum_{t=1}^{T-1} \frac{\|I_{t+1}-I_t\|_1}{\mathrm{max}_{\mathrm{norm}}}$$

  • Dynamic Degree:

$$\mathrm{Dynamics} = \frac{1}{T-1}\sum_{t=1}^{T-1}\|F_t\|_1$$

where $F_t$ is the RAFT flow magnitude.
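
The three formulas above can be sketched in plain Python, assuming the features, frames, and flow magnitudes have already been extracted. Several details here are assumptions rather than facts from the source: the function names, the representation of each frame as a flat list of pixel intensities, and in particular the choice of `max_value * number_of_pixels` as the normalizer in the flickering score, since the text does not define that constant.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def subject_consistency(features):
    """Mean cosine similarity between consecutive per-frame feature vectors."""
    pairs = list(zip(features, features[1:]))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def temporal_flickering(frames, max_value=255.0):
    """1 minus the mean normalized L1 difference between consecutive frames.

    Assumption: each frame is a flat list of intensities in [0, max_value]
    and the normalizer is max_value * number_of_pixels.
    """
    pairs = list(zip(frames, frames[1:]))
    total = 0.0
    for prev, nxt in pairs:
        l1 = sum(abs(b - a) for a, b in zip(prev, nxt))
        total += l1 / (max_value * len(prev))
    return 1.0 - total / len(pairs)


def dynamic_degree(flow_magnitudes):
    """Mean over frame pairs of the L1 norm of per-pixel flow magnitudes.

    flow_magnitudes holds one flat list per consecutive frame pair (T - 1 lists).
    """
    return sum(sum(abs(m) for m in f) for f in flow_magnitudes) / len(flow_magnitudes)
```

With this normalizer, a video whose frames are all identical scores 1.0 on flickering, and a video that alternates between all-black and all-white frames scores 0.0.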

Model evaluation proceeds via: (1) video generation at native model parameters, (2) execution of metric scripts per dimension, (3) aggregation from video to prompt to global level, and (4) optional human preference annotation.
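
Step (3) can be sketched as a two-level average followed by a global summary. The record layout, the function name `aggregate`, and the use of an unweighted mean across dimensions for the global score are all assumptions made for illustration; the source does not state how dimensions are weighted or normalized.

```python
from collections import defaultdict
from statistics import mean


def aggregate(records):
    """Aggregate per-video scores to prompt, dimension, and global level.

    records: iterable of (dimension, prompt, score), one entry per video.
    Returns (dimension_scores, overall).
    Assumption: every level uses a plain unweighted mean.
    """
    # Video -> prompt: collect all video scores for each (dimension, prompt).
    per_prompt = defaultdict(list)
    for dimension, prompt, score in records:
        per_prompt[(dimension, prompt)].append(score)

    # Prompt -> dimension: average the prompt-level means.
    per_dimension = defaultdict(list)
    for (dimension, _prompt), scores in per_prompt.items():
        per_dimension[dimension].append(mean(scores))
    dimension_scores = {d: mean(s) for d, s in per_dimension.items()}

    # Dimension -> global: unweighted mean across dimensions (assumption).
    overall = mean(list(dimension_scores.values()))
    return dimension_scores, overall
```

Averaging at the prompt level first keeps a prompt with many sampled videos from dominating its dimension score.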

Automated metrics correlate strongly with human preference: Spearman $\rho > 0.85$ (statistically significant, $p \ll 0.01$) across all dimensions, supporting their alignment (Huang et al., 2024).
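
For readers unfamiliar with the statistic, Spearman's $\rho$ is the Pearson correlation of the two rank vectors. The dependency-free sketch below shows the computation on hypothetical per-model scores; the function names are illustrative, and the significance test ($p$-value) is not reproduced here.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

In this setting, `x` would hold one metric-based score per model and `y` the corresponding human-based score for the same dimension.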

4. Human Annotation and Alignment

VBench++ provides a human-labeled dataset featuring ≈48,000 pairwise video comparisons, each focused on a single dimension. For every prompt, multiple models produce several videos; annotators conduct side-by-side preference judgments, with video, order, and rating randomized and instructions dimension-specific. Ties are permitted. Each pair is typically labeled by several annotators and subjected to quality control via pre-trials, error thresholding (<10%), and periodic re-labeling.

The principal aggregation statistic is win-ratio (model’s fraction of preferred outcomes), computed both for metric-based and human-annotation-based assessments. High human-metric alignment substantiates the benchmark’s validity for research and model development.
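
A minimal sketch of the win-ratio computation from pairwise outcomes follows. The tuple layout and the function name `win_ratios` are hypothetical, and counting a tie as half a win for each side is an assumption: the source says ties are permitted but does not say how they enter the ratio.

```python
from collections import defaultdict


def win_ratios(comparisons):
    """Per-model win ratio from pairwise comparisons.

    comparisons: iterable of (model_a, model_b, outcome) where outcome is
    "a", "b", or "tie".
    Assumption: a tie contributes 0.5 wins to each model.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, outcome in comparisons:
        if outcome not in ("a", "b", "tie"):
            raise ValueError(f"unknown outcome: {outcome!r}")
        games[model_a] += 1
        games[model_b] += 1
        if outcome == "a":
            wins[model_a] += 1.0
        elif outcome == "b":
            wins[model_b] += 1.0
        else:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {model: wins[model] / games[model] for model in games}
```

The same function applies to both sources of judgments: feed it human labels to get human win ratios, or derive the outcome of each pair by comparing the two videos' metric scores to get metric-based win ratios.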

5. Trustworthiness and Fairness Dimensions

VBench++ evaluates not only conventional generation metrics but also four trustworthiness aspects:

  • Culture Fairness: Average cosine similarity between ViCLIP representations of videos and cultural prompts across nine cultures and multiple scenarios.
  • Gender Bias: Distributional parity in gender inference over demographically neutral prompts, scored via deviation from uniformity.
  • Skin Tone Bias: Balanced representation among three merged Fitzpatrick groups, again quantified via deviation from uniformity.
  • Safety: Fraction of videos without flagged unsafe content, detected by ensemble classifiers (NudeNet, SD Safety Checker, Q16).
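
Two of these scores lend themselves to a short sketch. The source specifies "deviation from uniformity" and "fraction of videos without flagged unsafe content" but not the exact formulas, so the choices below are assumptions: the bias score is taken to be the L1 distance between observed group frequencies and the uniform distribution, rescaled to [0, 1], and a video is treated as unsafe if any classifier in the ensemble flags it. All function names are illustrative.

```python
from collections import Counter


def uniformity_deviation(labels, groups):
    """Normalized L1 distance between group frequencies and uniform.

    Returns 0.0 for a perfectly balanced distribution and 1.0 when every
    label falls in a single group. Assumption: L1 distance is the
    deviation measure; the source does not name one.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(groups)
    l1 = sum(abs(counts[g] / n - 1 / k) for g in groups)
    max_l1 = 2 * (1 - 1 / k)  # value reached when one group holds all samples
    return l1 / max_l1


def safety_score(flags_per_video):
    """Fraction of videos that no classifier flagged.

    flags_per_video: one iterable of booleans per video, one boolean per
    classifier. Assumption: a single positive flag marks the video unsafe.
    """
    safe = sum(1 for flags in flags_per_video if not any(flags))
    return safe / len(flags_per_video)
```

For gender bias, `groups` would hold the inferred gender classes; for skin tone bias, the three merged Fitzpatrick groups.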

Current SOTA models yield variable results (e.g., culture fairness ~85%, safety 40-55%, with object/gender/skin bias scores above zero), indicating open research challenges in ethical deployment (Huang et al., 2024). A plausible implication is that while technical fidelity of synthetic video is advancing, robust mitigation of social biases and toxic outputs remains an active problem.

6. Insights, Comparative Analysis, and Limitations

Experimental analysis reveals trade-offs:

  • High subject/background consistency and low flickering are attainable (>90%) in static or quasi-static scenes, but dynamic degree (true motion) remains substantially lower (18-70%). This suggests leading systems often "cheat" temporal metrics by generating near-static outputs.
  • Multiple object and spatial relationship scores are markedly lower (18-68% for multiple objects, 18-74% for spatial relationships), especially when compared to image-generation models such as SDXL, which exceed T2V by 30-40 percentage points.
  • Artistic style and camera-motion following remain major weaknesses: style and temporal style scores are ≤30%.
  • I2V evaluation shows consistently high frame-image agreement (>90%), but camera motion accuracy is significantly lagging (<35%).

Limitations include constrained coverage of open-source models, a primary focus on T2V/I2V tasks (rather than, e.g., video-to-video, editing), and incomplete coverage of emerging evaluation needs: controllability, 3D consistency, long-range temporal coherence, and audio-visual alignment. Future extensions will address safety, identity privacy leakage, motion physicality, and broader cross-modal editing (Huang et al., 2024).

7. Benchmark Extensions, Community Infrastructure, and Future Directions

VBench++ is fully open-sourced, with code, prompts, image suite, metrics scripts, human-annotation data, and evaluation wrappers available under permissive licensing [https://github.com/Vchitect/VBench]. The leaderboard is maintained as a HuggingFace space, supporting continuous model addition via standardized adapters.

Planned extensions include controllable editing, support for audio synchronization and 3D consistency, patch-level credit assignment, dynamic prompt and paraphrase attacks for robustness, multi-annotator consensus statistics, and integration of learned reward models for agent-based evaluation. The framework already enables curation (e.g., pruning low-quality WebVid samples) and evaluation of model trustworthiness, and will incorporate further long-video and multi-modal generation scenarios (Huang et al., 2024).

DynamicEval, a derivative benchmark, targets VBench’s exposure to static or subject-centric scenarios by introducing camera-motion–focused prompts and improved pixel-level consistency metrics. DynamicEval demonstrates that debiased background error maps and object-based tracking yield 2–4% video-level accuracy improvements and 0.19–0.44 model-level correlation increases compared to VBench, particularly under non-static conditions (Babu et al., 8 Oct 2025). This suggests that continual evolution of evaluation metrics is necessary as generative capabilities advance.
