VBench++ Benchmark Suite

Updated 25 June 2026

VBench++ is a comprehensive benchmark suite that systematically evaluates video generative models using 16 disentangled quality dimensions and tailored prompts.
It decomposes video quality into intrinsic factors (temporal and frame-wise) and video–condition consistency (semantics and style), leveraging advanced metrics like DINO, CLIP, RAFT, and ViCLIP.
The suite also assesses trustworthiness by evaluating culture fairness, gender and skin tone bias, and safety, with open-source code and an evolving community leaderboard.

VBench++ is a comprehensive, open-source, human-aligned benchmark suite for systematic evaluation of video generative models. It extends the original VBench framework to provide multi-dimensional, disentangled, and hierarchical evaluation across both text-to-video (T2V) and image-to-video (I2V) synthesis tasks, complemented by trustworthiness analysis. VBench++ combines a large suite of tailored prompts, a battery of automated metrics with formal definitions, human-labeled pairwise preference data, and versatile leaderboard protocols to holistically assess technical performance and social reliability of generative systems (Huang et al., 2024).

1. Decomposition of Video Generation Quality

VBench++ hierarchically decomposes "video generation quality" into 16 formally defined, disentangled dimensions, each with a dedicated evaluation protocol and prompt suite. The top-level split distinguishes between intrinsic Video Quality (independent of prompt) and Video–Condition Consistency (alignment to the conditioning input).

Video Quality is further divided:

Temporal Quality (5 dimensions): subject consistency, background consistency, temporal flickering, motion smoothness, dynamic degree.
Frame-Wise Quality (2 dimensions): aesthetic quality, imaging quality.

Video–Condition Consistency comprises:

Semantics (6 dimensions): object class, multiple objects, human action, color, spatial relationship, scene.
Style (2 dimensions): appearance style, temporal style.
An additional Overall Consistency dimension measures combined semantics and style prompt alignment.

Each dimension possesses: (i) a specific definition, (ii) an automated metric (often leveraging DINO, CLIP, RAFT, ViCLIP, among others), and (iii) a suite of ~100 hand-crafted prompts that probe the relevant property in isolation, ensuring minimal metric confounding (Huang et al., 2024).
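The hierarchy above can be written down as a small nested mapping, which makes the count of 16 dimensions easy to verify. This is only an illustrative sketch: the dictionary layout, the snake_case names, and the helper `all_dimensions` are assumptions made for this example and are not taken from the VBench++ codebase.

```python
# Dimension taxonomy as listed in this section.
# The nested-dict layout and the identifier names are illustrative assumptions.
VBENCH_DIMENSIONS = {
    "video_quality": {
        "temporal": [
            "subject_consistency",
            "background_consistency",
            "temporal_flickering",
            "motion_smoothness",
            "dynamic_degree",
        ],
        "frame_wise": ["aesthetic_quality", "imaging_quality"],
    },
    "video_condition_consistency": {
        "semantics": [
            "object_class",
            "multiple_objects",
            "human_action",
            "color",
            "spatial_relationship",
            "scene",
        ],
        "style": ["appearance_style", "temporal_style"],
        "overall": ["overall_consistency"],
    },
}


def all_dimensions(taxonomy):
    """Flatten the two-level taxonomy into a flat list of dimension names."""
    return [
        name
        for groups in taxonomy.values()
        for names in groups.values()
        for name in names
    ]


print(len(all_dimensions(VBENCH_DIMENSIONS)))  # 16
```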

Video Quality Dimensions

Name	Definition	Metric/Feature Used
Subject Consistency	Stability of subject across frames	DINO ViT feature cosine similarity
Background Consistency	Background appearance invariance	CLIP image-embedding similarity
Temporal Flickering	High-frequency frame drift	Normalized mean L₁ frame difference
Motion Smoothness	Plausibility of trajectory	AMT frame-interpolation error
Dynamic Degree	Magnitude of movement	RAFT optical flow
Aesthetic Quality	Photographic appeal of frames	LAION aesthetic predictor
Imaging Quality	Technical defects in frames	MUSIQ score

Video–Condition Consistency Dimensions

Name	Definition	Metric/Feature Used
Object Class	Correct class generation	GRiT detection
Multiple Objects	Co-occurrence of all objects	GRiT detection
Human Action	Prompt-action execution	UMT classifier
Color	Color attribute fidelity	GRiT caption matching
Spatial Relationship	Object layout accuracy	Rule-based box coordinates
Scene	Global scene category fidelity	Tag2Text caption match
Appearance Style	Visual/art style correspondence	CLIP text-image cosine
Temporal Style	Camera motion style	ViCLIP video-text cosine
Overall Consistency	Combined semantics + style	ViCLIP prompt similarity

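For I2V, three additional dimensions are evaluated: per-frame consistency with the input image, scene consistency with the input image, and alignment with the requested camera motion. The image-consistency dimensions rely on DreamSim and the camera-motion dimension on CoTracker.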

2. Prompt Suite and Category Structure

For each of the 16 main dimensions, VBench++ employs approximately 100 specialized, handcrafted prompts designed to isolate that aspect of video synthesis without confounding factors. Categories include subject consistency (e.g., "a bear running across a field"), spatial relationship ("a red ball to the left of a blue cube"), style ("a sunflower in Van Gogh style"), and more.

Prompts are also organized into eight content categories—Animal, Architecture, Food, Human, Lifestyle, Plant, Scenery, Vehicles—to support per-category analysis. For I2V, the suite includes 1,000+ curated high-resolution images from Pexels/Pixabay and an adaptive cropping pipeline supporting 1:1 and 16:9 formats, ensuring fair aspect ratio coverage (Huang et al., 2024).
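To make the aspect-ratio handling concrete, the sketch below computes the largest centered crop box for a target ratio such as 1:1 or 16:9. It is a plain center crop, not the adaptive pipeline used by VBench++, whose details are not given here; the function name `center_crop_box` and its signature are hypothetical.

```python
def center_crop_box(width, height, ratio_w, ratio_h):
    """Return (left, top, right, bottom) of the largest centered box
    with aspect ratio ratio_w:ratio_h inside a width x height image.

    Assumption: a simple center crop, used only to illustrate the idea.
    """
    if width * ratio_h > height * ratio_w:
        # Image is wider than the target ratio: keep full height, trim width.
        new_h = height
        new_w = height * ratio_w // ratio_h
    else:
        # Image is taller than (or equal to) the target ratio: keep full width.
        new_w = width
        new_h = width * ratio_h // ratio_w
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return (left, top, left + new_w, top + new_h)


print(center_crop_box(1920, 1080, 1, 1))   # (420, 0, 1500, 1080)
print(center_crop_box(1920, 1080, 16, 9))  # (0, 0, 1920, 1080)
```

The returned tuple follows the common (left, top, right, bottom) box convention, so it can be handed to an image library's crop routine.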

3. Automated Metric Definitions and Evaluation Protocol

Each dimension's metric is precisely defined. Notable examples:

Subject Consistency:

$\mathrm{SubjCons} = \frac{1}{T-1}\sum_{t=1}^{T-1} \cos(f_t, f_{t+1})$

where $f_t$ is the DINO feature of frame $t$.
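A minimal sketch of this formula follows, assuming the per-frame features have already been extracted (for example by a DINO backbone) and are passed in as plain lists of floats. The function names are illustrative and do not come from the VBench++ code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def subject_consistency(features):
    """Mean cosine similarity between consecutive frame features.

    `features` is a list of T feature vectors, assumed precomputed.
    """
    if len(features) < 2:
        raise ValueError("need at least two frames")
    sims = [cosine(features[t], features[t + 1]) for t in range(len(features) - 1)]
    return sum(sims) / len(sims)


# Toy 2-D features: the first two frames agree, the third is orthogonal.
print(subject_consistency([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))  # 0.5
```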

Temporal Flickering:

$\mathrm{Flicker} = 1 - \frac{1}{T-1}\sum_{t=1}^{T-1} \frac{\|I_{t+1}-I_t\|_1}{\mathrm{max}_{\mathrm{norm}}}$
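The flickering score can be sketched as below. Two assumptions are made that the formula does not spell out: each frame is a flat list of pixel intensities, and the normalizer is the L₁ norm's maximum possible value, that is, the pixel count times the maximum intensity (255 for 8-bit images). Under these assumptions a perfectly static clip scores 1 and a clip that alternates between black and white scores 0.

```python
def temporal_flickering(frames, max_value=255.0):
    """1 minus the mean normalized absolute difference between consecutive frames.

    Assumptions: frames are flat lists of intensities in [0, max_value], and
    the L1 difference is divided by (pixel count * max_value).
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    normalized_diffs = []
    for prev, curr in zip(frames, frames[1:]):
        l1 = sum(abs(a - b) for a, b in zip(prev, curr))
        normalized_diffs.append(l1 / (len(prev) * max_value))
    return 1.0 - sum(normalized_diffs) / len(normalized_diffs)


print(temporal_flickering([[10, 20], [10, 20], [10, 20]]))  # 1.0
print(temporal_flickering([[0, 0], [255, 255]]))            # 0.0
```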

Dynamic Degree:

$\mathrm{Dynamics} = \frac{1}{T-1}\sum_{t=1}^{T-1}\|F_t\|_1$

where $F_t$ is the RAFT flow magnitude.
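As a rough sketch, the dynamic degree can be computed from precomputed flow fields as follows. The flow is assumed to be given as one list of (u, v) displacement pairs per consecutive frame pair (for example, exported from RAFT). As a further assumption, the per-pixel magnitudes are averaged rather than summed so that the value does not depend on resolution.

```python
import math


def dynamic_degree(flows):
    """Mean optical-flow magnitude over all consecutive frame pairs.

    `flows` holds T-1 flow fields; each field is a list of (u, v) tuples,
    one per pixel. Averaging over pixels is an assumption of this sketch.
    """
    if not flows:
        raise ValueError("need at least one flow field")
    per_pair = []
    for field in flows:
        magnitudes = [math.hypot(u, v) for u, v in field]
        per_pair.append(sum(magnitudes) / len(magnitudes))
    return sum(per_pair) / len(per_pair)


# Two frame pairs, two pixels each: only one pixel moves, by (3, 4).
print(dynamic_degree([[(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]))  # 1.25
```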

Model evaluation proceeds via: (1) video generation at native model parameters, (2) execution of metric scripts per dimension, (3) aggregation from video to prompt to global level, and (4) optional human preference annotation.
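Step (3), aggregation from video to prompt to global level, might look like the following for a single dimension. Plain averaging at both levels is an assumption of this sketch, and the function name `aggregate` is hypothetical; the benchmark may weight or normalize scores differently.

```python
from collections import defaultdict
from statistics import mean


def aggregate(video_scores):
    """Aggregate per-video scores to per-prompt means and one global mean.

    `video_scores` is an iterable of (prompt, score) pairs for one dimension.
    Averaging per prompt first keeps prompts with many videos from dominating.
    """
    by_prompt = defaultdict(list)
    for prompt, score in video_scores:
        by_prompt[prompt].append(score)
    per_prompt = {prompt: mean(scores) for prompt, scores in by_prompt.items()}
    return per_prompt, mean(list(per_prompt.values()))


# Made-up scores for two prompts.
per_prompt, overall = aggregate([
    ("a bear running", 0.8),
    ("a bear running", 0.6),
    ("a red ball", 1.0),
])
print(per_prompt, overall)
```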

Automated metrics correlate strongly with human preference: Spearman $\rho > 0.85$ (statistically significant, $p \ll 0.01$) across all dimensions, supporting their use as proxies for human judgment (Huang et al., 2024).
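For readers unfamiliar with the statistic, a Spearman correlation is the Pearson correlation of the ranks of two score lists. The sketch below is a standard-library implementation with average ranks for ties. The numbers in the usage example are made up for illustration and are not results from the paper.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation between two score lists."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model win ratios from a metric and from human annotators.
metric_scores = [0.62, 0.48, 0.55, 0.30]
human_scores = [0.70, 0.50, 0.45, 0.35]
print(round(spearman(metric_scores, human_scores), 2))  # 0.8
```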

4. Human Annotation and Alignment

VBench++ provides a human-labeled dataset featuring ≈48,000 pairwise video comparisons, each focused on a single dimension. For every prompt, multiple models produce several videos; annotators conduct side-by-side preference judgments, with video, order, and rating randomized and instructions dimension-specific. Ties are permitted. Each pair is typically labeled by several annotators and subjected to quality control via pre-trials, error thresholding (<10%), and periodic re-labeling.

The principal aggregation statistic is win-ratio (model’s fraction of preferred outcomes), computed both for metric-based and human-annotation-based assessments. High human-metric alignment substantiates the benchmark’s validity for research and model development.
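A win ratio can be computed from pairwise outcomes roughly as follows. The text permits ties but does not say how they are scored; this sketch assumes a tie counts as half a win for each side. The tuple format and the function name `win_ratios` are likewise hypothetical.

```python
from collections import defaultdict


def win_ratios(comparisons):
    """Per-model win ratio from pairwise comparisons.

    `comparisons` is an iterable of (model_a, model_b, outcome) with outcome
    in {"a", "b", "tie"}. Assumption: a tie gives each model half a win.
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for model_a, model_b, outcome in comparisons:
        totals[model_a] += 1
        totals[model_b] += 1
        if outcome == "a":
            wins[model_a] += 1.0
        elif outcome == "b":
            wins[model_b] += 1.0
        elif outcome == "tie":
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")
    return {model: wins[model] / totals[model] for model in totals}


# Made-up judgments between two hypothetical models X and Y.
print(win_ratios([
    ("X", "Y", "a"),
    ("X", "Y", "tie"),
    ("X", "Y", "b"),
    ("X", "Y", "a"),
]))  # {'X': 0.625, 'Y': 0.375}
```

The same function applies to metric-based comparisons if the outcome of each pair is decided by which video has the higher metric score.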

5. Trustworthiness and Fairness Dimensions

VBench++ evaluates not only conventional generation metrics but also four trustworthiness aspects:

Culture Fairness: Average cosine similarity between ViCLIP representations of videos and cultural prompts across nine cultures and multiple scenarios.
Gender Bias: Distributional parity in gender inference over demographically neutral prompts, scored via deviation from uniformity.
Skin Tone Bias: Balanced representation among three merged Fitzpatrick groups, again quantified via deviation from uniformity.
Safety: Fraction of videos without flagged unsafe content, detected by ensemble classifiers (NudeNet, SD Safety Checker, Q16).
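Two of these scores reduce to simple counting once classifier outputs are available. The sketch below makes two assumptions not stated above: "deviation from uniformity" is measured as a normalized L₁ distance (0 for a perfectly balanced distribution, 1 when every video falls into one group), and a video counts as unsafe if any classifier in the ensemble flags it. The exact formulas used by VBench++ may differ.

```python
def uniformity_deviation(counts):
    """Normalized L1 distance between observed group shares and uniform shares.

    `counts` has one entry per group (e.g. inferred gender or merged skin-tone
    group). Returns 0.0 when balanced and 1.0 when one group takes everything.
    """
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        raise ValueError("need at least two groups and one observation")
    l1 = sum(abs(c / total - 1 / k) for c in counts)
    return l1 / (2 * (1 - 1 / k))  # 2 * (1 - 1/k) is the largest possible L1 distance


def safety_score(flags_per_video):
    """Fraction of videos that no classifier flagged.

    `flags_per_video` holds one list of booleans per video, one boolean per
    classifier. Treating any single flag as unsafe is an assumption.
    """
    safe = sum(1 for flags in flags_per_video if not any(flags))
    return safe / len(flags_per_video)


print(uniformity_deviation([50, 50]))   # 0.0
print(uniformity_deviation([100, 0]))   # 1.0
print(safety_score([
    [False, False, False],
    [True, False, False],
    [False, False, False],
    [False, False, True],
]))  # 0.5
```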

Current SOTA models yield variable results (e.g., culture fairness ~85%, safety 40-55%, with object/gender/skin bias scores above zero), indicating open research challenges in ethical deployment (Huang et al., 2024). A plausible implication is that while technical fidelity of synthetic video is advancing, robust mitigation of social biases and toxic outputs remains an active problem.

6. Insights, Comparative Analysis, and Limitations

Experimental analysis reveals trade-offs:

High subject/background consistency and low flickering are attainable (>90%) in static or quasi-static scenes, but dynamic degree (true motion) remains substantially lower (18-70%). This suggests leading systems often "cheat" temporal metrics by generating near-static outputs.
Multiple object and spatial relationship scores are markedly lower (18-68% for multiple objects, 18-74% for spatial relationships), especially when compared to image-generation models such as SDXL, which exceed T2V by 30-40 percentage points.
Artistic style and camera-motion following remain major weaknesses: style and temporal style scores are ≤30%.
I2V evaluation shows consistently high frame-image agreement (>90%), but camera motion accuracy is significantly lagging (<35%).

Limitations include constrained coverage of open-source models, a primary focus on T2V/I2V tasks (rather than, e.g., video-to-video, editing), and incomplete coverage of emerging evaluation needs: controllability, 3D consistency, long-range temporal coherence, and audio-visual alignment. Future extensions will address safety, identity privacy leakage, motion physicality, and broader cross-modal editing (Huang et al., 2024).

7. Benchmark Extensions, Community Infrastructure, and Future Directions

VBench++ is fully open-sourced, with code, prompts, image suite, metrics scripts, human-annotation data, and evaluation wrappers available under permissive licensing [https://github.com/Vchitect/VBench]. The leaderboard is maintained as a HuggingFace space, supporting continuous model addition via standardized adapters.

Planned extensions include controllable editing, support for audio synchronization and 3D consistency, patch-level credit assignment, dynamic prompt and paraphrase attacks for robustness, multi-annotator consensus statistics, and integration of learned reward models for agent-based evaluation. The framework already enables curation (e.g., pruning low-quality WebVid samples) and evaluation of model trustworthiness, and will incorporate further long-video and multi-modal generation scenarios (Huang et al., 2024).

DynamicEval, a derivative benchmark, targets VBench’s exposure to static or subject-centric scenarios by introducing camera-motion–focused prompts and improved pixel-level consistency metrics. DynamicEval demonstrates that debiased background error maps and object-based tracking yield 2–4% video-level accuracy improvements and 0.19–0.44 model-level correlation increases compared to VBench, particularly under non-static conditions (Babu et al., 8 Oct 2025). This suggests that continual evolution of evaluation metrics is necessary as generative capabilities advance.

References:

VBench: Comprehensive Benchmark Suite for Video Generative Models (Huang et al., 2023)
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models (Huang et al., 2024)
DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis (Babu et al., 8 Oct 2025)
