VMBench: Video Motion Benchmark

Updated 4 February 2026
  • VMBench is a benchmark that evaluates video motion quality using a suite of perception-driven metrics (CAS, MSS, OIS, PAS, and TCS) aligned with human assessments.
  • Its Meta-Guided Motion Prompt Generation pipeline creates and refines over 1,000 validated prompts from diverse datasets, ensuring physical plausibility and semantic coherence.
  • The framework provides detailed diagnostics of T2V models by quantifying motion strengths and pinpointing failure modes such as object persistence issues and temporal artifacts.

VMBench is a comprehensive benchmark for evaluating video motion generation with metrics and prompts designed to align with nuanced human perception of motion. Unlike prior benchmarks, VMBench explicitly decomposes the assessment of motion in generated videos into perception-driven, fine-grained criteria, provides a structured pipeline for diverse, validated motion prompts, and introduces a human-validated evaluation process. This enables rigorous quantification and diagnosis of motion quality in text-to-video (T2V) models, offering novel tools for advancing motion-specific capabilities in generative video modeling (Ling et al., 13 Mar 2025).

1. Perception-Driven Motion Evaluation Metrics

Central to VMBench is the Perception-Driven Motion Metrics (PMM) suite, capturing five distinct facets of motion quality, each linked to stages of human perceptual assessment:

  • Commonsense Adherence Score (CAS):

$$\text{CAS} = \sum_{i=1}^{5} p_i \, G(i)$$

where $p_i$ is the predicted probability that a video belongs to class $i$, and $G(i)$ assigns empirical weights $\{0, 0.25, 0.5, 0.75, 1\}$ mapping semantic gradations from "Bad" to "Perfect." CAS reflects global physical plausibility.

  • Motion Smoothness Score (MSS):

$$\text{MSS} = 1 - \frac{1}{T}\sum_{t=2}^{T} \mathbb{I}\big(\Delta Q_t > \tau_s(t)\big)$$

where $\Delta Q_t$ is the inter-frame drop of a learned quality score and $\tau_s(t)$ an adaptive threshold. MSS directly targets temporal artifacts and visual discontinuities.

  • Object Integrity Score (OIS):

$$\text{OIS} = \frac{1}{F \cdot K}\sum_{f=1}^{F}\sum_{k=1}^{K} \mathbb{I}\big(\mathcal{D}_f^{(k)} \leq \tau^{(k)}\big)$$

Here, $\mathcal{D}_f^{(k)}$ measures deviation in the length and angles of tracked anatomical components (e.g., limbs), quantifying biological or mechanical plausibility.

  • Perceptible Amplitude Score (PAS):

$$\text{PAS} = \frac{1}{T}\sum_{t=1}^{T} \min\left(\frac{\bar{D}_t}{\tau_s}, 1\right)$$

with $\bar{D}_t$ the mean displacement of keypoints at frame $t$; PAS discerns whether the primary subject's movement surpasses perceptual thresholds, discriminating static or minimally dynamic videos.

  • Temporal Coherence Score (TCS):

$$\text{TCS} = 1 - \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\big(\mathcal{A}_i \land \neg \mathcal{R}\big)$$

where $\mathcal{A}_i$ identifies unjustified disappearances/reappearances and $\mathcal{R}$ encodes legitimate occlusion/exit. TCS targets unnatural temporal discontinuities.

The PMM suite collectively enables fine-grained, interpretable diagnoses of motion quality, with each metric shown to correlate strongly with human judgment, achieving an average improvement in Spearman’s rank correlation ($\rho$) of 35.3 percentage points over the previous best metrics (Ling et al., 13 Mar 2025).
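
As a concrete illustration, the formulas above can be sketched directly in NumPy. The snippet below assumes the upstream perception signals (class probabilities, per-frame quality scores, component deviations, keypoint displacements, and anomaly flags) are already available from the detectors and trackers VMBench builds on; it is an illustrative sketch, not the authors' reference implementation.

```python
import numpy as np

def cas(class_probs):
    """Commonsense Adherence Score: expected grade over five quality classes.

    class_probs: shape (5,), predicted probabilities for classes ordered
    from "Bad" to "Perfect" (assumed to come from a video quality classifier).
    """
    grades = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # G(i)
    return float(np.dot(class_probs, grades))

def mss(quality_scores, thresholds):
    """Motion Smoothness Score: fraction of frames without abrupt quality drops.

    quality_scores: per-frame quality scores Q_t, shape (T,).
    thresholds: adaptive thresholds tau_s(t) for t = 2..T, shape (T-1,).
    """
    drops = quality_scores[:-1] - quality_scores[1:]   # Delta Q_t for t = 2..T
    violations = (drops > thresholds).sum()
    return 1.0 - violations / len(quality_scores)

def ois(deviations, tolerances):
    """Object Integrity Score: share of (frame, component) pairs within tolerance.

    deviations: shape (F, K), deviation of each tracked component in each frame.
    tolerances: shape (K,), per-component thresholds tau^(k).
    """
    return float((deviations <= tolerances).mean())

def pas(mean_displacements, tau_s):
    """Perceptible Amplitude Score: mean keypoint displacement, clipped at 1."""
    return float(np.minimum(mean_displacements / tau_s, 1.0).mean())

def tcs(anomalies, justified):
    """Temporal Coherence Score: 1 minus the rate of unjustified
    disappearances/reappearances; both arguments are boolean arrays of shape (N,)."""
    return 1.0 - float((anomalies & ~justified).mean())
```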

2. Meta-Guided Motion Prompt Generation

VMBench implements a structured pipeline—Meta-Guided Motion Prompt Generation (MMPG)—to create a large, diverse, and physically plausible prompt library for motion evaluation:

  • Meta-information Extraction: Prompts are generated from metadata triples of Subject (S), Place (P), and Action (A), parsed and expanded from datasets such as VidProM, MSR-VTT, DiDeMo, WebVid, Places365, and Kinetics-700 using Qwen-2.5 and further extended with GPT-4o. This supports compositional generalization across subject types, locations, and actions.
  • Self-Refining Prompt Generation: LLMs draft textual descriptions for sampled $(S, P, A)$ combinations, iteratively refining them for semantic coherence and physical plausibility, leading to a candidate pool of ~50,000 prompts.
  • Human–LLM Joint Validation: DeepSeek-R1 LLM reward models provide initial plausibility filtering. Human annotators then vet top-ranked prompts for real-world feasibility, yielding 1,050 curated motion prompts.

This approach ensures broad coverage across the motion spectrum, with empirical human validation that eliminates physically impossible or semantically incoherent scenarios (Ling et al., 13 Mar 2025).
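
A compressed sketch of this flow is given below. The callables `draft`, `refine`, `score`, and `human_validate` are hypothetical stand-ins for the Qwen-2.5/GPT-4o drafting and refinement steps, the DeepSeek-R1 reward model, and the annotator review described above; only the overall control flow is meant to be illustrative.

```python
import random

def mmpg_pipeline(subjects, places, actions, draft, refine, score,
                  human_validate, n_candidates=50_000, n_final=1_050):
    """Illustrative MMPG-style flow: sample (S, P, A) triples, draft and
    self-refine prompts with an LLM, rank them with a reward model, and
    keep only the human-validated top prompts."""
    candidates = []
    for _ in range(n_candidates):
        s, p, a = (random.choice(subjects),
                   random.choice(places),
                   random.choice(actions))
        prompt = draft(s, p, a)        # initial textual scenario for (S, P, A)
        prompt = refine(prompt)        # iterate for coherence and plausibility
        candidates.append((score(prompt), prompt))

    # reward-model filtering: pass only the top-ranked candidates to annotators
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    shortlist = [p for _, p in candidates[: 5 * n_final]]

    # annotators vet prompts for real-world feasibility (placeholder callable)
    return human_validate(shortlist)[:n_final]
```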

3. Multi-Level Motion Prompt Library Structure

The prompt library in VMBench is hierarchically organized and comprehensive in both breadth and granularity:

  • Dynamic Scene Dimensions: Prompts are categorized according to six locomotion and interaction modes: Fluid Dynamics, Biological Motion, Mechanical Motion, Weather Phenomena, Collective Behavior, and Energy Transfer.
  • Granularity Levels:
    • Level 1: Metadata triples $(S, P, A)$
    • Level 2: LLM-generated textual scenarios (10–60 words)
    • Level 3: Final validated prompts

The arrangement across 969 raw categories and 1,050 validated prompts enables targeted and diverse stress-testing of motion reasoning in generative video models (Ling et al., 13 Mar 2025).
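
In code, the three granularity levels can be pictured as nested records. The dataclasses below are hypothetical and only illustrate the hierarchy; they do not reflect the benchmark's actual storage format.

```python
from dataclasses import dataclass

@dataclass
class MetaTriple:            # Level 1: metadata triple
    subject: str
    place: str
    action: str

@dataclass
class MotionPrompt:          # Levels 2-3: drafted scenario plus validation status
    triple: MetaTriple
    scenario: str            # LLM-generated description (10-60 words)
    dimension: str           # one of the six dynamic scene dimensions
    validated: bool = False  # True once human-LLM joint validation passes

# Hypothetical example entry
example = MotionPrompt(
    triple=MetaTriple(subject="a horse", place="a beach", action="galloping"),
    scenario="A horse gallops along the shoreline, kicking up wet sand as waves roll in.",
    dimension="Biological Motion",
    validated=True,
)
```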

4. Human-Aligned Validation and Annotation Protocol

VMBench provides a dual-layered validation mechanism to maximize correlation with human assessment:

  • Human Preference Annotations: Annotators rate T2V model outputs across the five PMM dimensions using a Likert scale; 1,200 unique videos sampled from six T2V models are rated to construct robust pairwise preference ground-truth labels.
  • Quantitative Ground Truth: For each human-annotated prompt, all $\binom{6}{2} = 15$ model output pairs are compared, with preferences determined by aggregate scores.
  • Correlation Analysis: The human scoring correlates tightly with PMM (average $\rho \times 100 = 62.2$), far exceeding baseline metrics (InternVideo2.5: 26.9), confirming the benchmark's human alignment (Ling et al., 13 Mar 2025).
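
This analysis can be approximated in a few lines of SciPy: compute Spearman's $\rho$ between a metric's per-video scores and the averaged human ratings, and check how often the metric agrees with the human preference on each of the 15 model pairs. The exact aggregation protocol used in the paper may differ; this is an assumed simplification.

```python
from itertools import combinations
from scipy.stats import spearmanr

def alignment_rho(metric_scores, human_scores):
    """Spearman correlation between per-video metric scores and mean human ratings."""
    rho, _ = spearmanr(metric_scores, human_scores)
    return rho

def pairwise_agreement(model_metric, model_human):
    """Fraction of the C(6, 2) = 15 model pairs on which the metric's preference
    matches the human preference (both arguments: dict model name -> aggregate score)."""
    pairs = list(combinations(sorted(model_metric), 2))
    agree = sum(
        (model_metric[a] - model_metric[b]) * (model_human[a] - model_human[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)
```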

5. Diagnostic Results and Model Evaluation

VMBench supports detailed comparative analysis across T2V models and elucidates specific motion-related failure modes:

  • Superior Human Alignment: PMM outperforms both rule-based (e.g., RAFT, SSIM, CLIP, DINO, AMT) and MLLM-based (LLaVA-NEXT, MiniCPM-V) scoring in all five dimensions.
  • Per-Dimension Performance: Spearman $\rho$ (multiplied by 100) for the individual metrics: CAS 69.9, MSS 77.1, OIS 65.8, PAS 65.2, TCS 54.5.
  • Model Ranking: Among six open-source models, Wan2.1 ranks highest (78.4% avg.), with HunyuanVideo and CogVideoX following.
  • Identified Failure Modes:
  1. Object Persistence Paradox – IDs switch or objects vanish.
  2. Structural Degeneration – limb warping.
  3. Temporal Artifacts – frame blending.
  4. Newtonian Violations – physical implausibilities and energy non-conservation.

The benchmark thus pinpoints domains where current T2V models remain deficient and where future architecture or training advances are most likely to yield perceptual improvements (Ling et al., 13 Mar 2025).

6. Extensions and Implications for Future Research

VMBench establishes new standards for perception-aligned video motion evaluation, offering actionable guidance for model development and future benchmark expansion:

  • Perception-aligned Diagnostics: The PMM metrics permit precise attribution of model failures to perceptual motion criteria, guiding targeted architecture and training interventions.
  • Prompt Library Utility: The validated prompt set may serve as a stress test for novel motion reasoning abilities or scaling tests for emerging T2V models.
  • Benchmark Scalability: VMBench's joint LLM-human curation and annotation pipeline offers a scalable path for iteratively expanding or adapting the benchmark to new modalities or generative settings.
  • Research Tracking: By standardizing motion evaluation on human-validated metrics and prompts, VMBench enables rigorous quantitative tracking of progress in motion generation.

VMBench thus serves both as a practical evaluation suite and as an impetus toward more human-aligned, interpretable, and diagnostically robust research in video motion generation (Ling et al., 13 Mar 2025).
