MoveBench: Benchmark for Motion Video Generation
- MoveBench is a rigorously curated benchmark suite that provides a scalable, high-quality dataset for evaluating motion-controllable video generation.
- It integrates a hybrid human+SAM annotation pipeline and CoTracker-based dense point trajectories to ensure precise motion and segmentation data.
- Benchmark results demonstrate measurable differences in visual fidelity and motion accuracy across both single- and multi-object scenarios, with Wan-Move leading all reported baselines.
MoveBench is a rigorously curated benchmark suite designed to support evaluation and research in motion-controllable video generation. Built to address limitations of previous benchmarks—specifically insufficient scale, short clip duration, restricted content diversity, and inadequate motion annotation quality—MoveBench provides a high-resolution, diverse, and finely annotated dataset for quantitative and qualitative assessment of generative models capable of controllable motion synthesis (Chu et al., 9 Dec 2025).
1. Motivation and Rationale
Existing benchmarks for motion-controllable video generation, such as DAVIS, VIPSeg, and MagicBench, typically comprise only a few hundred clips, short video durations (1–2 seconds), limited object and action diversity, and sparse or noisy annotations. These deficiencies inhibit rigorous evaluation, especially for models targeting fine-grained, precise, and scalable motion control. MoveBench is specifically designed to overcome these limitations:
- Sufficient scale: 1,018 videos, each 5 seconds (81 frames at 16 fps)
- Uniform high resolution: 480 × 832 pixels
- Comprehensive coverage: 54 distinct content categories, each with 15–25 balanced exemplars
- Dual-mode high-quality annotation: both dense point trajectories and sparse segmentation masks
- Annotation accuracy: hybrid human + Segment-Anything Model (SAM) pipeline

The benchmark is distributed under a free license and includes all assets necessary for reproducible, cross-comparable evaluation of motion-controllable video generative models (Chu et al., 9 Dec 2025).
2. Dataset Construction and Annotation Pipeline
All videos in MoveBench are sourced from the Pexels-400k collection (approx. 400,000 free-license videos). The dataset curation proceeds in four stages:
- Quality scoring: An expert-labeled set of 1,000 clips is used to train a visual-quality classifier. This classifier filters out low-quality content.
- Temporal consistency: SigLIP features are extracted frame-wise. Clips where the mean cosine similarity between the first and subsequent frames falls below a set threshold are excluded.
- Frame sampling and resizing: Videos are uniformly cropped/resized to 480 × 832 resolution and 81 frames per clip (5 seconds).
- Content clustering and manual selection: Mean SigLIP embeddings from 16 frames per clip are clustered via K-means (K = 54). Within each cluster, 15–25 videos are manually selected for category diversity and representativeness.
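The temporal-consistency filter in the pipeline above can be sketched with precomputed frame embeddings. In this sketch, the 0.85 threshold and 512-dimensional embeddings are illustrative assumptions; the paper's actual threshold and SigLIP feature extraction are not reproduced here:

```python
import numpy as np

def passes_temporal_consistency(frame_embeddings: np.ndarray,
                                threshold: float = 0.85) -> bool:
    """Keep a clip only if the mean cosine similarity between the first
    frame's embedding and every subsequent frame's embedding stays above
    `threshold` (the threshold value is a hypothetical placeholder)."""
    # Normalize each frame embedding to unit length.
    normed = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1,
                                               keepdims=True)
    # Cosine similarity of frame 0 against frames 1..T-1.
    sims = normed[1:] @ normed[0]
    return float(sims.mean()) >= threshold

# Example: a perfectly static clip (identical embeddings) always passes.
rng = np.random.default_rng(0)
static = np.tile(rng.normal(size=(1, 512)), (81, 1))
print(passes_temporal_consistency(static))  # True
```

Comparing every frame against the first (rather than adjacent pairs) penalizes clips that drift away from their opening content, which matches the first-frame-conditioned generation protocol used later.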
Annotation follows a hybrid human+SAM procedure:
- Annotators select objects in the first frame; SAM automatically generates an initial segmentation mask.
- Optional negative point feedback is used to refine over-segmented regions.
- All moving objects are annotated (with unique IDs per object).
- Automatic trajectory extraction: CoTracker estimates 1–1,024 dense 2D point trajectories per mask across all frames.
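The trajectory-extraction step seeds a point tracker with query points sampled inside each object mask. The sketch below shows only the sampling stage; the CoTracker call itself is omitted, and the 1,024-point cap follows the range stated above:

```python
import numpy as np

def sample_query_points(mask: np.ndarray, max_points: int = 1024,
                        seed: int = 0) -> np.ndarray:
    """Sample up to `max_points` (x, y) query points inside a binary
    object mask; these would seed a point tracker such as CoTracker
    (the tracker invocation is not shown)."""
    ys, xs = np.nonzero(mask)
    n = min(max_points, len(xs))
    idx = np.random.default_rng(seed).choice(len(xs), size=n, replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)  # shape (n, 2)

# Hypothetical object region inside a 480 x 832 frame.
mask = np.zeros((480, 832), dtype=bool)
mask[100:200, 300:500] = True
pts = sample_query_points(mask)
print(pts.shape)  # (1024, 2)
```

Because `n` is clamped to the mask's pixel count, small objects automatically receive fewer trajectories, consistent with the 1–1,024 range per mask.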
Each video in MoveBench is accompanied by:
- One or more per-object segmentation masks (PNG or RLE format)
- Dense point trajectories for each annotated object

A subset of 192 videos is explicitly annotated for simultaneous multi-object motion, enabling multi-object motion control evaluation tasks.
3. Evaluation Protocols and Tasks
MoveBench is designed purely as an evaluation benchmark: the entire dataset serves as the test set, and no train/val splits are provided. The supported evaluation regimes include:
- Single-object trajectory-guided video generation
- Multi-object trajectory-guided generation (192-clip subset)
- Sparse control generation: using masks or bounding box conditions
- Human two-alternative forced choice (2AFC) studies focused on motion accuracy, motion quality, and overall visual quality

For cross-dataset validation, results on MoveBench may be compared with those on DAVIS under analogous protocols.
For each generative model under test, the protocol is:
- Generate 5-second videos at 480 × 832 using provided initial frame and trajectory/mask conditions
- Compute all quantitative metrics over all 1,018 clips and, separately, over the 192-video multi-object subset
- Optionally conduct human 2AFC studies to supplement quantitative metrics
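The protocol steps above amount to a simple loop over clips with metric aggregation. This is a hypothetical sketch: the `generate` callable, the clip dictionary keys, and the metric signature are assumptions, not the official MoveBench harness:

```python
def evaluate_model(generate, clips, metric_fns):
    """Run a model over all benchmark clips and average each metric.
    `generate(first_frame, trajectories)` and the clip/metric formats
    are illustrative assumptions."""
    totals = {name: 0.0 for name in metric_fns}
    for clip in clips:
        # Condition generation on the provided first frame and trajectories.
        video = generate(clip["first_frame"], clip["trajectories"])
        for name, fn in metric_fns.items():
            totals[name] += fn(video, clip)
    # Report the mean of each metric over all clips.
    return {name: total / len(clips) for name, total in totals.items()}

# Dummy model and metric to illustrate the call pattern.
clips = [{"first_frame": None, "trajectories": None}] * 4
scores = evaluate_model(
    generate=lambda frame, traj: "video",
    clips=clips,
    metric_fns={"epe": lambda video, clip: 2.5},
)
print(scores)  # {'epe': 2.5}
```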
4. Quantitative Evaluation Metrics
MoveBench adopts established and specialized metrics to capture model performance across both perceptual fidelity and motion controllability axes:
Visual Fidelity and Temporal Consistency:
- Fréchet Inception Distance (FID): distributional distance between Inception features of generated and reference frames (lower is better).
- Fréchet Video Distance (FVD) [Unterthiner et al., ICLR’19]: the same distributional distance computed on spatiotemporal video features, so it also reflects temporal coherence (lower is better).
- Peak Signal-to-Noise Ratio (PSNR): per-frame pixel-level reconstruction fidelity against the reference video (higher is better).
- Structural Similarity Index (SSIM) [Wang et al., TIP’04]: perceptual similarity based on local luminance, contrast, and structure (higher is better).
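For reference, FID (and FVD, on video features) compares Gaussian fits of the real and generated feature distributions via the standard closed form:

$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$

where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of features extracted from real and generated content, respectively.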
Motion Accuracy:
- End-Point Error (EPE): mean Euclidean distance between ground-truth and estimated point tracks,

  $\mathrm{EPE} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left\lVert \hat{p}_{i,t} - p_{i,t} \right\rVert_2,$

  where $p_{i,t}$ are ground-truth point tracks and $\hat{p}_{i,t}$ are estimated from the generated video frames.
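EPE can be computed directly from aligned track arrays; in this minimal sketch the `(T, N, 2)` layout (frames, points, xy) is an assumption:

```python
import numpy as np

def end_point_error(gt_tracks: np.ndarray, est_tracks: np.ndarray) -> float:
    """Mean Euclidean distance between ground-truth and estimated 2D
    point tracks, averaged over all frames and points (minimal EPE
    sketch; the (T, N, 2) array layout is an assumption)."""
    assert gt_tracks.shape == est_tracks.shape
    # Per-point, per-frame L2 distance, then a global mean.
    return float(np.linalg.norm(gt_tracks - est_tracks, axis=-1).mean())

# Every tracked point displaced by (3, 3) pixels -> EPE = sqrt(18).
gt = np.zeros((81, 64, 2))
est = np.full((81, 64, 2), 3.0)
print(round(end_point_error(gt, est), 3))  # 4.243
```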
Human Study Protocols: Human 2AFC studies on motion accuracy, motion quality, and visual quality complement the above quantitative metrics.
5. Benchmark Results and Baseline Comparisons
Benchmarking on MoveBench yields the following comparative results (lower FID, FVD, EPE; higher PSNR, SSIM are preferable):
Table 1: Single-object results (entire MoveBench)
| Method | FID | FVD | PSNR | SSIM | EPE |
|---|---|---|---|---|---|
| ImageConductor | 34.5 | 424.0 | 13.4 | 0.49 | 15.6 |
| LeviTor | 18.1 | 98.8 | 15.6 | 0.54 | 3.4 |
| Tora | 22.5 | 100.4 | 15.7 | 0.55 | 3.3 |
| MagicMotion | 17.5 | 96.7 | 14.9 | 0.56 | 3.2 |
| Wan-Move | 12.2 | 83.5 | 17.8 | 0.64 | 2.6 |
Table 2: Multi-object subset (192 videos)
| Method | FID | FVD | PSNR | SSIM | EPE |
|---|---|---|---|---|---|
| ImageConductor | 77.5 | 764.5 | 13.9 | 0.51 | 9.8 |
| Tora | 53.2 | 350.0 | 14.5 | 0.54 | 3.5 |
| Wan-Move | 28.8 | 226.3 | 16.7 | 0.62 | 2.2 |
Table 3: Human 2AFC win rates (percentage of comparisons in which Wan-Move is preferred over each listed method)
| Method | Motion Acc. | Motion Qual. | Visual Qual. |
|---|---|---|---|
| LeviTor | 98.2% | 98.0% | 98.8% |
| Tora | 96.2% | 93.8% | 98.4% |
| MagicMotion | 89.4% | 96.4% | 98.2% |
| Kling 1.5 Pro | 47.8% | 53.4% | 50.2% |
| Wan-Move | — | — | — |
MoveBench enables fine-grained and statistically rigorous comparison of generative models with respect to both visual realism and explicit motion controllability (Chu et al., 9 Dec 2025).
6. Usage, Extensibility, and Reporting
To use MoveBench:
- Download the full benchmark suite (videos and annotations).
- For each tested model, generate videos at the required resolution and duration, conditioned on provided masks or point trajectories.
- Report FID, FVD, PSNR, SSIM, and EPE on the 1,018-clip test set and the 192-video multi-object subset.
- Optionally extend the analysis with human 2AFC studies on motion and visual quality.
- For custom motion control formats (e.g., skeletal, flow), convert annotation to the provided mask/trajectory types.
- Use the 54-category taxonomy to analyze and report per-category performance, highlighting domain-specific strengths and weaknesses and ensuring balanced coverage.
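Per-category breakdowns over the 54-category taxonomy reduce to a grouping-and-averaging pass; the `(category, metric_value)` record format here is a hypothetical assumption about how per-clip results are stored:

```python
from collections import defaultdict

def per_category_report(results):
    """Average a per-clip metric by content category (hypothetical
    record format: each result is a (category, metric_value) pair)."""
    buckets = defaultdict(list)
    for category, value in results:
        buckets[category].append(value)
    # Sort categories for stable, comparable reporting.
    return {c: sum(v) / len(v) for c, v in sorted(buckets.items())}

report = per_category_report([
    ("animals", 2.0), ("animals", 4.0), ("vehicles", 1.0),
])
print(report)  # {'animals': 3.0, 'vehicles': 1.0}
```

Reporting per-category means alongside the global average surfaces domain-specific weaknesses that the 1,018-clip aggregate can mask.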
7. Broader Context and Impact
MoveBench represents a substantial advancement in the benchmarking of motion-controllable video generation. By providing dense, dual-modality motion annotation and comprehensive scenario diversity at scale, it enables robust, reproducible, and fine-grained assessment across a wide spectrum of controllable generation tasks, facilitating accelerated progress in precision video synthesis research (Chu et al., 9 Dec 2025). Its availability under a free license and the inclusion of scalable, high-quality annotation workflows (e.g., hybrid human+SAM, CoTracker) set a new standard for the field.
A plausible implication is that MoveBench’s dual-mode annotation (trajectories and masks) may support cross-comparison of both dense and sparse motion control paradigms within a single unified protocol, supporting broader generalization claims in future model developments.