MoveBench: Benchmark for Motion Video Generation
- MoveBench is a rigorously curated benchmark suite that provides a scalable, high-quality dataset for evaluating motion-controllable video generation.
- It integrates a hybrid human+SAM annotation pipeline and CoTracker-based dense point trajectories to ensure precise motion and segmentation data.
- Benchmark results demonstrate measurable differences in visual fidelity and motion accuracy across both single- and multi-object scenarios, with Wan-Move leading all reported baselines.
MoveBench is a rigorously curated benchmark suite designed to support evaluation and research in motion-controllable video generation. Built to address limitations of previous benchmarks—specifically insufficient scale, short clip duration, restricted content diversity, and inadequate motion annotation quality—MoveBench provides a high-resolution, diverse, and finely annotated dataset for quantitative and qualitative assessment of generative models capable of controllable motion synthesis (Chu et al., 9 Dec 2025).
1. Motivation and Rationale
Existing benchmarks for motion-controllable video generation, such as DAVIS, VIPSeg, and MagicBench, typically comprise only a few hundred clips, short video durations (1–2 seconds), limited object and action diversity, and sparse or noisy annotations. These deficiencies inhibit rigorous evaluation, especially for models targeting fine-grained, precise, and scalable motion control. MoveBench is specifically designed to overcome these limitations:
- Sufficient scale: 1,018 videos, each 5 seconds (81 frames at 16 fps)
- Uniform high resolution: 480 × 832 pixels
- Comprehensive coverage: 54 distinct content categories, each with 15–25 balanced exemplars
- Dual-mode high-quality annotation: both dense point trajectories and sparse segmentation masks
- Annotation accuracy: hybrid human + Segment-Anything Model (SAM) pipeline

The benchmark is distributed under a free license and includes all assets necessary for reproducible, cross-comparable evaluation of motion-controllable video generative models (Chu et al., 9 Dec 2025).
2. Dataset Construction and Annotation Pipeline
All videos in MoveBench are sourced from the Pexels-400k collection (approx. 400,000 free-license videos). The dataset curation proceeds in four stages:
- Quality scoring: An expert-labeled set of 1,000 clips is used to train a visual-quality classifier. This classifier filters out low-quality content.
- Temporal consistency: SigLIP features are extracted frame-wise. Clips where the mean cosine similarity between the first and subsequent frames falls below a set threshold are excluded.
- Frame sampling and resizing: Videos are uniformly cropped/resized to 480 × 832 resolution and 81 frames per clip (5 seconds).
- Content clustering and manual selection: Mean SigLIP embeddings from 16 frames per clip are clustered via K-means (K = 54). Within each cluster, 15–25 videos are manually selected for category diversity and representativeness.
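The temporal-consistency filter in the pipeline above can be sketched with precomputed frame embeddings. In this sketch, the 0.85 threshold and 512-dimensional embeddings are illustrative assumptions; the paper's actual threshold and SigLIP feature extraction are not reproduced here:

```python
import numpy as np

def passes_temporal_consistency(frame_embeddings: np.ndarray,
                                threshold: float = 0.85) -> bool:
    """Keep a clip only if the mean cosine similarity between the first
    frame's embedding and every subsequent frame's embedding stays above
    `threshold` (the threshold value is a hypothetical placeholder)."""
    # Normalize each frame embedding to unit length.
    normed = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1,
                                               keepdims=True)
    # Cosine similarity of frame 0 against frames 1..T-1.
    sims = normed[1:] @ normed[0]
    return float(sims.mean()) >= threshold

# Example: a perfectly static clip (identical embeddings) always passes.
rng = np.random.default_rng(0)
static = np.tile(rng.normal(size=(1, 512)), (81, 1))
print(passes_temporal_consistency(static))  # True
```

Comparing every frame against the first (rather than adjacent pairs) penalizes clips that drift away from their opening content, which matches the first-frame-conditioned generation protocol used later.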
Annotation follows a hybrid human+SAM procedure:
- Annotators select objects in the first frame; SAM automatically generates an initial segmentation mask.
- Optional negative point feedback is used to refine over-segmented regions.
- All moving objects are annotated (with unique IDs per object).
- Automatic trajectory extraction: CoTracker estimates 1–1,024 dense 2D point trajectories per mask across all frames.
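The trajectory-extraction step seeds a point tracker with query points sampled inside each object mask. The sketch below shows only the sampling stage; the CoTracker call itself is omitted, and the 1,024-point cap follows the range stated above:

```python
import numpy as np

def sample_query_points(mask: np.ndarray, max_points: int = 1024,
                        seed: int = 0) -> np.ndarray:
    """Sample up to `max_points` (x, y) query points inside a binary
    object mask; these would seed a point tracker such as CoTracker
    (the tracker invocation is not shown)."""
    ys, xs = np.nonzero(mask)
    n = min(max_points, len(xs))
    idx = np.random.default_rng(seed).choice(len(xs), size=n, replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)  # shape (n, 2)

# Hypothetical object region inside a 480 x 832 frame.
mask = np.zeros((480, 832), dtype=bool)
mask[100:200, 300:500] = True
pts = sample_query_points(mask)
print(pts.shape)  # (1024, 2)
```

Because `n` is clamped to the mask's pixel count, small objects automatically receive fewer trajectories, consistent with the 1–1,024 range per mask.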
Each video in MoveBench is accompanied by:
- One or more per-object segmentation masks (PNG or RLE format)
- Dense point trajectories for each annotated object

A subset of 192 videos is explicitly annotated for simultaneous multi-object motion, enabling multi-object motion control evaluation tasks.
3. Evaluation Protocols and Tasks
MoveBench is designed purely as an evaluation benchmark: the entire dataset serves as the test set, and no train/val splits are provided. The supported evaluation regimes include:
- Single-object trajectory-guided video generation
- Multi-object trajectory-guided generation (192-clip subset)
- Sparse control generation: using masks or bounding box conditions
- Human two-alternative forced choice (2AFC) studies focused on motion accuracy, motion quality, and overall visual quality

For cross-dataset validation, results on MoveBench may be compared with those on DAVIS under analogous protocols.
For each generative model under test, the protocol is:
- Generate 5-second videos at 480 × 832 using provided initial frame and trajectory/mask conditions
- Compute all quantitative metrics over all 1,018 clips and, separately, over the 192-video multi-object subset
- Optionally conduct human 2AFC studies to supplement quantitative metrics
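The protocol steps above amount to a simple loop over clips with metric aggregation. This is a hypothetical sketch: the `generate` callable, the clip dictionary keys, and the metric signature are assumptions, not the official MoveBench harness:

```python
def evaluate_model(generate, clips, metric_fns):
    """Run a model over all benchmark clips and average each metric.
    `generate(first_frame, trajectories)` and the clip/metric formats
    are illustrative assumptions."""
    totals = {name: 0.0 for name in metric_fns}
    for clip in clips:
        # Condition generation on the provided first frame and trajectories.
        video = generate(clip["first_frame"], clip["trajectories"])
        for name, fn in metric_fns.items():
            totals[name] += fn(video, clip)
    # Report the mean of each metric over all clips.
    return {name: total / len(clips) for name, total in totals.items()}

# Dummy model and metric to illustrate the call pattern.
clips = [{"first_frame": None, "trajectories": None}] * 4
scores = evaluate_model(
    generate=lambda frame, traj: "video",
    clips=clips,
    metric_fns={"epe": lambda video, clip: 2.5},
)
print(scores)  # {'epe': 2.5}
```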
4. Quantitative Evaluation Metrics
MoveBench adopts established and specialized metrics to capture model performance across both perceptual fidelity and motion controllability axes:
Visual Fidelity and Temporal Consistency:
- Fréchet Inception Distance (FID): distributional distance between Inception features of generated and reference frames (lower is better).
- Fréchet Video Distance (FVD) [Unterthiner et al., ICLR’19]: the same distributional distance computed on spatiotemporal video features, so it also reflects temporal coherence (lower is better).
- Peak Signal-to-Noise Ratio (PSNR): per-frame pixel-level reconstruction fidelity against the reference video (higher is better).
- Structural Similarity Index (SSIM) [Wang et al., TIP’04]: perceptual similarity based on local luminance, contrast, and structure (higher is better).
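For reference, FID (and FVD, on video features) compares Gaussian fits of the real and generated feature distributions via the standard closed form:

$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$

where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of features extracted from real and generated content, respectively.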
Motion Accuracy:
- End-Point Error (EPE): mean Euclidean distance between ground-truth and estimated point tracks,

  $\mathrm{EPE} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left\lVert \hat{p}_{i,t} - p_{i,t} \right\rVert_2,$

  where $p_{i,t}$ are ground-truth point tracks and $\hat{p}_{i,t}$ are estimated from the generated video frames.
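EPE can be computed directly from aligned track arrays; in this minimal sketch the `(T, N, 2)` layout (frames, points, xy) is an assumption:

```python
import numpy as np

def end_point_error(gt_tracks: np.ndarray, est_tracks: np.ndarray) -> float:
    """Mean Euclidean distance between ground-truth and estimated 2D
    point tracks, averaged over all frames and points (minimal EPE
    sketch; the (T, N, 2) array layout is an assumption)."""
    assert gt_tracks.shape == est_tracks.shape
    # Per-point, per-frame L2 distance, then a global mean.
    return float(np.linalg.norm(gt_tracks - est_tracks, axis=-1).mean())

# Every tracked point displaced by (3, 3) pixels -> EPE = sqrt(18).
gt = np.zeros((81, 64, 2))
est = np.full((81, 64, 2), 3.0)
print(round(end_point_error(gt, est), 3))  # 4.243
```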
Human Study Protocols: Human 2AFC studies on motion accuracy, motion quality, and visual quality complement the above quantitative metrics.
5. Benchmark Results and Baseline Comparisons
Benchmarking on MoveBench yields the following comparative results (lower FID, FVD, EPE; higher PSNR, SSIM are preferable):
Table 1: Single-object results (entire MoveBench)
| Method | FID | FVD | PSNR | SSIM | EPE |
|---|---|---|---|---|---|
| ImageConductor | 34.5 | 424.0 | 13.4 | 0.49 | 15.6 |
| LeviTor | 18.1 | 98.8 | 15.6 | 0.54 | 3.4 |
| Tora | 22.5 | 100.4 | 15.7 | 0.55 | 3.3 |
| MagicMotion | 17.5 | 96.7 | 14.9 | 0.56 | 3.2 |
| Wan-Move | 12.2 | 83.5 | 17.8 | 0.64 | 2.6 |
Table 2: Multi-object subset (192 videos)
| Method | FID | FVD | PSNR | SSIM | EPE |
|---|---|---|---|---|---|
| ImageConductor | 77.5 | 764.5 | 13.9 | 0.51 | 9.8 |
| Tora | 53.2 | 350.0 | 14.5 | 0.54 | 3.5 |
| Wan-Move | 28.8 | 226.3 | 16.7 | 0.62 | 2.2 |
Table 3: Human 2AFC win rates (percentage of comparisons in which Wan-Move is preferred over each listed method)
| Method | Motion Acc. | Motion Qual. | Visual Qual. |
|---|---|---|---|
| LeviTor | 98.2% | 98.0% | 98.8% |
| Tora | 96.2% | 93.8% | 98.4% |
| MagicMotion | 89.4% | 96.4% | 98.2% |
| Kling 1.5 Pro | 47.8% | 53.4% | 50.2% |
| Wan-Move | — | — | — |
MoveBench enables fine-grained and statistically rigorous comparison of generative models with respect to both visual realism and explicit motion controllability (Chu et al., 9 Dec 2025).
6. Usage, Extensibility, and Reporting
To use MoveBench:
- Download the full benchmark suite (videos and annotations).
- For each tested model, generate videos at the required resolution and duration, conditioned on provided masks or point trajectories.
- Report FID, FVD, PSNR, SSIM, and EPE on the 1,018-clip test set and the 192-video multi-object subset.
- Optionally extend the analysis with human 2AFC studies on motion and visual quality.
- For custom motion control formats (e.g., skeletal, flow), convert annotation to the provided mask/trajectory types.
- Use the 54-category taxonomy to analyze and report per-category performance, highlighting domain-specific strengths and weaknesses and ensuring balanced coverage.
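Per-category breakdowns over the 54-category taxonomy reduce to a grouping-and-averaging pass; the `(category, metric_value)` record format here is a hypothetical assumption about how per-clip results are stored:

```python
from collections import defaultdict

def per_category_report(results):
    """Average a per-clip metric by content category (hypothetical
    record format: each result is a (category, metric_value) pair)."""
    buckets = defaultdict(list)
    for category, value in results:
        buckets[category].append(value)
    # Sort categories for stable, comparable reporting.
    return {c: sum(v) / len(v) for c, v in sorted(buckets.items())}

report = per_category_report([
    ("animals", 2.0), ("animals", 4.0), ("vehicles", 1.0),
])
print(report)  # {'animals': 3.0, 'vehicles': 1.0}
```

Reporting per-category means alongside the global average surfaces domain-specific weaknesses that the 1,018-clip aggregate can mask.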
7. Broader Context and Impact
MoveBench represents a substantial advancement in the benchmarking of motion-controllable video generation. By providing dense, dual-modality motion annotation and comprehensive scenario diversity at scale, it enables robust, reproducible, and fine-grained assessment across a wide spectrum of controllable generation tasks, facilitating accelerated progress in precision video synthesis research (Chu et al., 9 Dec 2025). Its availability under a free license and the inclusion of scalable, high-quality annotation workflows (e.g., hybrid human+SAM, CoTracker) set a new standard for the field.
A plausible implication is that MoveBench’s dual-mode annotation (trajectories and masks) may support cross-comparison of both dense and sparse motion control paradigms within a single unified protocol, supporting broader generalization claims in future model developments.