MotionEdit-Bench: Motion Editing Benchmark
- MotionEdit-Bench is a benchmark that targets motion-centric transformations, challenging models to produce physically plausible and kinematically faithful edits.
- It unifies video and image editing evaluation by categorizing tasks into spatial and motion-specific edits, supported by manually annotated datasets and diverse editing scenarios.
- The benchmark employs both automatic and human evaluation metrics, revealing that explicit motion conditioning and modular pipelines significantly improve temporal edit success.
MotionEdit-Bench is a rigorously designed benchmark for evaluating motion-centric edit capabilities in video and image editing models. Unlike prior benchmarks focused predominantly on spatial edits (style transfer, appearance, background swaps, object manipulation), MotionEdit-Bench emphasizes transformations requiring explicit change of motion, pose, or interaction—challenging models to produce physically plausible, kinematically faithful edits while preserving subject and background integrity. The benchmark has been independently introduced for both video editing (Yan et al., 2023) and motion-centric image editing (Wan et al., 11 Dec 2025), providing broad coverage and technical depth for research in end-to-end generative editing systems.
1. Motivation and Scope
MotionEdit-Bench was created to address the limitations of existing evaluation protocols that predominantly stress spatial edits, neglecting the domain of motion manipulations—such as changing trajectories, altering pose or action, or reconfiguring interactions between entities. This benchmark unifies spatial and motion editing tasks under a single framework, enabling standardized and comparative assessment of advanced editing systems. In video editing, it includes diverse tasks such as object replacement, background modification, style transfer, and temporal motion edits. In image editing, it precisely targets modifications to subject actions and interactions, requiring models to infer and synthesize plausible motion transformations while retaining appearance and physical realism. The practical significance extends to downstream tasks in animation, frame-controlled video synthesis, and automated post-production.
2. Dataset Construction and Task Taxonomy
Video Editing (MotionEdit-Bench, Yan et al., 2023)
The benchmark consists of 271 editing tasks applied to 81 unique videos from three sources:
- LOVEU-TGVE: 35 videos, 140 edits
- Dreamix authors’ set: 9 videos, 14 edits
- Custom YouTube-8M subset: 37 videos, 117 edits, curated for motion and multi-motion edits
Tasks are divided into six categories:
- Style edits (appearance/art style change)
- Background edits (environment manipulation)
- Object edits (add/remove/swap)
- Motion edits (temporal dynamics)
- Multi-spatial edits (compound spatial changes)
- Multi-motion edits (compound spatial + motion changes)
Each (video, prompt) pair is manually aligned and labeled. The benchmark is strictly for held-out evaluation; no train/validation split is defined. Each method, trained on its own corpus, is scored on all 271 tasks.
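As an illustration of this protocol, the sketch below organizes tasks by category and scores a single method across all 271 held-out tasks; the EditTask record, category labels, and function names are hypothetical conveniences, not the authors' released tooling.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean

class EditCategory(Enum):
    STYLE = "style"
    BACKGROUND = "background"
    OBJECT = "object"
    MOTION = "motion"
    MULTI_SPATIAL = "multi_spatial"
    MULTI_MOTION = "multi_motion"

@dataclass
class EditTask:
    video_path: str         # source video (one of the 81 unique clips)
    prompt: str             # manually aligned edit instruction
    category: EditCategory  # one of the six task categories

def evaluate_model(edit_fn, score_fn, tasks):
    """Score one editing method on every held-out task.

    edit_fn(video_path, prompt) -> path to the edited video
    score_fn(source, edited, prompt) -> float (e.g., an automatic metric)
    Returns per-category mean scores.
    """
    per_category = {c: [] for c in EditCategory}
    for task in tasks:
        edited = edit_fn(task.video_path, task.prompt)
        per_category[task.category].append(
            score_fn(task.video_path, edited, task.prompt))
    return {c.value: mean(scores) for c, scores in per_category.items() if scores}
```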
Image Editing (MotionEdit-Bench, Wan et al., 11 Dec 2025)
The image-editing dataset is built from ShareVeo3 and the KlingAI Video Dataset by mining high-resolution video sequences for frame pairs that exhibit nontrivial action changes. Selection criteria include:
- Setting consistency
- Motion change
- Subject integrity
A frozen Gemini MLLM assesses candidate pairs; successful examples are paired with natural language instructions specifying the edit. The dataset comprises 10,157 triplets (input image, edit instruction, ground-truth image), with a 9,142/1,015 train/test split. Motions are manually categorized:
- Pose/posture changes
- Locomotion/distance shifts
- Object state/form changes
- Orientation/viewpoint alterations
- Subject–object interactions
- Inter-subject interactions
Motion magnitude analysis using optical flow reveals that average edit displacement is substantially greater than previous motion editing datasets (∼3× larger than UltraEdit, ∼6× larger than MagicBrush).
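This displacement analysis can be approximated with dense optical flow. The sketch below uses OpenCV's Farnebäck estimator to compute the mean per-pixel displacement of a candidate frame pair; it is an assumed reimplementation of the filtering idea, and the 5-pixel threshold as well as the pair dictionary keys are purely illustrative.

```python
import cv2
import numpy as np

def mean_flow_magnitude(img_before_path: str, img_after_path: str) -> float:
    """Average per-pixel displacement (in pixels) between two images."""
    before = cv2.cvtColor(cv2.imread(img_before_path), cv2.COLOR_BGR2GRAY)
    after = cv2.cvtColor(cv2.imread(img_after_path), cv2.COLOR_BGR2GRAY)
    # Dense Farnebäck flow: shape (H, W, 2) with per-pixel (dx, dy).
    flow = cv2.calcOpticalFlowFarneback(before, after, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=-1)
    return float(magnitude.mean())

def is_nontrivial(pair: dict, threshold_px: float = 5.0) -> bool:
    """Keep only pairs whose average displacement exceeds a (hypothetical)
    threshold, mirroring the 'nontrivial motion change' criterion."""
    return mean_flow_magnitude(pair["input"], pair["target"]) > threshold_px
```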
3. Evaluation Metrics
MotionEdit-Bench employs complementary automatic and preference-based metrics:
Video Editing Metrics (Yan et al., 2023)
- Video–Video Similarity (M_sim): Cosine similarity between VideoCLIP embeddings of the source and edited videos.
- Text–Video Directional Similarity (M_dir): Cosine similarity between the direction of change in the caption embeddings (source to edited) and the direction of change in the video embeddings (source to edited).
- Geometric Average (M_geo): Combines fidelity and directional accuracy as the geometric mean M_geo = √(M_sim · M_dir).
- No separate explicit temporal-consistency metrics; consistency is judged in human evaluations.
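Given precomputed VideoCLIP and text embeddings, the three automatic scores reduce to simple vector arithmetic. The sketch below assumes the standard CLIP-style directional formulation and the geometric-mean combination named above; it is not the authors' exact implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def video_edit_scores(v_src, v_edit, t_src, t_edit):
    """Automatic scores from precomputed embeddings.

    v_src, v_edit : VideoCLIP embeddings of source and edited videos
    t_src, t_edit : text embeddings of source and edited captions
    """
    m_sim = cosine(v_src, v_edit)                       # video-video similarity
    m_dir = cosine(v_edit - v_src, t_edit - t_src)      # directional similarity
    m_geo = np.sqrt(max(m_sim, 0.0) * max(m_dir, 0.0))  # geometric average
    return {"M_sim": m_sim, "M_dir": m_dir, "M_geo": m_geo}
```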
Image Editing Metrics (Wan et al., 11 Dec 2025)
- Generative (MLLM-based): A Gemini MLLM rates fidelity, preservation, and coherence, each on a 0–5 scale, aggregated into an Overall Score.
- Fréchet Inception Distance (FID): Quantifies the feature-distribution distance between edited and ground-truth images.
- Motion Alignment Score (MAS): Directly compares the pixelwise optical flow of the generated edit with that of the ground-truth edit, combining magnitude, direction, and anti-collapse terms. Higher MAS indicates closer kinematic fidelity.
- Preference Win-Rate: Fraction of samples where one model's output is preferred by raters or the MLLM.
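A simplified, flow-based alignment score in the spirit of MAS is sketched below. The exact weighting of the magnitude, direction, and anti-collapse terms is not reproduced here, so the combination and the 0–100 scaling are illustrative assumptions.

```python
import numpy as np

def motion_alignment_sketch(flow_gen: np.ndarray, flow_gt: np.ndarray,
                            eps: float = 1e-6) -> float:
    """Toy MAS-like score from dense flows of shape (H, W, 2).

    Combines (i) magnitude agreement, (ii) direction agreement, and
    (iii) an anti-collapse penalty discouraging near-zero generated motion.
    Weights and functional forms are illustrative, not the published MAS.
    """
    mag_gen = np.linalg.norm(flow_gen, axis=-1)
    mag_gt = np.linalg.norm(flow_gt, axis=-1)

    # Per-pixel magnitude agreement in [0, 1].
    magnitude_term = 1.0 - np.abs(mag_gen - mag_gt) / (np.maximum(mag_gen, mag_gt) + eps)
    # Per-pixel cosine between flow vectors, mapped from [-1, 1] to [0, 1].
    direction_term = ((flow_gen * flow_gt).sum(-1) / (mag_gen * mag_gt + eps) + 1.0) / 2.0
    # Penalize collapsed (near-static) generations.
    anti_collapse = min(mag_gen.mean() / (mag_gt.mean() + eps), 1.0)

    return 100.0 * anti_collapse * float((magnitude_term * direction_term).mean())
```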
4. Comparative Results and Scientific Insights
Video Editing (Yan et al., 2023)
MotionEdit-Bench enables head-to-head comparison of MoCA against six baseline video editing models: Dreamix, Tune-a-Video, MasaCtrl, TokenFlow, Gen-1, and VideoComposer. Human preference win-rates show that MoCA achieves the highest overall and per-category scores, especially on motion edits (81–99% win, compared with 40–86% for baselines). Automatic CLIP-based scores corroborate the human judgments: baselines span roughly 0.090–0.128 on the directional metric and 0.231–0.278 on the geometric metric, with MoCA scoring above both ranges.
A critical insight is that explicit motion conditioning and a modular image-edit–motion-animation pipeline substantially improve edit success for temporal tasks. Automatic metrics correlate weakly (Spearman ~0.20) with human judgment, particularly for motion aspects, emphasizing continued reliance on human studies.
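The reported weak correlation can be checked directly when per-task automatic scores and human preference rates are available; the sketch below assumes they are stored as parallel arrays.

```python
from scipy.stats import spearmanr

def metric_human_agreement(auto_scores, human_prefs):
    """Spearman rank correlation between an automatic metric and human preference.

    auto_scores : per-task automatic metric values (e.g., M_geo)
    human_prefs : per-task human preference rates for the same model's edits
    """
    rho, p_value = spearmanr(auto_scores, human_prefs)
    return rho, p_value

# Hypothetical usage: a rho near 0.2 would reproduce the weak correlation
# reported for motion edits.
# rho, p = metric_human_agreement(per_task_m_geo, per_task_win_rates)
```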
Image Editing (Wan et al., 11 Dec 2025)
Diffusion-based appearance editors underperform, with Overall ≈ 1–2, MAS < 50, and win rates below 30%. Advanced systems (Step1X-Edit, BAGEL, UniWorld-V1, FLUX.1, Qwen-Image-Edit) reach Overall ≈ 4–4.6, MAS ≈ 53–56, and win rates of roughly 58–73%. Systematic errors include "editing inertia," incorrect limb movement, structural distortion under large edits, and inconsistent scene preservation.
Introduction of MotionNFT, a post-training framework blending optical-flow kinematic rewards with semantic scoring, yields quantifiable gains in MAS, Overall, and win rate (e.g., FLUX.1's win rate rises by 12.4 points to 65.16%). This suggests that incorporating explicit motion-fidelity rewards is vital for accurate image-edit synthesis when motion cues are carried only by the instruction.
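A minimal sketch of blending a kinematic (optical-flow) reward with a semantic score during post-training is shown below; the function name, normalization, and 0.5 weighting are assumptions and do not reproduce the MotionNFT implementation.

```python
def blended_reward(flow_reward: float, semantic_score: float,
                   kinematic_weight: float = 0.5) -> float:
    """Combine a motion-fidelity reward with a semantic/MLLM score.

    flow_reward      : alignment between generated and target optical flow, in [0, 1]
    semantic_score   : MLLM rating rescaled to [0, 1] (e.g., Overall / 5)
    kinematic_weight : trade-off between the two terms; the actual MotionNFT
                       weighting is not given here, so 0.5 is purely illustrative.
    """
    return kinematic_weight * flow_reward + (1.0 - kinematic_weight) * semantic_score
```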
5. Benchmark Protocols and Annotation Practices
Evaluation protocols are strictly standardized:
- Video: All models are assessed on the same held-out set of 271 tasks with paired human and automatic evaluation. Human raters on Amazon Mechanical Turk select preferred edits and tag reasons (prompt alignment, consistency, or both).
- Image: Test examples present single-frame, instruction-driven edit challenges. Metrics and comparisons are stratified by motion type and scene context.
All (video, prompt) pairs are manually annotated for alignment and edit category. Image edit datasets require MLLM-driven instruction generation and filtering for nontrivial, physically plausible motion.
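For the stratified comparisons mentioned above, win rates can be aggregated per motion type with a few lines; the record schema ("motion_type", "win") below is hypothetical.

```python
from collections import defaultdict

def win_rate_by_motion_type(records):
    """records: iterable of dicts with keys 'motion_type' and 'win' (bool),
    one per pairwise preference judgment (assumed schema)."""
    wins, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["motion_type"]] += 1
        wins[r["motion_type"]] += int(r["win"])
    return {t: wins[t] / totals[t] for t in totals}
```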
6. Limitations and Scientific Lessons
MotionEdit-Bench reveals persistent challenges:
- Most video editing models over-anchor to original motion, limiting successful trajectory edits.
- Image editing models typically lack explicit mechanisms for kinematic consistency, leading to misaligned or unrealistic edits.
- Human assessments remain essential; automatic metrics provide incomplete proxies for edit quality, particularly regarding physical plausibility and complex spatial-temporal dynamics.
A plausible implication is that future progress in motion-centric editing will depend on richer pretraining corpora, improved reward shaping for temporal fidelity, and modular architectures capable of disentangling appearance from physical dynamics.
7. Impact and Research Significance
MotionEdit-Bench establishes a foundational standard for interdisciplinary research in video and motion-centric image editing. By centering the evaluation on tasks that integrate semantic and physical transformations—while carefully diagnosing metrics and failure modes—it enables reproducible, discriminative benchmarking. Its adoption in both video and image editing research enforces methodological rigor and informs the development of advanced editing systems capable of synthesizing motion-aware, physically coherent outputs (Yan et al., 2023, Wan et al., 11 Dec 2025).