MotionV2V: Precise Video Motion Editing
- MotionV2V is a framework for precise video motion editing that directly manipulates sparse point trajectories for controllable, localized modifications.
- It introduces motion counterfactuals and leverages a motion-conditioned diffusion network to alter videos while preserving overall appearance.
- Experimental results show superior content preservation, motion fidelity, and user preference compared to leading baseline methods.
MotionV2V is a framework for precise and general video motion editing, grounded in the direct manipulation of sparse motion trajectories. Unlike prior text-to-video and image animation models, MotionV2V formulates video motion editing as the task of altering explicit trajectories within existing videos, enabling controllable and localized modifications that propagate naturally from arbitrary timestamps. It introduces the concept of "motion counterfactuals," pairing the original video with synthetically altered motion while retaining appearance, and employs a motion-conditioned diffusion network to synthesize realistic edited outputs. Extensive experimental evaluation demonstrates its superiority relative to contemporary baselines in fidelity, motion following, and overall user preference (Burgert et al., 25 Nov 2025).
1. Representation of Trajectories and Motion Edits
MotionV2V represents motion using sparse point trajectories tracked through video clips. For point $i$ at frame $t$:
- The input position is $p_{i,t} \in \mathbb{R}^2$.
- Trajectories are aggregated as $\mathcal{T} = \{p_{i,t}\}$ for the input and $\mathcal{T}' = \{p'_{i,t}\}$ for the edited (target) motion.
The "motion edit" is defined as the per-point, per-frame deviation:
$\Delta_{i,t} = p'_{i,t} - p_{i,t}$, or $\Delta = \mathcal{T}' - \mathcal{T}$ in matrix form.
Regularization and constraints during training and inference include:
- Dropout on trajectory channels: low for conditioning tracks (from counterfactuals) and higher for target tracks, forcing generalization to incomplete motion signals.
- Inference-time random jitter of a few pixels, added independently to each target track point at each frame to discourage identity copying.
- A cap on the number of tracked points at inference, since conditioning on too many points impairs adherence to the specified edits (a minimal sketch of these operations follows this list).
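A minimal NumPy sketch of this trajectory bookkeeping is given below: computing the per-point, per-frame motion edit, adding inference-time jitter, and capping the number of conditioning points. The array shapes, the jitter magnitude `sigma_px`, and the `max_points` cap are illustrative assumptions rather than the paper's exact values.

```python
import numpy as np

def motion_edit(tracks_in, tracks_edit):
    """Per-point, per-frame deviation Delta = T' - T.
    Both arrays have shape (num_points, num_frames, 2) holding (x, y) pixel positions."""
    assert tracks_in.shape == tracks_edit.shape
    return tracks_edit - tracks_in

def jitter_tracks(tracks, sigma_px=1.0, rng=None):
    """Add small random per-point, per-frame jitter (in pixels) at inference time
    to discourage the model from copying the conditioning video verbatim."""
    rng = rng or np.random.default_rng()
    return tracks + rng.normal(scale=sigma_px, size=tracks.shape)

def cap_points(tracks, max_points=32, rng=None):
    """Keep at most `max_points` trajectories; too many points weakens edit adherence."""
    rng = rng or np.random.default_rng()
    n = tracks.shape[0]
    if n <= max_points:
        return tracks
    keep = rng.choice(n, size=max_points, replace=False)
    return tracks[keep]

# Example: 12 points tracked over 49 frames at 720x480.
T = np.random.rand(12, 49, 2) * np.array([720.0, 480.0])   # original trajectories
T_edit = T + np.linspace(0.0, 40.0, 49)[None, :, None]     # user-specified drift edit
delta = motion_edit(T, T_edit)                              # (12, 49, 2) deviations
T_cond = cap_points(jitter_tracks(T_edit))                  # conditioning tracks
```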
2. Synthesis of Motion Counterfactuals
To enable supervised learning of motion-conditioned generation, MotionV2V constructs a counterfactual training dataset of appearance-consistent yet motion-diverse video pairs. For a raw input video $V_{\text{raw}}$ of length $N$ frames and a clip length of $F$ frames, the pipeline is as follows:
- Sample a clip start index $t_0$.
- Extract the target clip $V = V_{\text{raw}}[t_0 : t_0 + F]$.
- For the counterfactual $\tilde{V}$, sample two frame indices $t_a < t_b$ within the clip; then choose either:
  - Frame interpolation: synthesize the intermediate frames via a video diffusion model conditioned on $V[t_a]$, $V[t_b]$, and a text prompt (e.g., "make the person twirl").
  - Temporal resampling: uniformly resample $F$ frames from the clip (potentially reversed).
- Randomly sample a set of initial points to track.
- Run the TAPNext tracker on $V$ and $\tilde{V}$ to produce tracks $\mathcal{T}$ and $\tilde{\mathcal{T}}$, respectively.
- Apply consistent random spatial augmentations (sliding crops, rotations, and scale changes) to both videos and their tracks.
- Rasterize the tracks into $F \times 3 \times H \times W$ stacks of colored Gaussian blobs, assigning a distinct color to each point.
Training pairs are constructed this way on the fly from a corpus of raw videos, using a fixed clip length $F$ and the $480 \times 720$ input resolution described in Section 3.
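The sketch below condenses this pipeline into code. Here `interpolate_frames` and `tapnext_track` are stand-ins for the diffusion-based frame interpolator and the TAPNext tracker, the spatial augmentations are omitted for brevity, and all numeric choices (blob width, number of query points, clip length) are illustrative assumptions.

```python
import numpy as np

def rasterize_tracks(tracks, H=480, W=720, sigma=3.0):
    """Render tracks of shape (P, F, 2) into an (F, 3, H, W) stack of colored
    Gaussian blobs, one distinct color per point."""
    P, F, _ = tracks.shape
    colors = np.random.default_rng(0).uniform(0.2, 1.0, size=(P, 3))
    ys, xs = np.mgrid[0:H, 0:W]
    out = np.zeros((F, 3, H, W), dtype=np.float32)
    for f in range(F):
        for p in range(P):
            x, y = tracks[p, f]
            blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
            out[f] += colors[p][:, None, None] * blob
    return out.clip(0.0, 1.0)

def make_training_pair(raw_video, interpolate_frames, tapnext_track, F=49, rng=None):
    """Build one (counterfactual, target) pair plus rasterized track videos."""
    rng = rng or np.random.default_rng()
    N = raw_video.shape[0]
    t0 = rng.integers(0, N - F + 1)                   # clip start
    target = raw_video[t0:t0 + F]                     # real clip is the target video
    ta, tb = sorted(rng.choice(F, size=2, replace=False))
    if rng.random() < 0.5:
        # Mode 1: frame interpolation between two sampled frames, guided by a prompt.
        counterfactual = interpolate_frames(target[ta], target[tb],
                                            prompt="make the person twirl",
                                            n_frames=F)
    else:
        # Mode 2: temporal resampling of a sub-range (possibly reversed).
        idx = np.linspace(ta, tb, F).round().astype(int)
        if rng.random() < 0.5:
            idx = idx[::-1]
        counterfactual = target[idx]
    pts0 = rng.uniform([0.0, 0.0], [720.0, 480.0], size=(32, 2))  # initial query points
    tracks_tgt = tapnext_track(target, pts0)          # (P, F, 2) tracks of the target
    tracks_cf = tapnext_track(counterfactual, pts0)   # (P, F, 2) tracks of the counterfactual
    return counterfactual, target, rasterize_tracks(tracks_cf), rasterize_tracks(tracks_tgt)
```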
3. Motion-Conditioned Diffusion Network
MotionV2V builds on the CogVideoX-5B DiT (Diffusion Transformer) text-to-video backbone and augments it with a ControlNet-inspired branch to integrate motion cues.
Conditioning Inputs (in latent space)
- Noisy video: the noised latent $z_\tau$ of the target clip $V$.
- Counterfactual video: the latent of $\tilde{V}$.
- Counterfactual tracks: the latent of the rasterized tracks $\tilde{\mathcal{T}}$.
- Target tracks: the latent of the rasterized target-motion tracks ($\mathcal{T}$ during training; the user-edited trajectories $\mathcal{T}'$ at inference).
- Optional text prompt (omitted from the loss expression below).
Preprocessing
- All RGB videos ($F \times 480 \times 720$) are encoded via a 3D causal VAE into latents with spatial resolution $60 \times 90$ (an $8\times$ spatial downsampling); the frame dimension is also compressed temporally.
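As a quick shape check, assuming the usual CogVideoX VAE compression factors ($8\times$ spatial, $4\times$ temporal with the first frame kept, 16 latent channels), which are assumptions rather than values stated in the source:

```python
def latent_shape(num_frames, height=480, width=720,
                 spatial_factor=8, temporal_factor=4, latent_channels=16):
    """Latent shape produced by a 3D causal VAE under the assumed compression factors."""
    f_lat = 1 + (num_frames - 1) // temporal_factor   # causal VAE keeps the first frame
    return (f_lat, latent_channels, height // spatial_factor, width // spatial_factor)

print(latent_shape(49))   # -> (13, 16, 60, 90)
```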
Control Branch
- The first 18 DiT transformer blocks are duplicated for the control branch.
- The three conditioning videos are patchified into 48 spatiotemporal channels.
- In each block, control tokens are processed via zero-initialized channelwise MLPs, then added to the main branch tokens, as in ControlNet.
- The DiT weights remain frozen; only the control branch is trained.
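A PyTorch-style sketch of this wiring is given below. Module names, hidden width, and the simplified block interface are illustrative assumptions; only the structure, a frozen backbone, a trainable copy of the first blocks, and zero-initialized channelwise MLPs whose output is added to the main-branch tokens, reflects the description above.

```python
import copy
import torch.nn as nn

class ZeroMLP(nn.Module):
    """Channelwise MLP whose output projection starts at zero, so the control
    branch contributes nothing at initialization (ControlNet-style)."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim, dim)
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

class ControlledDiT(nn.Module):
    """A frozen stack of DiT blocks plus a trainable copy of the first blocks
    that processes motion-conditioning tokens and injects them into the main stream."""
    def __init__(self, main_blocks, n_control=18, dim=3072):
        super().__init__()
        self.main_blocks = main_blocks
        for p in self.main_blocks.parameters():       # backbone stays frozen
            p.requires_grad_(False)
        self.control_blocks = nn.ModuleList(copy.deepcopy(list(main_blocks[:n_control])))
        self.inject = nn.ModuleList(ZeroMLP(dim) for _ in range(n_control))

    def forward(self, x, cond_tokens):
        # cond_tokens: patchified latents of the counterfactual video and both track videos.
        c = cond_tokens
        for i, block in enumerate(self.main_blocks):
            if i < len(self.control_blocks):
                c = self.control_blocks[i](c)
                x = x + self.inject[i](c)             # zero-init => no effect at start of training
            x = block(x)
        return x
```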
Diffusion Loss
Let $\epsilon \sim \mathcal{N}(0, I)$, let $\tau$ be the diffusion timestep, let $z_0$ be the target latent (the VAE encoding of $V$), and let $z_\tau = \alpha_\tau z_0 + \sigma_\tau \epsilon$ be its noised version. Training minimizes the standard conditional denoising objective, written here in noise-prediction form: $\mathbb{E}_{z_0, \epsilon, \tau}\big[\,\|\hat{\epsilon}_\theta(z_\tau, \tau, c) - \epsilon\|_2^2\,\big]$, where $c$ collects the conditioning latents (counterfactual video, counterfactual tracks, target tracks) and the optional text prompt.
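Under these definitions, a single training step might look like the following sketch; the VAE encoder `vae_encode`, the conditioned denoiser `model`, and the cosine schedule are stand-ins, and the actual noise schedule and prediction target follow CogVideoX.

```python
import torch
import torch.nn.functional as F

def cosine_schedule(tau, ndim, num_timesteps=1000):
    """Illustrative alpha/sigma schedule; the paper inherits CogVideoX's schedule."""
    t = (tau.float() + 0.5) / num_timesteps
    alpha = torch.cos(t * torch.pi / 2)
    sigma = torch.sin(t * torch.pi / 2)
    shape = (-1,) + (1,) * (ndim - 1)
    return alpha.view(shape), sigma.view(shape)

def training_step(model, vae_encode, batch, num_timesteps=1000):
    """One denoising-loss step: noise the target latent and regress the noise,
    conditioned on the counterfactual video and both track rasterizations."""
    z0 = vae_encode(batch["target_video"])                        # clean target latent
    cond = {
        "counterfactual_video": vae_encode(batch["counterfactual_video"]),
        "counterfactual_tracks": vae_encode(batch["counterfactual_tracks"]),
        "target_tracks": vae_encode(batch["target_tracks"]),
        "prompt": batch.get("prompt", ""),
    }
    tau = torch.randint(0, num_timesteps, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    alpha, sigma = cosine_schedule(tau, z0.ndim, num_timesteps)
    z_tau = alpha * z0 + sigma * eps                              # noised latent
    eps_hat = model(z_tau, tau, cond)                             # conditioned denoiser
    return F.mse_loss(eps_hat, eps)
```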
4. Training Regimen and Inference Constraints
- Hardware: 8×NVIDIA H100 GPUs, training time ≈1 week.
- Optimizer: Adam; batch size 32; 15,000 total iterations.
- The noise schedule follows standard latent diffusion, as in CogVideoX.
- For data variety, each sample randomizes the edit start and chooses either interpolation or resampling mode.
- Target track dropouts are higher to encourage generalization; inference-time jitter avoids degenerate identity copying.
- At inference, the number of conditioning track points is kept small (see Section 1) for optimal edit fidelity.
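For reference, the regimen above can be collected into a single configuration dictionary; the field names are illustrative, and the learning rate is omitted because its value is not given above.

```python
training_config = {
    "hardware": "8x NVIDIA H100",
    "train_time": "~1 week",
    "optimizer": "Adam",
    "batch_size": 32,
    "total_iterations": 15_000,
    "noise_schedule": "standard latent diffusion (as in CogVideoX)",
    "edit_start": "randomized per sample",
    "counterfactual_modes": ["frame_interpolation", "temporal_resampling"],
    "target_track_dropout": "high (encourages generalization)",
    "inference_track_jitter": "small random per-point, per-frame perturbation",
    "inference_point_budget": "capped (see Section 1)",
}
```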
5. Experimental Results
A. User Study (4-way, head-to-head)
- 20 diverse videos tested (object motion, camera, time, mid-stream edits).
- 41 participants.
- Baselines: ATI (WAN2.1-based), ReVideo, Go-with-the-Flow.
- Evaluation criteria: content preservation (Q1), motion fidelity (Q2), overall quality (Q3).
| Method | Q1 (Content) | Q2 (Motion) | Q3 (Overall) |
|---|---|---|---|
| Ours | 70% | 71% | 69% |
| ATI | 24% | 24% | 25% |
| ReVideo | 1% | 2% | 1% |
| GWTF | 5% | 3% | 5% |
B. Quantitative Photometric Error
- Test set: 100 videos, split at midpoint, second half reversed for comparison.
- Metrics: framewise L₂, SSIM, and LPIPS against the ground truth.
| Method | L₂ (↓) | SSIM (↑) | LPIPS (↓) |
|---|---|---|---|
| Ours | 0.024 | 0.098 | 0.031 |
| ATI | 0.038 | 0.094 | 0.072 |
| Go-with-the-Flow | 0.067 | 0.089 | 0.088 |
| ReVideo | 0.096 | 0.080 | 0.106 |
C. Qualitative and Ablation Studies
- Maintains object appearance and follows user-specified, even arbitrary, motion trajectories across eight challenging edit scenarios.
- Baseline I2V (image-to-video) methods frequently hallucinate missing content or replicate undesired background elements.
Key ablations:
- Inference jitter prevents degenerate copying (e.g., averting unwanted "cloning" in repetitive actions).
- High target-track dropout enhances generalization.
- Overly dense points reduce edit fidelity during inference.
6. Comparative Analysis
MotionV2V demonstrates systematic advantages over ATI, ReVideo, and Go-with-the-Flow baselines in:
- Content preservation (appearance, background, and spatial layout).
- Motion following along user-defined trajectories.
- Overall user preference by substantial margins in forced-choice studies.
Significant contextual outcomes:
- Enables edits from any temporal anchor and supports diverse edits such as mid-stream trajectory changes, off-frame object reentrance, and time manipulation.
- Maintains high fidelity in appearance, attributed to the use of a strong appearance prior (CogVideoX-5B) and explicit trajectory control.
7. Limitations and Prospects
Observed limitations include:
- Subject drift in extremely long or complex sequential edits, attributed to limitations of the underlying foundation model.
- Reliance on diffusion-generated counterfactuals as opposed to idealized motion ground truth.
Prospective directions include:
- Leveraging synthetic 3D datasets with known true motion.
- Reduction in necessary user-provided or automatically detected control points.
- Advances in VAE and denoising backbones to enable iterative edits without accumulating drift.
A plausible implication is that future work on trajectory representations and model architectures may substantially broaden the operational scope and robustness of trajectory-based video editing frameworks (Burgert et al., 25 Nov 2025).