
MotionV2V: Precise Video Motion Editing

Updated 9 December 2025
  • MotionV2V is a framework for precise video motion editing that directly manipulates sparse point trajectories for controllable, localized modifications.
  • It introduces motion counterfactuals and leverages a motion-conditioned diffusion network to alter videos while preserving overall appearance.
  • Experimental results show superior content preservation, motion fidelity, and user preference compared to leading baseline methods.

MotionV2V is a framework for precise and general video motion editing, grounded in the direct manipulation of sparse motion trajectories. Unlike prior text-to-video and image animation models, MotionV2V formulates video motion editing as the task of altering explicit trajectories within existing videos, enabling controllable and localized modifications that propagate naturally from arbitrary timestamps. It introduces the concept of "motion counterfactuals," pairing the original video with synthetically altered motion while retaining appearance, and employs a motion-conditioned diffusion network to synthesize realistic edited outputs. Extensive experimental evaluation demonstrates its superiority relative to contemporary baselines in fidelity, motion following, and overall user preference (Burgert et al., 25 Nov 2025).

1. Representation of Trajectories and Motion Edits

MotionV2V represents motion using sparse point trajectories tracked through video clips. For point $i \in \{1, \ldots, N\}$ at frame $t \in \{1, \ldots, F\}$:

  • The input position is $\mathbf{x}_\mathrm{in}^i(t) = (x^i_\mathrm{in}(t), y^i_\mathrm{in}(t)) \in \mathbb{R}^2$.
  • Trajectories are aggregated as $X_\mathrm{in} \in \mathbb{R}^{N \times F \times 2}$ for the input and $X_\mathrm{tgt} \in \mathbb{R}^{N \times F \times 2}$ for the edited (target) motion.

The "motion edit" is defined as the per-point, per-frame deviation:

$$\Delta^i(t) = \mathbf{x}^i_\mathrm{tgt}(t) - \mathbf{x}^i_\mathrm{in}(t),$$

or $\Delta = X_\mathrm{tgt} - X_\mathrm{in}$ in matrix form.
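
The trajectory arithmetic above can be sketched directly in array form. This is an illustrative example, not the authors' code; the point count, clip length, and the specific edit (shifting the second half of each track) are assumptions for demonstration.

```python
import numpy as np

N, F = 20, 49  # assumed point count and clip length for this example
rng = np.random.default_rng(0)

X_in = rng.uniform(0, 480, size=(N, F, 2))   # input trajectories (pixels)
X_tgt = X_in.copy()
X_tgt[:, F // 2:, :] += 15.0                 # shift the second half: a localized edit

delta = X_tgt - X_in                         # per-point, per-frame deviation Delta
print(delta.shape)                           # (20, 49, 2)
print(np.allclose(delta[:, : F // 2], 0.0))  # True: frames before the edit are untouched
```

Because the edit is expressed as a deviation, any frames whose target positions match the input contribute zero, which is what localizes the modification in time.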

Regularization and constraints during training include:

  • Dropout on trajectory channels: low for conditioning tracks (from counterfactuals) and higher for target tracks (forcing generalization to incomplete signals).
  • Inference-time random "jitter" $\varepsilon^i(t) \sim \mathrm{Uniform}(-2, 2)$ px added to $(x, y)$ per point and frame to discourage identity copying.
  • The number of tracked points at inference is typically capped at $N \approx 20$, as larger $N$ impairs adherence to specified edits.
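
The two inference-time regularizers above can be sketched as follows. The function name and subsetting strategy are assumptions; the paper specifies only the uniform jitter range and the approximate point cap.

```python
import numpy as np

rng = np.random.default_rng(1)

def prepare_tracks(X, n_max=20, jitter=2.0):
    """X: (N, F, 2) trajectories in pixels.

    Caps the track count at n_max and adds per-point, per-frame
    Uniform(-jitter, jitter) px noise to discourage identity copying.
    """
    if X.shape[0] > n_max:  # keep at most n_max tracks (random subset)
        keep = rng.choice(X.shape[0], size=n_max, replace=False)
        X = X[keep]
    eps = rng.uniform(-jitter, jitter, size=X.shape)
    return X + eps

X = rng.uniform(0, 480, size=(50, 49, 2))
Xj = prepare_tracks(X)
print(Xj.shape)  # (20, 49, 2)
```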

2. Synthesis of Motion Counterfactuals

To enable supervised learning of motion-conditioned generation, MotionV2V constructs a counterfactual training dataset of appearance-consistent yet motion-diverse video pairs. The pipeline is as follows for a raw input video $V$ of length $L$ frames, with clip length $F$:

  1. Sample a clip start $t_0$.
  2. Extract the target clip $V_\mathrm{tgt} = V[t_0 : t_0 + F]$.
  3. For the counterfactual $V_\mathrm{cf}$, sample a pair of anchor frames from the clip and choose either:
    • Frame-interpolation: synthesize the intermediate frames via a video diffusion model conditioned on the two anchor frames and a text prompt (e.g., "make the person twirl").
    • Temporal-resampling: uniformly sample $F$ frames from $V$ (potentially reversed).
  4. Randomly sample $N$ initial tracked points.
  5. Run the TAPNext tracker on $V_\mathrm{tgt}$ and $V_\mathrm{cf}$ to produce $X_\mathrm{tgt}$ and $X_\mathrm{cf}$, respectively.
  6. Apply consistent random spatial augmentations (sliding crops, small rotations, and scale jitter) to $V_\mathrm{tgt}$ and $V_\mathrm{cf}$.
  7. Rasterize the tracks into $F \times H \times W$ stacks of colored Gaussian blobs, one distinct color per track.
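
Two of the pipeline's deterministic steps, temporal resampling and track rasterization, can be sketched concretely. This is a minimal illustration under assumed details (blob width, color sampling, and function names are not specified in the source):

```python
import numpy as np

def temporal_resample(video, F, reverse=False):
    """Uniformly sample F frames from video (T, H, W, 3); optionally reversed."""
    T = video.shape[0]
    idx = np.linspace(0, T - 1, F).round().astype(int)
    if reverse:
        idx = idx[::-1]
    return video[idx]

def rasterize_tracks(X, H, W, sigma=3.0):
    """X: (N, F, 2) pixel tracks -> (F, H, W, 3) stack of colored Gaussian blobs."""
    N, F, _ = X.shape
    rng = np.random.default_rng(0)
    colors = rng.uniform(0.3, 1.0, size=(N, 3))  # one distinct color per track
    yy, xx = np.mgrid[0:H, 0:W]
    out = np.zeros((F, H, W, 3))
    for i in range(N):
        for t in range(F):
            x, y = X[i, t]
            g = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma**2))
            out[t] += g[..., None] * colors[i]
    return out.clip(0, 1)

video = np.zeros((100, 8, 8, 3))
clip = temporal_resample(video, F=49, reverse=True)
blobs = rasterize_tracks(np.full((1, 49, 2), 4.0), H=8, W=8)
print(clip.shape, blobs.shape)  # (49, 8, 8, 3) (49, 8, 8, 3)
```

The rasterized blob stacks give the diffusion network a dense, image-like encoding of the sparse tracks, so they can be consumed by the same patchification as the video frames.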

Parameters include: clips of $F$ frames at input resolution 480×720, with training counterfactual/target pairs sampled from a large corpus of raw videos.

3. Motion-Conditioned Diffusion Network

MotionV2V builds on the CogVideoX-5B DiT (Diffusion Transformer) text-to-video backbone and augments it with a ControlNet-inspired branch to integrate motion cues.

Conditioning Inputs (in latent space)

  • Noisy video latent $z_t$
  • Counterfactual video latent $z_\mathrm{cf}$
  • Counterfactual track latent $h_\mathrm{cf}$ (rasterized tracks, VAE-encoded)
  • Target track latent $h_\mathrm{tgt}$
  • Optional text prompt (omitted from loss equations).

Preprocessing

  • All RGB videos ($F \times 480 \times 720$) are encoded via a 3D causal VAE into latents of spatial size $60 \times 90$ (8× spatial and 4× temporal compression).

Control Branch

  • The first 18 DiT transformer blocks are duplicated for the control branch.
  • The three conditioning videos are patchified into 48 spatiotemporal channels.
  • In each block, control tokens are processed via zero-initialized channelwise MLPs, then added to the main branch tokens, as in ControlNet.
  • The DiT weights remain frozen; only the control branch is trained.
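
The zero-initialized injection described above has a useful property: at the start of training the control branch is a no-op, so the frozen DiT's behavior is preserved exactly. A minimal sketch, assuming a simple linear channelwise projection (the class and variable names are illustrative, not the actual architecture):

```python
import numpy as np

class ZeroMLP:
    """Channelwise projection whose weights start at zero, as in ControlNet."""
    def __init__(self, dim):
        self.w = np.zeros((dim, dim))  # zero-initialized projection
        self.b = np.zeros(dim)

    def __call__(self, tokens):        # tokens: (seq, dim)
        return tokens @ self.w + self.b

dim = 64
main_tokens = np.random.default_rng(2).normal(size=(10, dim))
control_tokens = np.random.default_rng(3).normal(size=(10, dim))

inject = ZeroMLP(dim)
out = main_tokens + inject(control_tokens)  # identical to main_tokens at init
print(np.allclose(out, main_tokens))  # True
```

As the projection weights move away from zero during training, motion information from the control tokens is blended into the main branch without ever destabilizing the pretrained backbone.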

Diffusion Loss

Let $z_0$ denote the target latent (the VAE encoding of $V_\mathrm{tgt}$), $\varepsilon \sim \mathcal{N}(0, I)$ the sampled noise, $c$ the set of conditioning latents (counterfactual video and both track rasterizations), and $z_t$ the noised latent at diffusion step $t$:

$$\mathcal{L} = \mathbb{E}_{z_0,\, \varepsilon,\, t} \left[ \left\lVert \varepsilon_\theta(z_t, t, c) - \varepsilon \right\rVert_2^2 \right]$$
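
A standard noise-prediction latent-diffusion objective of this kind can be sketched numerically. The latent shape, schedule value, and the stand-in predictor below are assumptions for illustration only, not the actual DiT:

```python
import numpy as np

rng = np.random.default_rng(4)

z0 = rng.normal(size=(16, 60, 90))   # stand-in target latent
eps = rng.normal(size=z0.shape)      # Gaussian noise
alpha_bar = 0.7                      # assumed cumulative schedule value at step t
zt = np.sqrt(alpha_bar) * z0 + np.sqrt(1 - alpha_bar) * eps  # noised latent

def eps_theta(zt, cond=None):
    """Dummy noise predictor standing in for the conditioned DiT."""
    return np.zeros_like(zt)

loss = np.mean((eps_theta(zt) - eps) ** 2)  # || eps_theta - eps ||^2
print(loss > 0)  # True
```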

4. Training Regimen and Inference Constraints

  • Hardware: 8×NVIDIA H100 GPUs, training time ≈1 week.
  • Optimizer: Adam, batch size 32, 15,000 total iterations.
  • The noise schedule follows standard latent diffusion, as in CogVideoX.
  • For data variety, each sample randomizes the edit start and chooses either interpolation or resampling mode.
  • Target track dropouts are higher to encourage generalization; inference-time jitter avoids degenerate identity copying.
  • At inference, the number of tracked points is capped at $N \approx 20$ for optimal edit fidelity.

5. Experimental Results

A. User Study (4-way, head-to-head)

  • 20 diverse videos tested (object motion, camera, time, mid-stream edits).
  • 41 participants.
  • Baselines: ATI (WAN2.1-based), ReVideo, Go-with-the-Flow.
  • Evaluation criteria: content preservation (Q1), motion fidelity (Q2), overall quality (Q3).
Method     Q1 (Content)   Q2 (Motion)   Q3 (Overall)
Ours       70%            71%           69%
ATI        24%            24%           25%
ReVideo    1%             2%            1%
GWTF       5%             3%            5%

B. Quantitative Photometric Error

  • Test set: 100 videos, split at midpoint, second half reversed for comparison.
  • Metrics: framewise L₂, SSIM, LPIPS against ground truth.

Method             L₂ (↓)   SSIM (↑)   LPIPS (↓)
Ours               0.024    0.098      0.031
ATI                0.038    0.094      0.072
Go-with-the-Flow   0.067    0.089      0.088
ReVideo            0.096    0.080      0.106
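
The framewise L₂ comparison can be sketched as below; this is an assumed formulation (mean per-frame RMS error on videos normalized to [0, 1]), since the source does not spell out the exact normalization. SSIM and LPIPS would come from libraries such as scikit-image and lpips and are omitted here.

```python
import numpy as np

def framewise_l2(pred, gt):
    """pred, gt: (F, H, W, 3) videos in [0, 1]. Mean per-frame RMS error."""
    per_frame = np.sqrt(((pred - gt) ** 2).mean(axis=(1, 2, 3)))
    return per_frame.mean()

gt = np.zeros((4, 8, 8, 3))
pred = gt + 0.1          # uniform 0.1 offset on every pixel
print(round(framewise_l2(pred, gt), 3))  # 0.1
```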

C. Qualitative and Ablation Studies

  • Maintains object appearance and follows user-specified, even arbitrary, motion trajectories across eight challenging edit scenarios.
  • Baseline I2V (image-to-video) methods frequently hallucinate missing content or replicate undesired background elements.

Key ablations:

  • Inference jitter $\varepsilon^i(t)$ prevents degenerate copying (e.g., averting unwanted "cloning" in repetitive actions).
  • High target-track dropout enhances generalization.
  • Overly dense point sets ($N \gg 20$) reduce edit fidelity during inference.

6. Comparative Analysis

MotionV2V demonstrates systematic advantages over ATI, ReVideo, and Go-with-the-Flow baselines in:

  • Content preservation (appearance, background, and spatial layout).
  • Motion following along user-defined trajectories.
  • Overall user preference by substantial margins in forced-choice studies.

Significant contextual outcomes:

  • Enables edits from any temporal anchor and supports diverse edits such as mid-stream trajectory changes, off-frame object reentrance, and time manipulation.
  • Maintains high fidelity in appearance, attributed to the use of a strong appearance prior (CogVideoX-5B) and explicit trajectory control.

7. Limitations and Prospects

Observed limitations include:

  • Subject drift in extremely long or complex sequential edits, ascribed to foundational model constraints.
  • Reliance on diffusion-generated counterfactuals as opposed to idealized motion ground truth.

Prospective directions include:

  • Leveraging synthetic 3D datasets with known true motion.
  • Reduction in necessary user-provided or automatically detected control points.
  • Advances in VAE and denoising backbones to enable iterative edits without accretive drift.

A plausible implication is that future work on trajectory representations and model architectures may substantially broaden the operational scope and robustness of trajectory-based video editing frameworks (Burgert et al., 25 Nov 2025).
