MotionV2V: Precise Video Motion Editing

Updated 9 December 2025
  • MotionV2V is a framework for precise video motion editing that directly manipulates sparse point trajectories for controllable, localized modifications.
  • It introduces motion counterfactuals and leverages a motion-conditioned diffusion network to alter videos while preserving overall appearance.
  • Experimental results show superior content preservation, motion fidelity, and user preference compared to leading baseline methods.

MotionV2V is a framework for precise and general video motion editing, grounded in the direct manipulation of sparse motion trajectories. Unlike prior text-to-video and image animation models, MotionV2V formulates video motion editing as the task of altering explicit trajectories within existing videos, enabling controllable and localized modifications that propagate naturally from arbitrary timestamps. It introduces the concept of "motion counterfactuals," pairing the original video with synthetically altered motion while retaining appearance, and employs a motion-conditioned diffusion network to synthesize realistic edited outputs. Extensive experimental evaluation demonstrates its superiority relative to contemporary baselines in fidelity, motion following, and overall user preference (Burgert et al., 25 Nov 2025).

1. Representation of Trajectories and Motion Edits

MotionV2V represents motion using sparse point trajectories tracked through video clips. For point $i \in \{1, \ldots, N\}$ at frame $t \in \{1, \ldots, F\}$:

  • The input position is $\mathbf{x}_\mathrm{in}^i(t) = (x^i_\mathrm{in}(t), y^i_\mathrm{in}(t)) \in \mathbb{R}^2$.
  • Trajectories are aggregated as $X_\mathrm{in} \in \mathbb{R}^{N \times F \times 2}$ for the input and $X_\mathrm{tgt} \in \mathbb{R}^{N \times F \times 2}$ for the edited (target) motion.

The "motion edit" is defined as the per-point, per-frame deviation:

$$\Delta^i(t) = \mathbf{x}^i_\mathrm{tgt}(t) - \mathbf{x}^i_\mathrm{in}(t),$$

or $\Delta = X_\mathrm{tgt} - X_\mathrm{in}$ in matrix form.

Regularization and constraints during training include:

  • Dropout on trajectory channels: low for conditioning tracks (from counterfactuals) and higher for target tracks (forcing generalization to incomplete signals).
  • Inference-time random "jitter" $\varepsilon^i(t) \sim \mathrm{Uniform}(-2, 2)$ px, added to $(x, y)$ per point and frame to discourage identity copying.
  • The number of tracked points at inference is typically capped at $N \approx 20$, as larger $N$ impairs adherence to specified edits.
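
To make the representation concrete, the following minimal NumPy sketch (with hypothetical array values) forms an edit delta $\Delta$ and applies the inference-time jitter described above.

```python
import numpy as np

# Hypothetical example: N tracked points over F frames, (x, y) in pixels.
N, F = 20, 49
X_in = np.random.rand(N, F, 2) * [720, 480]      # input trajectories X_in
X_tgt = X_in.copy()
X_tgt[:, 25:, 0] += 40.0                         # e.g., shift all points right from frame 25 on

# Per-point, per-frame motion edit: Delta = X_tgt - X_in.
Delta = X_tgt - X_in                             # shape (N, F, 2)

# Inference-time jitter ~ Uniform(-2, 2) px per point and frame,
# which discourages the network from copying the input identically.
jitter = np.random.uniform(-2.0, 2.0, size=X_tgt.shape)
X_tgt_jittered = X_tgt + jitter
```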

2. Synthesis of Motion Counterfactuals

To enable supervised learning of motion-conditioned generation, MotionV2V constructs a counterfactual training dataset of appearance-consistent yet motion-diverse video pairs. The pipeline is as follows for a raw input video $V_\mathrm{real}$ of length $T$ and clip length $F$:

  1. Sample a clip start $s \sim \mathrm{Uniform}(0, T{-}F)$.
  2. Set the target video $V_\mathrm{tgt} = V_\mathrm{real}[s : s{+}F{-}1]$.
  3. For the counterfactual $V_\mathrm{cf}$, sample frames $a, b \sim \mathrm{Uniform}(0, T{-}1)$, $a \ne b$; choose either:
    • Frame interpolation: synthesize $F$ frames via a video diffusion model conditioned on $V_\mathrm{real}[a]$, $V_\mathrm{real}[b]$, and a text prompt (e.g., "make the person twirl").
    • Temporal resampling: uniformly sample $F$ frames from $V_\mathrm{real}[a:b]$ (potentially reversed).
  4. Randomly sample $N \sim \mathrm{Uniform}(1, 64)$ initial tracked points $(t^i, x^i, y^i)$.
  5. Run the TAPNext tracker on $V_\mathrm{tgt}$ and $V_\mathrm{cf}$ to produce $X_\mathrm{tgt}$ and $X_\mathrm{cf}$, respectively.
  6. Apply consistent random spatial augmentations (sliding crops, $\pm 15^\circ$ rotations, scale $\in [0.8, 1.2]$) to $V_\mathrm{cf}$ and $X_\mathrm{cf}$.
  7. Rasterize the tracks into $F \times H \times W$ stacks of colored Gaussian blobs ($\sigma = 10$ px), using $N$ distinct colors.

Parameters include: $F = 49$ frames, input resolution $480 \times 720$, and training from $100{,}000$ counterfactual/target pairs sampled from $500{,}000$ raw videos.
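
Step 7 can be sketched as follows; this is a minimal NumPy rendering assuming additive isotropic Gaussians and an arbitrary per-track color palette, since the paper's exact rasterization details beyond the $\sigma = 10$ px blobs are not reproduced here.

```python
import numpy as np

def rasterize_tracks(tracks, F, H, W, sigma=10.0):
    """Render N point tracks (shape (N, F, 2), (x, y) in pixels) as an
    F x H x W x 3 video of colored Gaussian blobs, one color per track."""
    N = tracks.shape[0]
    rng = np.random.default_rng(0)
    colors = rng.uniform(0.2, 1.0, size=(N, 3))           # hypothetical per-track colors
    ys, xs = np.mgrid[0:H, 0:W]                           # pixel coordinate grid
    video = np.zeros((F, H, W, 3), dtype=np.float32)
    for t in range(F):
        for i in range(N):
            x, y = tracks[i, t]
            blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
            video[t] += blob[..., None] * colors[i]       # additive colored blob
    return np.clip(video, 0.0, 1.0)

# Example: rasterize 8 random tracks at the training resolution.
tracks = np.random.rand(8, 49, 2) * [720, 480]
blob_video = rasterize_tracks(tracks, F=49, H=480, W=720)
```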

3. Motion-Conditioned Diffusion Network

MotionV2V builds on the CogVideoX-5B DiT (Diffusion Transformer) text-to-video backbone and augments it with a ControlNet-inspired branch to integrate motion cues.

Conditioning Inputs (in latent space)

  • Noisy video latent: $z_t \in \mathbb{R}^{\mathrm{lat} \times H' \times W'}$
  • Counterfactual video: $V_\mathrm{cf}$ (latent)
  • Counterfactual tracks: $M_\mathrm{cf}$ (latent)
  • Target tracks: $M_\mathrm{tgt}$ (latent)
  • Optional text prompt $y$ (omitted from loss equations).

Preprocessing

  • All RGB videos, of shape $F \times 480 \times 720$, are encoded via a 3D causal VAE to latents of shape $\mathrm{lat} \times 60 \times 90$, with $\mathrm{lat} = (F-1)/4 + 1 = 13$.
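
As a quick check of these shapes, assuming the usual CogVideoX-style 4× temporal and 8× spatial compression (the spatial factor is inferred from $480 \to 60$ and $720 \to 90$):

```python
# 3D causal VAE: the first frame is kept, then 4 frames are merged per latent step.
F = 49
lat = (F - 1) // 4 + 1              # = 13 latent frames

# Spatial: 8x downsampling of the 480 x 720 input.
H_lat, W_lat = 480 // 8, 720 // 8   # = 60, 90
print(lat, H_lat, W_lat)            # 13 60 90
```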

Control Branch

  • The first 18 DiT transformer blocks are duplicated for the control branch.
  • The three conditioning videos are patchified into 48 spatiotemporal channels.
  • In each block, control tokens are processed via zero-initialized channelwise MLPs, then added to the main branch tokens, as in ControlNet.
  • The DiT weights remain frozen; only the control branch is trained.
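
The injection pattern can be sketched in PyTorch as below; the class, tensor shapes, and hidden dimension are hypothetical stand-ins, and real CogVideoX blocks take additional inputs (timestep and text embeddings).

```python
import torch
import torch.nn as nn

class ControlBlock(nn.Module):
    """One control-branch block: a duplicated DiT block plus a zero-initialized
    channelwise MLP whose output is added to the main-branch tokens."""
    def __init__(self, dit_block, dim):
        super().__init__()
        self.block = dit_block                       # copy of a pretrained DiT block
        self.zero_mlp = nn.Linear(dim, dim)          # channelwise projection
        nn.init.zeros_(self.zero_mlp.weight)         # zero init: no effect at start of training
        nn.init.zeros_(self.zero_mlp.bias)

    def forward(self, control_tokens, main_tokens):
        control_tokens = self.block(control_tokens)                 # process conditioning tokens
        main_tokens = main_tokens + self.zero_mlp(control_tokens)   # ControlNet-style addition
        return control_tokens, main_tokens

# Usage with a stand-in block and a hypothetical hidden size:
blk = ControlBlock(nn.Identity(), dim=1024)
ctrl, main = blk(torch.randn(2, 226, 1024), torch.randn(2, 226, 1024))
```

Because the projection starts at zero, the control branch initially leaves the frozen backbone's behavior unchanged and only gradually learns to steer it.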

Diffusion Loss

Let $\varepsilon \sim \mathcal{N}(0, I)$, $t \sim \mathrm{Uniform}(\{1, \ldots, T\})$, target latent $z_0$, and noised latent $z_t = \alpha_t z_0 + \sigma_t \varepsilon$:

$$L_\mathrm{diff} = \mathbb{E}_{t, \varepsilon} \left\| \varepsilon - \varepsilon_\theta(z_t; V_\mathrm{cf}, M_\mathrm{cf}, M_\mathrm{tgt}, y) \right\|_2^2.$$
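
A minimal sketch of this objective, assuming a denoiser `eps_model` that accepts the noised latent, the timestep, and the three conditioning latents plus the prompt (all names hypothetical):

```python
import torch
import torch.nn.functional as F_nn  # aliased to avoid clashing with the frame count F

def diffusion_loss(eps_model, z0, V_cf, M_cf, M_tgt, y, alphas, sigmas):
    """Epsilon-prediction loss: noise the target latent z0 at a random timestep
    and regress the injected noise from the motion-conditioned denoiser."""
    B, T = z0.shape[0], alphas.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)       # t ~ Uniform({1, ..., T})
    eps = torch.randn_like(z0)                            # eps ~ N(0, I)
    a = alphas[t].view(B, *([1] * (z0.dim() - 1)))        # broadcast schedule coefficients
    s = sigmas[t].view(B, *([1] * (z0.dim() - 1)))
    z_t = a * z0 + s * eps                                # forward noising
    eps_hat = eps_model(z_t, t, V_cf, M_cf, M_tgt, y)     # conditioned prediction
    return F_nn.mse_loss(eps_hat, eps)                    # || eps - eps_theta(...) ||^2
```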

4. Training Regimen and Inference Constraints

  • Hardware: 8×NVIDIA H100 GPUs, training time ≈1 week.
  • Optimizer: Adam, learning rate $1 \times 10^{-4}$, batch size 32, 15,000 total iterations.
  • The noise schedule follows standard latent diffusion ($\beta_1 \dots \beta_T$ as in CogVideoX).
  • For data variety, each sample randomizes the edit start and chooses either interpolation or resampling mode.
  • Target track dropouts are higher to encourage generalization; inference-time jitter avoids degenerate identity copying.
  • At inference, $N \lesssim 20$ for optimal edit fidelity.
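
Under the reported settings, training reduces to a loop like the following sketch; `dit_backbone` and `control_branch` are placeholder modules standing in for the frozen CogVideoX DiT and its trainable control branch, and the loss line stands in for $L_\mathrm{diff}$ above.

```python
import torch
import torch.nn as nn

# Placeholder modules; the real ones are the frozen DiT and the duplicated-block control branch.
dit_backbone = nn.Sequential(nn.Linear(64, 64))
control_branch = nn.Sequential(nn.Linear(64, 64))

for p in dit_backbone.parameters():                  # backbone weights stay frozen
    p.requires_grad_(False)

optimizer = torch.optim.Adam(control_branch.parameters(), lr=1e-4)  # reported learning rate

for step in range(15_000):                           # 15,000 iterations, batch size 32
    x = torch.randn(32, 64)                          # placeholder batch
    loss = control_branch(x).pow(2).mean()           # placeholder for L_diff
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```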

5. Experimental Results

A. User Study (4-way, head-to-head)

  • 20 diverse videos tested (object motion, camera, time, mid-stream edits).
  • 41 participants.
  • Baselines: ATI (WAN2.1-based), ReVideo, Go-with-the-Flow.
  • Evaluation criteria: content preservation (Q1), motion fidelity (Q2), overall quality (Q3).

| Method | Q1 (Content) | Q2 (Motion) | Q3 (Overall) |
|--------|--------------|-------------|--------------|
| Ours | 70% | 71% | 69% |
| ATI | 24% | 24% | 25% |
| ReVideo | 1% | 2% | 1% |
| GWTF | 5% | 3% | 5% |

B. Quantitative Photometric Error

  • Test set: 100 videos, split at midpoint, second half reversed for comparison.
  • Metrics: framewise L₂, SSIM, LPIPS against ground truth.

| Method | L₂ (↓) | SSIM (↑) | LPIPS (↓) |
|--------|--------|----------|-----------|
| Ours | 0.024 | 0.098 | 0.031 |
| ATI | 0.038 | 0.094 | 0.072 |
| Go-with-the-Flow | 0.067 | 0.089 | 0.088 |
| ReVideo | 0.096 | 0.080 | 0.106 |
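
One plausible reading of this protocol is sketched below, using `skimage` for SSIM and the `lpips` package for the perceptual metric; treating the framewise L₂ as a mean squared error is an assumption here.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity as ssim

lpips_fn = lpips.LPIPS(net="alex")   # perceptual distance network

def photometric_errors(pred, gt):
    """Framewise L2 / SSIM / LPIPS between an edited clip and ground truth.
    pred, gt: float arrays of shape (F, H, W, 3) with values in [0, 1]."""
    l2 = float(np.mean((pred - gt) ** 2))                               # assumed: mean squared error
    ssim_val = float(np.mean([ssim(p, g, channel_axis=-1, data_range=1.0)
                              for p, g in zip(pred, gt)]))
    to_t = lambda v: torch.from_numpy(v).permute(0, 3, 1, 2).float() * 2 - 1  # LPIPS expects [-1, 1]
    with torch.no_grad():
        lp = float(lpips_fn(to_t(pred), to_t(gt)).mean())
    return l2, ssim_val, lp
```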

C. Qualitative and Ablation Studies

  • Maintains object appearance and follows user-specified, even arbitrary, motion trajectories across eight challenging edit scenarios.
  • Baseline I2V (image-to-video) methods frequently hallucinate missing content or replicate undesired background elements.

Key ablations:

  • Inference jitter $\varepsilon$ prevents degenerate copying (e.g., averting unwanted "cloning" in repetitive actions).
  • High target-track dropout enhances generalization.
  • Overly dense point sets ($N > 20$) reduce edit fidelity during inference.

6. Comparative Analysis

MotionV2V demonstrates systematic advantages over ATI, ReVideo, and Go-with-the-Flow baselines in:

  • Content preservation (appearance, background, and spatial layout).
  • Motion following along user-defined trajectories.
  • Overall user preference by substantial margins in forced-choice studies.

Significant contextual outcomes:

  • Enables edits from any temporal anchor and supports diverse edits such as mid-stream trajectory changes, off-frame object reentrance, and time manipulation.
  • Maintains high fidelity in appearance, attributed to the use of a strong appearance prior (CogVideoX-5B) and explicit trajectory control.

7. Limitations and Prospects

Observed limitations include:

  • Subject drift in extremely long or complex sequential edits, attributed to constraints of the underlying foundation model.
  • Reliance on diffusion-generated counterfactuals as opposed to idealized motion ground truth.

Prospective directions include:

  • Leveraging synthetic 3D datasets with known true motion.
  • Reducing the number of user-provided or automatically detected control points required.
  • Advances in VAE and denoising backbones to enable iterative edits without cumulative drift.

A plausible implication is that future work on trajectory representations and model architectures may substantially broaden the operational scope and robustness of trajectory-based video editing frameworks (Burgert et al., 25 Nov 2025).
