MotionV2V: Precise Video Motion Editing

Updated 9 December 2025
  • MotionV2V is a framework for precise video motion editing that directly manipulates sparse point trajectories for controllable, localized modifications.
  • It introduces motion counterfactuals and leverages a motion-conditioned diffusion network to alter videos while preserving overall appearance.
  • Experimental results show superior content preservation, motion fidelity, and user preference compared to leading baseline methods.

MotionV2V is a framework for precise and general video motion editing, grounded in the direct manipulation of sparse motion trajectories. Unlike prior text-to-video and image animation models, MotionV2V formulates video motion editing as the task of altering explicit trajectories within existing videos, enabling controllable and localized modifications that propagate naturally from arbitrary timestamps. It introduces the concept of "motion counterfactuals," pairing the original video with synthetically altered motion while retaining appearance, and employs a motion-conditioned diffusion network to synthesize realistic edited outputs. Extensive experimental evaluation demonstrates its superiority relative to contemporary baselines in fidelity, motion following, and overall user preference (Burgert et al., 25 Nov 2025).

1. Representation of Trajectories and Motion Edits

MotionV2V represents motion using sparse point trajectories tracked through video clips. For point $i \in \{1, \ldots, N\}$ at frame $t \in \{1, \ldots, F\}$:

  • The input position is $\mathbf{x}_\mathrm{in}^i(t) = (x^i_\mathrm{in}(t), y^i_\mathrm{in}(t)) \in \mathbb{R}^2$.
  • Trajectories are aggregated as $X_\mathrm{in} \in \mathbb{R}^{N \times F \times 2}$ for the input and $X_\mathrm{tgt} \in \mathbb{R}^{N \times F \times 2}$ for the edited (target) motion.

The "motion edit" is defined as the per-point, per-frame deviation:

$$\Delta^i(t) = \mathbf{x}^i_\mathrm{tgt}(t) - \mathbf{x}^i_\mathrm{in}(t),$$

or $\Delta = X_\mathrm{tgt} - X_\mathrm{in}$ in matrix form.

Regularization and constraints during training include:

  • Dropout on trajectory channels: low for conditioning tracks (from counterfactuals) and higher for target tracks (forcing generalization to incomplete signals).
  • Inference-time random "jitter" $\varepsilon^i(t) \sim \mathrm{Uniform}(-2, 2)$ px, added to $(x, y)$ per point and frame to discourage identity copying.
  • The number of tracked points at inference is typically capped at $N \approx 20$, as larger $N$ impairs adherence to specified edits.
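
To make the representation concrete, the following minimal NumPy sketch (with hypothetical array values) forms an edit delta $\Delta$ and applies the inference-time jitter described above.

```python
import numpy as np

# Hypothetical example: N tracked points over F frames, (x, y) in pixels.
N, F = 20, 49
X_in = np.random.rand(N, F, 2) * [720, 480]      # input trajectories X_in
X_tgt = X_in.copy()
X_tgt[:, 25:, 0] += 40.0                         # e.g., shift all points right from frame 25 on

# Per-point, per-frame motion edit: Delta = X_tgt - X_in.
Delta = X_tgt - X_in                             # shape (N, F, 2)

# Inference-time jitter ~ Uniform(-2, 2) px per point and frame,
# which discourages the network from copying the input identically.
jitter = np.random.uniform(-2.0, 2.0, size=X_tgt.shape)
X_tgt_jittered = X_tgt + jitter
```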

2. Synthesis of Motion Counterfactuals

To enable supervised learning of motion-conditioned generation, MotionV2V constructs a counterfactual training dataset of appearance-consistent yet motion-diverse video pairs. The pipeline is as follows for a raw input video $V_\mathrm{real}$ of length $T$ and clip length $F$:

  1. Sample a clip start $s \sim \mathrm{Uniform}(0, T{-}F)$.
  2. Set the target video $V_\mathrm{tgt} = V_\mathrm{real}[s : s{+}F{-}1]$.
  3. For the counterfactual $V_\mathrm{cf}$, sample frames $a, b \sim \mathrm{Uniform}(0, T{-}1)$, $a \ne b$; choose either:
    • Frame interpolation: synthesize $F$ frames via a video diffusion model conditioned on $V_\mathrm{real}[a]$, $V_\mathrm{real}[b]$, and a text prompt (e.g., "make the person twirl").
    • Temporal resampling: uniformly sample $F$ frames from $V_\mathrm{real}[a:b]$ (potentially reversed).
  4. Randomly sample $N \sim \mathrm{Uniform}(1, 64)$ initial tracked points $(t^i, x^i, y^i)$.
  5. Run the TAPNext tracker on $V_\mathrm{tgt}$ and $V_\mathrm{cf}$ to produce $X_\mathrm{tgt}$ and $X_\mathrm{cf}$, respectively.
  6. Apply consistent random spatial augmentations (sliding crops, $\pm 15^\circ$ rotations, scale $\in [0.8, 1.2]$) to $V_\mathrm{cf}$ and $X_\mathrm{cf}$.
  7. Rasterize the tracks into $F \times H \times W$ stacks of colored Gaussian blobs ($\sigma = 10$ px), using $N$ distinct colors.

Parameters include: $F = 49$ frames, input resolution $480 \times 720$, and training from $100{,}000$ counterfactual/target pairs sampled from $500{,}000$ raw videos.
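
Step 7 can be sketched as follows; this is a minimal NumPy rendering assuming additive isotropic Gaussians and an arbitrary per-track color palette, since the paper's exact rasterization details beyond the $\sigma = 10$ px blobs are not reproduced here.

```python
import numpy as np

def rasterize_tracks(tracks, F, H, W, sigma=10.0):
    """Render N point tracks (shape (N, F, 2), (x, y) in pixels) as an
    F x H x W x 3 video of colored Gaussian blobs, one color per track."""
    N = tracks.shape[0]
    rng = np.random.default_rng(0)
    colors = rng.uniform(0.2, 1.0, size=(N, 3))           # hypothetical per-track colors
    ys, xs = np.mgrid[0:H, 0:W]                           # pixel coordinate grid
    video = np.zeros((F, H, W, 3), dtype=np.float32)
    for t in range(F):
        for i in range(N):
            x, y = tracks[i, t]
            blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
            video[t] += blob[..., None] * colors[i]       # additive colored blob
    return np.clip(video, 0.0, 1.0)

# Example: rasterize 8 random tracks at the training resolution.
tracks = np.random.rand(8, 49, 2) * [720, 480]
blob_video = rasterize_tracks(tracks, F=49, H=480, W=720)
```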

3. Motion-Conditioned Diffusion Network

MotionV2V builds on the CogVideoX-5B DiT (Diffusion Transformer) text-to-video backbone and augments it with a ControlNet-inspired branch to integrate motion cues.

Conditioning Inputs (in latent space)

  • Noisy video latent: $z_t \in \mathbb{R}^{\mathrm{lat} \times H' \times W'}$
  • Counterfactual video: $V_\mathrm{cf}$ (latent)
  • Counterfactual tracks: $M_\mathrm{cf}$ (latent)
  • Target tracks: $M_\mathrm{tgt}$ (latent)
  • Optional text prompt $y$ (omitted from loss equations).

Preprocessing

  • All RGB videos, of shape $F \times 480 \times 720$, are encoded via a 3D causal VAE to latents of shape $\mathrm{lat} \times 60 \times 90$, with $\mathrm{lat} = (F-1)/4 + 1 = 13$.
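
As a quick check of these shapes, assuming the usual CogVideoX-style 4× temporal and 8× spatial compression (the spatial factor is inferred from $480 \to 60$ and $720 \to 90$):

```python
# 3D causal VAE: the first frame is kept, then 4 frames are merged per latent step.
F = 49
lat = (F - 1) // 4 + 1              # = 13 latent frames

# Spatial: 8x downsampling of the 480 x 720 input.
H_lat, W_lat = 480 // 8, 720 // 8   # = 60, 90
print(lat, H_lat, W_lat)            # 13 60 90
```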

Control Branch

  • The first 18 DiT transformer blocks are duplicated for the control branch.
  • The three conditioning videos are patchified into 48 spatiotemporal channels.
  • In each block, control tokens are processed via zero-initialized channelwise MLPs, then added to the main branch tokens, as in ControlNet.
  • The DiT weights remain frozen; only the control branch is trained.
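
The injection pattern can be sketched in PyTorch as below; the class, tensor shapes, and hidden dimension are hypothetical stand-ins, and real CogVideoX blocks take additional inputs (timestep and text embeddings).

```python
import torch
import torch.nn as nn

class ControlBlock(nn.Module):
    """One control-branch block: a duplicated DiT block plus a zero-initialized
    channelwise MLP whose output is added to the main-branch tokens."""
    def __init__(self, dit_block, dim):
        super().__init__()
        self.block = dit_block                       # copy of a pretrained DiT block
        self.zero_mlp = nn.Linear(dim, dim)          # channelwise projection
        nn.init.zeros_(self.zero_mlp.weight)         # zero init: no effect at start of training
        nn.init.zeros_(self.zero_mlp.bias)

    def forward(self, control_tokens, main_tokens):
        control_tokens = self.block(control_tokens)                 # process conditioning tokens
        main_tokens = main_tokens + self.zero_mlp(control_tokens)   # ControlNet-style addition
        return control_tokens, main_tokens

# Usage with a stand-in block and a hypothetical hidden size:
blk = ControlBlock(nn.Identity(), dim=1024)
ctrl, main = blk(torch.randn(2, 226, 1024), torch.randn(2, 226, 1024))
```

Because the projection starts at zero, the control branch initially leaves the frozen backbone's behavior unchanged and only gradually learns to steer it.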

Diffusion Loss

Let $\varepsilon \sim \mathcal{N}(0, I)$, $t \sim \mathrm{Uniform}(\{1, \ldots, T\})$, target latent $z_0$, and noised latent $z_t = \alpha_t z_0 + \sigma_t \varepsilon$:

$$L_\mathrm{diff} = \mathbb{E}_{t, \varepsilon} \left\| \varepsilon - \varepsilon_\theta(z_t; V_\mathrm{cf}, M_\mathrm{cf}, M_\mathrm{tgt}, y) \right\|_2^2.$$
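
A minimal sketch of this objective, assuming a denoiser `eps_model` that accepts the noised latent, the timestep, and the three conditioning latents plus the prompt (all names hypothetical):

```python
import torch
import torch.nn.functional as F_nn  # aliased to avoid clashing with the frame count F

def diffusion_loss(eps_model, z0, V_cf, M_cf, M_tgt, y, alphas, sigmas):
    """Epsilon-prediction loss: noise the target latent z0 at a random timestep
    and regress the injected noise from the motion-conditioned denoiser."""
    B, T = z0.shape[0], alphas.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)       # t ~ Uniform({1, ..., T})
    eps = torch.randn_like(z0)                            # eps ~ N(0, I)
    a = alphas[t].view(B, *([1] * (z0.dim() - 1)))        # broadcast schedule coefficients
    s = sigmas[t].view(B, *([1] * (z0.dim() - 1)))
    z_t = a * z0 + s * eps                                # forward noising
    eps_hat = eps_model(z_t, t, V_cf, M_cf, M_tgt, y)     # conditioned prediction
    return F_nn.mse_loss(eps_hat, eps)                    # || eps - eps_theta(...) ||^2
```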

4. Training Regimen and Inference Constraints

  • Hardware: 8×NVIDIA H100 GPUs, training time ≈1 week.
  • Optimizer: Adam, learning rate $1 \times 10^{-4}$, batch size 32, 15,000 total iterations.
  • The noise schedule follows standard latent diffusion ($\beta_1 \dots \beta_T$ as in CogVideoX).
  • For data variety, each sample randomizes the edit start and chooses either interpolation or resampling mode.
  • Target track dropouts are higher to encourage generalization; inference-time jitter avoids degenerate identity copying.
  • At inference, $N \lesssim 20$ for optimal edit fidelity.
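
Under the reported settings, training reduces to a loop like the following sketch; `dit_backbone` and `control_branch` are placeholder modules standing in for the frozen CogVideoX DiT and its trainable control branch, and the loss line stands in for $L_\mathrm{diff}$ above.

```python
import torch
import torch.nn as nn

# Placeholder modules; the real ones are the frozen DiT and the duplicated-block control branch.
dit_backbone = nn.Sequential(nn.Linear(64, 64))
control_branch = nn.Sequential(nn.Linear(64, 64))

for p in dit_backbone.parameters():                  # backbone weights stay frozen
    p.requires_grad_(False)

optimizer = torch.optim.Adam(control_branch.parameters(), lr=1e-4)  # reported learning rate

for step in range(15_000):                           # 15,000 iterations, batch size 32
    x = torch.randn(32, 64)                          # placeholder batch
    loss = control_branch(x).pow(2).mean()           # placeholder for L_diff
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```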

5. Experimental Results

A. User Study (4-way, head-to-head)

  • 20 diverse videos tested (object motion, camera, time, mid-stream edits).
  • 41 participants.
  • Baselines: ATI (WAN2.1-based), ReVideo, Go-with-the-Flow.
  • Evaluation criteria: content preservation (Q1), motion fidelity (Q2), overall quality (Q3).

| Method | Q1 (Content) | Q2 (Motion) | Q3 (Overall) |
|--------|--------------|-------------|--------------|
| Ours | 70% | 71% | 69% |
| ATI | 24% | 24% | 25% |
| ReVideo | 1% | 2% | 1% |
| GWTF | 5% | 3% | 5% |

B. Quantitative Photometric Error

  • Test set: 100 videos, split at midpoint, second half reversed for comparison.
  • Metrics: framewise L₂, SSIM, LPIPS against ground truth.

| Method | L₂ (↓) | SSIM (↑) | LPIPS (↓) |
|--------|--------|----------|-----------|
| Ours | 0.024 | 0.098 | 0.031 |
| ATI | 0.038 | 0.094 | 0.072 |
| Go-with-the-Flow | 0.067 | 0.089 | 0.088 |
| ReVideo | 0.096 | 0.080 | 0.106 |
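
One plausible reading of this protocol is sketched below, using `skimage` for SSIM and the `lpips` package for the perceptual metric; treating the framewise L₂ as a mean squared error is an assumption here.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity as ssim

lpips_fn = lpips.LPIPS(net="alex")   # perceptual distance network

def photometric_errors(pred, gt):
    """Framewise L2 / SSIM / LPIPS between an edited clip and ground truth.
    pred, gt: float arrays of shape (F, H, W, 3) with values in [0, 1]."""
    l2 = float(np.mean((pred - gt) ** 2))                               # assumed: mean squared error
    ssim_val = float(np.mean([ssim(p, g, channel_axis=-1, data_range=1.0)
                              for p, g in zip(pred, gt)]))
    to_t = lambda v: torch.from_numpy(v).permute(0, 3, 1, 2).float() * 2 - 1  # LPIPS expects [-1, 1]
    with torch.no_grad():
        lp = float(lpips_fn(to_t(pred), to_t(gt)).mean())
    return l2, ssim_val, lp
```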

C. Qualitative and Ablation Studies

  • Maintains object appearance and follows user-specified, even arbitrary, motion trajectories across eight challenging edit scenarios.
  • Baseline I2V (image-to-video) methods frequently hallucinate missing content or replicate undesired background elements.

Key ablations:

  • Inference jitter $\varepsilon$ prevents degenerate copying (e.g., averting unwanted "cloning" in repetitive actions).
  • High target-track dropout enhances generalization.
  • Overly dense point sets ($N > 20$) reduce edit fidelity during inference.

6. Comparative Analysis

MotionV2V demonstrates systematic advantages over ATI, ReVideo, and Go-with-the-Flow baselines in:

  • Content preservation (appearance, background, and spatial layout).
  • Motion following along user-defined trajectories.
  • Overall user preference by substantial margins in forced-choice studies.

Significant contextual outcomes:

  • Enables edits from any temporal anchor and supports diverse edits such as mid-stream trajectory changes, off-frame object reentrance, and time manipulation.
  • Maintains high fidelity in appearance, attributed to the use of a strong appearance prior (CogVideoX-5B) and explicit trajectory control.

7. Limitations and Prospects

Observed limitations include:

  • Subject drift in extremely long or complex sequential edits, attributed to constraints of the underlying foundation model.
  • Reliance on diffusion-generated counterfactuals as opposed to idealized motion ground truth.

Prospective directions include:

  • Leveraging synthetic 3D datasets with known true motion.
  • Reducing the number of user-provided or automatically detected control points required.
  • Advances in VAE and denoising backbones to enable iterative edits without cumulative drift.

A plausible implication is that future work on trajectory representations and model architectures may substantially broaden the operational scope and robustness of trajectory-based video editing frameworks (Burgert et al., 25 Nov 2025).
