Wan-Move: Motion-Controllable Video Generation
- The paper introduces dense point-trajectory guidance that projects trajectories into the latent space and propagates first-frame features along them to achieve fine-grained, high-fidelity video synthesis.
- It integrates seamlessly into large-scale image-to-video backbones without additional motion encoders, enabling efficient and interpretable motion propagation.
- Comprehensive evaluations on MoveBench demonstrate state-of-the-art controllability, with improved PSNR and SSIM and reduced EPE compared to prior methods.
Wan-Move is a motion-controllable video generation framework that provides fine-grained, interpretable, and scalable motion guidance in generative video models. Its core innovation is to make the first-frame visual condition features motion-aware by explicitly propagating them along dense point trajectories projected into the latent space where synthesis occurs. This enables high-fidelity video generation with precise, localized control of object and scene motion, and integrates naturally into large-scale image-to-video (I2V) backbones without architectural modification. Comprehensive evaluation on the MoveBench benchmark and public datasets demonstrates Wan-Move's state-of-the-art controllability, robustness, and efficiency compared to prior video generation systems (Chu et al., 9 Dec 2025).
1. Motion Control via Dense Point Trajectory Guidance
Wan-Move models scene and object motion using a dense grid of tracked point trajectories. Specifically, motion is represented as a set $\{\tau_i\}_{i=1}^{N}$, with each trajectory $\tau_i = \{(x_i^t, y_i^t)\}_{t=1}^{T}$ denoting the pixel-space location of track $i$ at video frame $t$. By default, trajectories are initialized on a uniform grid over the first frame, so $N$ equals the number of grid points and the tracks cover the entire spatial extent of the first frame (a minimal construction is sketched at the end of this section).
These trajectories are:
- User-specified or automatically extracted: Wan-Move enables user-guided motion or adapts to motion observed in a reference video.
- Fine-grained and dense: The grid approach covers local and global motion, supporting precise per-region edits and broad scene adjustments.
- Flexible for compositional motion: Trajectories can correspond to multiple objects or background, supporting multi-entity, multi-part, or holistic control.
Contextually, this dense trajectory guidance advances earlier motion control paradigms, which typically rely on sparse manipulable signals (e.g., drag points, bounding boxes, or low-frequency flow) and therefore lack fine spatial or temporal resolution (Chu et al., 9 Dec 2025).
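To make the trajectory representation concrete, the following is a minimal NumPy sketch that builds a dense uniform-grid trajectory array of shape (num_frames, N, 2) and drags a rectangular region, in the spirit of the per-region edits described above; the helper name, frame size, and grid size are illustrative assumptions, not the paper's interface.

```python
import numpy as np

def grid_drag_trajectories(height, width, grid_size, num_frames, box, drag):
    """Dense grid trajectories in which points inside `box` are dragged linearly.

    box:  (x0, y0, x1, y1) region of interest in the first frame.
    drag: (dx, dy) total pixel displacement of that region over the clip.
    Returns an array of shape (num_frames, grid_size**2, 2) holding (x, y)
    pixel positions of every track at every frame; points outside the box
    stay static, modeling a single-object drag edit.
    """
    ys = np.linspace(0, height - 1, grid_size)
    xs = np.linspace(0, width - 1, grid_size)
    gx, gy = np.meshgrid(xs, ys)
    base = np.stack([gx.ravel(), gy.ravel()], axis=-1)          # (N, 2)

    x0, y0, x1, y1 = box
    inside = ((base[:, 0] >= x0) & (base[:, 0] <= x1) &
              (base[:, 1] >= y0) & (base[:, 1] <= y1))          # (N,) mask

    tracks = np.repeat(base[None], num_frames, axis=0)          # (T, N, 2)
    for t in range(num_frames):
        alpha = t / max(num_frames - 1, 1)                      # linear ramp 0 -> 1
        tracks[t, inside] = base[inside] + alpha * np.array(drag)
    return tracks.astype(np.float32)

# Illustrative call: drag a 100x100 region 120 px to the right over 81 frames.
traj = grid_drag_trajectories(480, 832, grid_size=32, num_frames=81,
                              box=(200, 150, 300, 250), drag=(120, 0))
```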
2. Latent-Space Trajectory Projection and First-Frame Feature Propagation
Wan-Move's mechanism for motion guidance avoids adding specialized motion encoders. Instead, the trajectory set is projected directly into the VAE latent space where video synthesis occurs:
- Spatial/temporal compression factors: Given the VAE encoder's spatial ($f_s$) and temporal ($f_t$) compression factors, each pixel-space trajectory $\tau_i$ is downsampled to a latent trajectory $\hat{\tau}_i$ whose coordinates index the latent grid.
- Feature propagation: For each latent trajectory $\hat{\tau}_i$ and latent time-step $n$, the latent variable at position $\hat{\tau}_i(n)$ is assigned the first-frame feature found at $\hat{\tau}_i(0)$, i.e., the feature is carried along the track. When multiple tracks coincide at the same latent index, one of their features is assigned by random selection (a minimal index-mapping sketch follows this list).
- Updated latent condition: This propagates the first-frame feature tensor through time according to the trajectories, constructing a spatiotemporally aligned latent map dictating object/scene motion.
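The following is a minimal PyTorch sketch of the latent projection and first-frame feature propagation described above, under assumed tensor shapes and compression factors; the function name and the exact rounding and collision handling are illustrative assumptions rather than the released implementation.

```python
import torch

def propagate_first_frame_features(z_first, tracks, f_s, f_t):
    """Scatter first-frame latent features along dense trajectories.

    z_first: (C, H_lat, W_lat) latent features of the first frame.
    tracks:  (T, N, 2) pixel-space (x, y) trajectories as a float tensor.
    f_s, f_t: assumed spatial and temporal VAE compression factors.
    Returns a (T_lat, C, H_lat, W_lat) motion-conditioned latent map.
    """
    C, H_lat, W_lat = z_first.shape
    T, N, _ = tracks.shape

    # Project trajectories into latent coordinates (temporal + spatial downsampling).
    lat_tracks = tracks[::f_t] / f_s                              # (T_lat, N, 2)
    T_lat = lat_tracks.shape[0]
    xs = lat_tracks[..., 0].round().long().clamp(0, W_lat - 1)
    ys = lat_tracks[..., 1].round().long().clamp(0, H_lat - 1)

    # Each track carries the first-frame feature found at its starting cell.
    src = z_first[:, ys[0], xs[0]]                                # (C, N)

    z_cond = torch.zeros(T_lat, C, H_lat, W_lat, dtype=z_first.dtype)
    z_cond[0] = z_first
    for n in range(1, T_lat):
        # Shuffle track order so that, when several tracks land on the same
        # latent cell, the surviving feature is effectively chosen at random.
        perm = torch.randperm(N)
        z_cond[n, :, ys[n, perm], xs[n, perm]] = src[:, perm]
    return z_cond
```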
This process is architecture-agnostic, requiring no ControlNet or motion-encoder, and retains full compatibility with modern DiT/U-Net backbones. Notably, this sidesteps the complexity and scalability bottlenecks of systems relying on dedicated conditional modules or fused multi-modal handlers (Chu et al., 9 Dec 2025).
3. Integration and Inference in Large-Scale I2V Diffusion Models
Wan-Move inserts the motion-conditioned latent as a new image condition tensor within a pretrained I2V backbone (exemplified by Wan-I2V-14B):
- No architecture change: The propagated latent replaces the standard first-frame feature in the denoising diffusion model. The condition is concatenated along the channel dimension to the latent noise variable before each diffusion update.
- Classifier-free guidance: Sampling leverages conditional/unconditional scores with adjustable scale, maintaining performance benefits of state-of-the-art diffusion approaches.
- Efficiency and scalability: Motion conditioning occurs entirely as latent tensor manipulation, compatible with FSDP and large mini-batch scheduling. The backbone can be fine-tuned scalably under this paradigm (~2M videos, rapid convergence at ~30K steps) (Chu et al., 9 Dec 2025).
Pseudocode for the main inference loop is outlined as follows (see (Chu et al., 9 Dec 2025) for full algorithmic details):
Input: first frame I, trajectories {tau_i}, text prompt, CLIP/text features
1. Z_image = VAE_Encode(concat[I, zeros_T])
2. For each tau_i:
       (h, w)(n) = M(tau_i)(n)            # project tau_i into latent coordinates
       For n = 1 .. T/f_t:
           Propagate: Z_image[n, h(n), w(n), :] = Z_image[0, h(0), w(0), :]
3. Sample x_T ~ N(0, I)
4. For t = T .. 1:
       v_uncond = v_theta(x_t, t, Z_image, z_global)
       v_cond   = v_theta(x_t, t, Z_image, z_global, z_text)
       x_{t-1}  = sampler_step(x_t, v_uncond + w * (v_cond - v_uncond))
5. Decode the final latent with the VAE
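As a concrete complement to steps 3–4, the sketch below shows one classifier-free-guided update in PyTorch with the motion-conditioned latent concatenated along the channel dimension (per Section 3); the velocity-network signature, Euler step, and step-size convention are assumptions for illustration, not the released implementation.

```python
import torch

def cfg_sampler_step(v_theta, x_t, t, dt, z_image, z_global, z_text, w):
    """One classifier-free-guided Euler update of a velocity-prediction model.

    x_t:     (B, C, T, H, W) current noisy video latent.
    z_image: (B, C_cond, T, H, W) motion-conditioned latent from Section 2.
    w:       guidance scale.
    """
    # Channel-wise concatenation of the motion condition with the noisy latent.
    x_in = torch.cat([x_t, z_image], dim=1)

    v_uncond = v_theta(x_in, t, z_global)           # drop the text condition
    v_cond = v_theta(x_in, t, z_global, z_text)     # full condition
    v = v_uncond + w * (v_cond - v_uncond)          # classifier-free guidance

    # Euler update; the sign/step convention depends on the flow parameterization.
    return x_t - dt * v
```

In practice this step would be iterated over the full timestep schedule and followed by VAE decoding, as in step 5 above.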
4. Evaluation: MoveBench Benchmark and Quantitative Performance
The MoveBench benchmark is introduced for rigorous, large-scale motion-controllable video evaluation:
- Composition: 1,018 videos (5s duration, 81 frames, 480p), 54 diverse content categories, single/multi-object scenarios, annotated with dense point tracks and segmentation masks.
- Metrics: Video/image fidelity (FID, FVD, PSNR, SSIM), motion accuracy (end-point error, EPE; a minimal EPE computation is sketched after this list), and user preference (two-alternative forced-choice, 2AFC, studies) (Chu et al., 9 Dec 2025).
- Results: Wan-Move achieves FID 12.2, FVD 83.5, PSNR 17.8, SSIM 0.64, and EPE 2.6 on single-object cases. Ablations reveal further gains from denser trajectories at inference (up to EPE 1.1, PSNR 21.9, SSIM 0.79 at 1024-point density).
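For reference, end-point error (EPE) is the mean Euclidean distance between predicted and ground-truth track positions; the sketch below is a generic implementation under assumed array shapes, not MoveBench's evaluation script.

```python
import numpy as np

def end_point_error(pred_tracks, gt_tracks, valid=None):
    """Mean end-point error (EPE) between predicted and ground-truth tracks.

    pred_tracks, gt_tracks: arrays of shape (T, N, 2) with (x, y) positions.
    valid: optional (T, N) boolean mask of visible/annotated points.
    """
    err = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)   # (T, N) distances
    if valid is not None:
        return float(err[valid].mean())
    return float(err.mean())
```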
Comparative user studies (2AFC) show nearly unanimous preference for Wan-Move over strong baselines such as LeviTor in both motion accuracy (98%) and visual quality (98.8%), and on-par performance against high-end proprietary tools (Kling 1.5 Pro) (Chu et al., 9 Dec 2025). MoveBench's scale and annotation fidelity enable robust, reproducible evaluation, setting a new standard for the community.
5. Ablation Studies and Comparative Analysis
Extensive ablations clarify the significance of Wan-Move's design choices:
| Strategy | FID (↓) | FVD (↓) | EPE (↓) | PSNR (↑) | SSIM (↑) | Added latency |
|---|---|---|---|---|---|---|
| Pixel-level replication | — | — | 3.7 | 15.3 | 0.56 | — |
| Random latent embedding | — | — | 2.7 | 16.1 | 0.59 | — |
| Latent-feature replication (ours) | 12.2 | 83.5 | 2.6 | 17.8 | 0.64 | +3s |
| ControlNet injection | 12.4 | — | 2.5 | — | — | +225s |
- Latent-feature propagation delivers optimal accuracy and efficiency.
- Dense trajectory usage at inference monotonically improves EPE and spatiotemporal quality.
- Direct concatenation is computationally efficient versus adapters like ControlNet.
- Using too many tracks during training can cause a mismatch with sparser downstream control signals, whereas extreme densification at inference remains beneficial.
- Backbone/data-scale ablations confirm consistent superiority to MagicMotion, Tora, and other contemporary methods.
- Robustness is observed in extreme or out-of-domain scenarios, where Wan-Move retains up to 40% lower error than competing methods (Chu et al., 9 Dec 2025).
6. Applications and Extensions
Wan-Move's latent trajectory guidance paradigm supports a broad set of controllable video synthesis operations:
- Single-/multi-object dragging: Direct region-of-interest manipulation through trajectory specification.
- Camera motion: Pan/dolly effects by specifying global trajectory fields (a pan/dolly sketch follows this list).
- 3D object rotation: Achieved by propagating first-frame features via depth-to-3D-to-2D projected tracks.
- Motion transfer: Application of extracted trajectories from source to arbitrary reference image frames enables motion-style transfer.
- Compositional editing: Multiple motion conditions can be superimposed, permitting complex scene or style replays (Chu et al., 9 Dec 2025).
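To illustrate the camera-motion case referenced above, the following is a minimal NumPy sketch that builds a global trajectory field combining a pan (uniform shift) with an approximate dolly (scaling about the frame center); the interface and parameter names are illustrative assumptions.

```python
import numpy as np

def camera_trajectories(height, width, grid_size, num_frames,
                        pan_per_frame=(0.0, 0.0), zoom_per_frame=0.0):
    """Global trajectory field emulating camera motion.

    pan_per_frame:  (dx, dy) uniform pixel shift applied every frame (pan).
    zoom_per_frame: fractional scale change per frame about the frame center
                    (positive values approximate a dolly-in).
    Returns (num_frames, grid_size**2, 2) pixel-space trajectories.
    """
    ys = np.linspace(0, height - 1, grid_size)
    xs = np.linspace(0, width - 1, grid_size)
    gx, gy = np.meshgrid(xs, ys)
    base = np.stack([gx.ravel(), gy.ravel()], axis=-1)           # (N, 2)
    center = np.array([(width - 1) / 2.0, (height - 1) / 2.0])

    frames = []
    for t in range(num_frames):
        scale = (1.0 + zoom_per_frame) ** t
        pts = center + (base - center) * scale                   # dolly / zoom
        pts = pts + np.array(pan_per_frame) * t                  # pan
        frames.append(pts)
    tracks = np.stack(frames).astype(np.float32)                 # (T, N, 2)
    tracks[..., 0] = np.clip(tracks[..., 0], 0, width - 1)
    tracks[..., 1] = np.clip(tracks[..., 1], 0, height - 1)
    return tracks
```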
This suggests broad utility for animation, visual effects, scientific visualization, and content creation, with direct user-editable motion fields and seamless integration atop scalable video diffusion models. The architecture's plug-and-play nature—with no need for motion encoders or fusion blocks—implies high extensibility to new backbones or modalities.
7. Position within the Motion-Controllable Video Generation Landscape
Wan-Move's introduction marks a shift in controllable video generation:
- Relative to mask-based and optical flow methods: Unlike mask guidance (Feng et al., 24 Mar 2025), explicit inpainting (Hu et al., 2023), or flow-based priors (Lei et al., 16 Nov 2024), Wan-Move achieves state-of-the-art controllability and fidelity with a minimal and generalizable procedural change at the latent condition level.
- Advantages over token/disentanglement strategies: Token-level approaches such as TokenMotion (Li et al., 11 Apr 2025) yield fine-grained control but often require elaborate separate encoding and fusion pipelines; Wan-Move's direct manipulation in latent space achieves similar or better efficacy with simpler integration.
- In contrast to reward or reinforcement mechanisms: No reinforcement- or reward-based objective or module is present in the Wan-Move pipeline; all supervision derives from a standard flow-matching loss on the motion-conditioned backbone (a minimal sketch of this loss follows below).
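For context, a standard (rectified) flow-matching objective regresses the model's predicted velocity onto the straight-line velocity between data latents and noise; the PyTorch sketch below is a generic formulation under an assumed conditioning interface, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_theta, x0, z_image, z_global, z_text):
    """Generic rectified flow-matching loss.

    x0: clean video latents, shape (B, C, T, H, W).
    The motion-conditioned latent z_image and the other conditions are passed
    through to the velocity network v_theta (assumed signature).
    """
    b = x0.shape[0]
    noise = torch.randn_like(x0)
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1, 1)     # uniform timesteps

    x_t = (1.0 - t) * x0 + t * noise                            # linear interpolation path
    target_v = noise - x0                                       # straight-line velocity

    x_in = torch.cat([x_t, z_image], dim=1)                     # channel-wise conditioning
    pred_v = v_theta(x_in, t.flatten(), z_global, z_text)
    return F.mse_loss(pred_v, target_v)
```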
Interpreted broadly, Wan-Move establishes latent-feature propagation along dense, user-editable trajectories as a foundational method for precise, efficient video motion control (Chu et al., 9 Dec 2025).