
Wan-Move: Motion-Controllable Video Generation

Updated 11 December 2025
  • The paper introduces dense trajectory guidance that projects point trajectories into the latent space and propagates first-frame features along them, achieving fine-grained, high-fidelity video synthesis.
  • It integrates seamlessly into large-scale image-to-video backbones without additional motion encoders, enabling efficient and interpretable motion propagation.
  • Comprehensive evaluations on MoveBench demonstrate state-of-the-art controllability with improved PSNR, SSIM, and reduced EPE compared to prior methods.

Wan-Move is a motion-controllable video generation framework that achieves fine-grained, interpretable, and scalable motion guidance in generative video models. Its core innovation is to make the original visual condition features motion-aware by explicitly propagating them along dense point trajectories projected into the latent space, and to let these propagated features guide video synthesis directly. This method enables high-fidelity video generation with precise, localized control of object and scene motion, and can be naturally integrated into large-scale image-to-video (I2V) backbones without architectural modification. Comprehensive evaluation on the MoveBench benchmark and public datasets demonstrates Wan-Move's state-of-the-art controllability, robustness, and efficiency compared to prior video generation systems (Chu et al., 9 Dec 2025).

1. Motion Control via Dense Point Trajectory Guidance

Wan-Move models scene and object motion using a dense grid of tracked point trajectories. Specifically, motion is represented as $\{\tau_i\}_{i=1}^{K}$, with each trajectory $\tau_i = \{(x_i^t, y_i^t)\}_{t=0}^{T}$ denoting the pixel-space location of track $i$ at video frame $t$. By default, trajectories are organized on a uniform $G \times G$ grid (e.g., $32 \times 32$), so $K = G^2$, covering the entire spatial extent of the first frame.

These trajectories are:

  • User-specified or automatically extracted: Wan-Move enables user-guided motion or adapts to motion observed in a reference video.
  • Fine-grained and dense: The grid approach covers local and global motion, supporting precise per-region edits and broad scene adjustments.
  • Flexible for compositional motion: Trajectories can correspond to multiple objects or background, supporting multi-entity, multi-part, or holistic control.

Contextually, this dense trajectory guidance advances earlier motion-control paradigms, which typically rely on sparse manipulable signals (e.g., drag points, bounding boxes, or low-frequency flow) and thus lack fine spatial or temporal resolution (Chu et al., 9 Dec 2025).
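
As a minimal sketch of this representation, the trajectory set can be stored as a $(K, T+1, 2)$ array of pixel coordinates initialized on a uniform $G \times G$ lattice over the first frame. The helper name, frame size, and static initialization below are illustrative assumptions, not the paper's code:

import numpy as np

def init_grid_trajectories(height, width, G=32, T=80):
    """Build K = G*G trajectories on a uniform grid over the first frame.

    Returns an array of shape (K, T+1, 2) holding (x, y) pixel coordinates
    per track and per frame. Every track starts as static (identity motion);
    in practice it would be edited by a user or replaced by tracks extracted
    from a reference video.
    """
    xs = np.linspace(0, width - 1, G)
    ys = np.linspace(0, height - 1, G)
    grid_x, grid_y = np.meshgrid(xs, ys)                        # (G, G) each
    start = np.stack([grid_x, grid_y], axis=-1).reshape(-1, 2)  # (K, 2)
    # Repeat the first-frame locations across all T+1 frames.
    return np.repeat(start[:, None, :], T + 1, axis=1)          # (K, T+1, 2)

trajs = init_grid_trajectories(480, 832, G=32, T=80)
print(trajs.shape)  # (1024, 81, 2)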

2. Latent-Space Trajectory Projection and First-Frame Feature Propagation

Wan-Move's mechanism for motion guidance avoids adding specialized motion encoders. Instead, the trajectory set {τi}\{\tau_i\} is projected directly into the VAE latent space where video synthesis occurs:

  • Spatial/temporal compression factors: Given the VAE encoder's spatial ($f_s$) and temporal ($f_t$) compression factors, trajectories are downsampled to yield latent-space trajectories $\tilde{\tau}_i = M(\tau_i)$.
  • Feature propagation: For each trajectory $i$ and latent time step $n$, the latent at index $(n, h, w)$ is assigned the first-frame feature located at the start of $\tilde{\tau}_i$. When multiple tracks coincide at a latent index, the feature is assigned by random selection (see the sketch after this list).
  • Updated latent condition: This propagates the first-frame feature tensor $Z_{\text{image}}$ through time according to the trajectories, constructing a spatiotemporally aligned latent map that dictates object and scene motion.
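
A minimal sketch of this propagation step, assuming the channels-last latent layout used in the pseudocode of Section 3 and illustrative VAE strides ($f_s = 8$, $f_t = 4$); the function name, stride defaults, and overwrite tie-breaking (instead of random selection) are assumptions rather than the released implementation:

import torch

def propagate_first_frame_features(z_image, trajs, f_s=8, f_t=4):
    """Copy first-frame latent features along downsampled trajectories.

    z_image : (N, H, W, C) latent tensor from the VAE; index 0 along the first
              axis holds the encoded first frame.
    trajs   : (K, T+1, 2) pixel-space (x, y) trajectories as a float tensor.
    Returns a copy of z_image whose later time steps carry first-frame
    features moved along each track.
    """
    N, H, W, C = z_image.shape
    z = z_image.clone()
    # Project trajectories to latent resolution: subsample frames by the
    # temporal stride and divide coordinates by the spatial stride.
    lat = (trajs[:, ::f_t, :][:, :N] / f_s).round().long()   # (K, N, 2)
    lat[..., 0].clamp_(0, W - 1)
    lat[..., 1].clamp_(0, H - 1)
    for k in range(lat.shape[0]):
        w0, h0 = lat[k, 0, 0], lat[k, 0, 1]                  # first-frame latent cell
        for n in range(1, N):
            w, h = lat[k, n, 0], lat[k, n, 1]
            z[n, h, w, :] = z_image[0, h0, w0, :]            # ties: last write wins
    return z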

This process is architecture-agnostic, requiring no ControlNet branch or motion encoder, and retains full compatibility with modern DiT/U-Net backbones. Notably, it sidesteps the complexity and scalability bottlenecks of systems that rely on dedicated conditional modules or fused multi-modal handlers (Chu et al., 9 Dec 2025).

3. Integration and Inference in Large-Scale I2V Diffusion Models

Wan-Move inserts the motion-conditioned latent as a new image condition tensor within a pretrained I2V backbone (exemplified by Wan-I2V-14B):

  • No architecture change: The propagated latent replaces the standard first-frame feature in the denoising diffusion model; the condition is concatenated along the channel dimension with the noisy latent before each diffusion update (see the sketch after this list).
  • Classifier-free guidance: Sampling leverages conditional/unconditional scores with adjustable scale, maintaining performance benefits of state-of-the-art diffusion approaches.
  • Efficiency and scalability: Motion conditioning occurs entirely as latent tensor manipulation, compatible with FSDP and large mini-batch scheduling. The backbone can be fine-tuned scalably under this paradigm (~2M videos, rapid convergence at ~30K steps) (Chu et al., 9 Dec 2025).
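
Concretely, the conditioning amounts to a plain channel-wise concatenation. The tensor names and shapes below are assumptions for a 480p, 81-frame clip under an assumed 16-channel, 4x/8x-compressing VAE; the classifier-free guidance loop itself appears in the pseudocode that follows:

import torch

# Hypothetical channels-last layout, matching the pseudocode below:
# (batch, latent_frames, H/8, W/8, channels). x_t is the noisy video latent at
# the current denoising step; z_motion is the trajectory-propagated
# first-frame latent from Section 2.
x_t = torch.randn(1, 21, 60, 104, 16)
z_motion = torch.randn(1, 21, 60, 104, 16)

# Conditioning is plain concatenation along the channel dimension before the
# backbone's input projection; no ControlNet branch or motion encoder is added.
model_input = torch.cat([x_t, z_motion], dim=-1)   # (1, 21, 60, 104, 32)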

Pseudocode for the main inference loop is outlined below (see Chu et al., 9 Dec 2025 for full algorithmic details):

Input: first frame I, trajectories {tau_i}, text prompt, global (CLIP) and text features
1. Z_image = VAE_Encode(concat[I, zeros_T])            # encode first frame padded with T zero frames
2. For each tau_i:                                     # propagate first-frame features along tracks
       tilde_tau_i = M(tau_i)                          # project to latent grid (spatial/temporal downsampling)
       (h0, w0) = tilde_tau_i[0]
       For n = 1..T/f_t:
           (h, w) = tilde_tau_i[n]
           Z_image[n, h, w, :] = Z_image[0, h0, w0, :]
3. Sample x_S ~ N(0, I)                                # S denoising steps
4. For s = S..1:                                       # classifier-free guidance with scale w
       v_uncond = v_theta(x_s, s, Z_image, z_global)
       v_cond   = v_theta(x_s, s, Z_image, z_global, z_text)
       x_{s-1}  = sampler_step(x_s, v_uncond + w*(v_cond - v_uncond))
5. Decode the final latent with the VAE decoder

4. Evaluation: MoveBench Benchmark and Quantitative Performance

The MoveBench benchmark is introduced for rigorous, large-scale motion-controllable video evaluation:

  • Composition: 1,018 videos (5s duration, 81 frames, 480p), 54 diverse content categories, single/multi-object scenarios, annotated with dense point tracks and segmentation masks.
  • Metrics: Video/image fidelity (FID, FVD, PSNR, SSIM), motion accuracy (end-point error, EPE; see the sketch after this list), and user preference (2AFC studies) (Chu et al., 9 Dec 2025).
  • Results: Wan-Move achieves FID 12.2, FVD 83.5, PSNR 17.8, SSIM 0.64, and EPE 2.6 on single-object cases. Ablations reveal further gains from denser trajectories at inference (up to EPE 1.1, PSNR 21.9, SSIM 0.79 at 1024-point density).
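
For reference, EPE can be computed as the mean Euclidean distance between corresponding predicted and ground-truth track points. A generic sketch follows; it is not MoveBench's exact evaluation script, which may, for example, restrict the average to annotated or visible points:

import numpy as np

def end_point_error(pred_tracks, gt_tracks):
    """Mean end-point error (EPE) between predicted and ground-truth tracks.

    Both arrays have shape (K, T, 2) with (x, y) pixel coordinates; EPE is the
    per-point Euclidean distance averaged over all tracks and frames.
    """
    return float(np.linalg.norm(pred_tracks - gt_tracks, axis=-1).mean())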

Comparative user studies (2AFC) show near-unanimous preference for Wan-Move in motion accuracy (98%) and visual quality (98.8%) over strong baselines such as LeviTor, and performance on par with high-end proprietary tools (Kling 1.5 Pro) (Chu et al., 9 Dec 2025). MoveBench's scale and annotation fidelity enable robust, reproducible evaluation, setting a new standard for the community.

5. Ablation Studies and Comparative Analysis

Extensive ablations clarify the significance of Wan-Move's design choices:

Strategy | FID (↓) | FVD (↓) | EPE (↓) | PSNR (↑) | SSIM (↑) | Latency
Pixel-level replication | – | – | 3.7 | 15.3 | 0.56 | –
Random latent embedding | – | – | 2.7 | 16.1 | 0.59 | –
Latent-feature replication (ours) | 12.2 | 83.5 | 2.6 | 17.8 | 0.64 | +3s
ControlNet injection | 12.4 | – | 2.5 | – | – | +225s
  • Latent-feature propagation delivers optimal accuracy and efficiency.
  • Dense trajectory usage at inference monotonically improves EPE and spatiotemporal quality.
  • Direct concatenation is computationally efficient versus adapters like ControlNet.
  • Training with too many tracks can cause a mismatch with sparse downstream control tasks, whereas inference tolerates extreme densification.
  • Backbone/data-scale ablations confirm consistent superiority to MagicMotion, Tora, and other contemporary methods.
  • Robustness is observed on extreme or out-of-domain scenarios, with up to 40% lower error than competing methods (Chu et al., 9 Dec 2025).

6. Applications and Extensions

Wan-Move's latent trajectory guidance paradigm supports a broad set of controllable video synthesis operations:

  • Single-/multi-object dragging: Direct region-of-interest manipulation through trajectory specification.
  • Camera motion: Pan and dolly effects are produced by specifying global trajectory fields (see the sketch after this list).
  • 3D object rotation: Achieved by propagating first-frame features via depth-to-3D-to-2D projected tracks.
  • Motion transfer: Application of extracted trajectories from source to arbitrary reference image frames enables motion-style transfer.
  • Compositional editing: Multiple motion conditions can be superimposed, permitting complex scene or style replays (Chu et al., 9 Dec 2025).
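
For example, the camera-pan case above reduces to a trajectory field in which every grid point translates by a fixed per-frame offset. The sketch below uses the same assumed $(K, T+1, 2)$ trajectory format as earlier; the function name and default velocities are hypothetical:

import numpy as np

def pan_trajectories(height, width, G=32, T=80, vx=4.0, vy=0.0):
    """Global camera-pan field: every grid point translates by (vx, vy) px/frame.

    Returns a (G*G, T+1, 2) array of (x, y) trajectories that can be fed to the
    same latent propagation step as object-level tracks; coordinates are clipped
    so tracks stay inside the first-frame extent.
    """
    xs = np.linspace(0, width - 1, G)
    ys = np.linspace(0, height - 1, G)
    gx, gy = np.meshgrid(xs, ys)
    start = np.stack([gx, gy], axis=-1).reshape(-1, 1, 2)        # (K, 1, 2)
    offsets = np.arange(T + 1).reshape(1, -1, 1) * np.array([vx, vy]).reshape(1, 1, 2)
    trajs = start + offsets                                      # (K, T+1, 2)
    trajs[..., 0] = np.clip(trajs[..., 0], 0, width - 1)
    trajs[..., 1] = np.clip(trajs[..., 1], 0, height - 1)
    return trajs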

This suggests broad utility for animation, visual effects, scientific visualization, and content creation, with direct user-editable motion fields and seamless integration atop scalable video diffusion models. The architecture's plug-and-play nature—with no need for motion encoders or fusion blocks—implies high extensibility to new backbones or modalities.

7. Position within the Motion-Controllable Video Generation Landscape

Wan-Move's introduction marks a shift in controllable video generation:

  • Relative to mask-based and optical flow methods: Unlike mask guidance (Feng et al., 24 Mar 2025), explicit inpainting (Hu et al., 2023), or flow-based priors (Lei et al., 16 Nov 2024), Wan-Move achieves state-of-the-art controllability and fidelity with a minimal and generalizable procedural change at the latent condition level.
  • Advantages over token/disentanglement strategies: Token-level approaches such as TokenMotion (Li et al., 11 Apr 2025) yield fine-grained control but often require elaborate separate encoding and fusion pipelines; Wan-Move's direct manipulation in latent space achieves similar or better efficacy with simpler integration.
  • In contrast to reward or reinforcement mechanisms: No reinforcement or reward-based objective or module is present in the Wan-Move pipeline; all supervision derives from a standard flow-matching loss on the motion-conditioned backbone.

Interpreted broadly, Wan-Move establishes latent-feature propagation along dense, user-editable trajectories as a foundational method for precise, efficient video motion control (Chu et al., 9 Dec 2025).
