Wan-Move: Motion-Controllable Video Generation
- The paper introduces dense point-trajectory guidance that projects trajectories into the latent space and propagates first-frame features along them to achieve fine-grained, high-fidelity video synthesis.
- It integrates seamlessly into large-scale image-to-video backbones without additional motion encoders, enabling efficient and interpretable motion propagation.
- Comprehensive evaluations on MoveBench demonstrate state-of-the-art controllability, with improved PSNR and SSIM and reduced EPE compared to prior methods.
Wan-Move is a motion-controllable video generation framework that provides fine-grained, interpretable, and scalable motion guidance in generative video models. Its core innovation is to make the first-frame visual condition features motion-aware by explicitly propagating them along dense point trajectories projected into the latent space where synthesis occurs. This enables high-fidelity video generation with precise, localized control of object and scene motion, and integrates naturally into large-scale image-to-video (I2V) backbones without architectural modification. Comprehensive evaluation on the MoveBench benchmark and public datasets demonstrates Wan-Move's state-of-the-art controllability, robustness, and efficiency compared to prior video generation systems (Chu et al., 9 Dec 2025).
1. Motion Control via Dense Point Trajectory Guidance
Wan-Move models scene and object motion using a dense grid of tracked point trajectories. Specifically, motion is represented as a set $\{\tau_i\}_{i=1}^{N}$, with each trajectory $\tau_i = \{(x_i^t, y_i^t)\}_{t=1}^{T}$ denoting the pixel-space location of track $i$ at video frame $t$. By default, trajectories are initialized on a uniform grid over the first frame, so $N$ equals the number of grid points and the tracks cover the entire spatial extent of the first frame (a minimal construction is sketched at the end of this section).
These trajectories are:
- User-specified or automatically extracted: Wan-Move enables user-guided motion or adapts to motion observed in a reference video.
- Fine-grained and dense: The grid approach covers local and global motion, supporting precise per-region edits and broad scene adjustments.
- Flexible for compositional motion: Trajectories can correspond to multiple objects or background, supporting multi-entity, multi-part, or holistic control.
Contextually, this dense trajectory guidance advances earlier motion control paradigms, which typically rely on sparse manipulable signals (e.g., drag points, bounding boxes, or low-frequency flow) and therefore lack fine spatial or temporal resolution (Chu et al., 9 Dec 2025).
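To make the trajectory representation concrete, the following is a minimal NumPy sketch that builds a dense uniform-grid trajectory array of shape (num_frames, N, 2) and drags a rectangular region, in the spirit of the per-region edits described above; the helper name, frame size, and grid size are illustrative assumptions, not the paper's interface.

```python
import numpy as np

def grid_drag_trajectories(height, width, grid_size, num_frames, box, drag):
    """Dense grid trajectories in which points inside `box` are dragged linearly.

    box:  (x0, y0, x1, y1) region of interest in the first frame.
    drag: (dx, dy) total pixel displacement of that region over the clip.
    Returns an array of shape (num_frames, grid_size**2, 2) holding (x, y)
    pixel positions of every track at every frame; points outside the box
    stay static, modeling a single-object drag edit.
    """
    ys = np.linspace(0, height - 1, grid_size)
    xs = np.linspace(0, width - 1, grid_size)
    gx, gy = np.meshgrid(xs, ys)
    base = np.stack([gx.ravel(), gy.ravel()], axis=-1)          # (N, 2)

    x0, y0, x1, y1 = box
    inside = ((base[:, 0] >= x0) & (base[:, 0] <= x1) &
              (base[:, 1] >= y0) & (base[:, 1] <= y1))          # (N,) mask

    tracks = np.repeat(base[None], num_frames, axis=0)          # (T, N, 2)
    for t in range(num_frames):
        alpha = t / max(num_frames - 1, 1)                      # linear ramp 0 -> 1
        tracks[t, inside] = base[inside] + alpha * np.array(drag)
    return tracks.astype(np.float32)

# Illustrative call: drag a 100x100 region 120 px to the right over 81 frames.
traj = grid_drag_trajectories(480, 832, grid_size=32, num_frames=81,
                              box=(200, 150, 300, 250), drag=(120, 0))
```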
2. Latent-Space Trajectory Projection and First-Frame Feature Propagation
Wan-Move's mechanism for motion guidance avoids adding specialized motion encoders. Instead, the trajectory set is projected directly into the VAE latent space where video synthesis occurs:
- Spatial/temporal compression factors: Given the VAE encoder's spatial ($f_s$) and temporal ($f_t$) compression factors, each pixel-space trajectory $\tau_i$ is downsampled to a latent trajectory $\hat{\tau}_i$ whose coordinates index the latent grid.
- Feature propagation: For each latent trajectory $\hat{\tau}_i$ and latent time-step $n$, the latent variable at position $\hat{\tau}_i(n)$ is assigned the first-frame feature found at $\hat{\tau}_i(0)$, i.e., the feature is carried along the track. When multiple tracks coincide at the same latent index, one of their features is assigned by random selection (a minimal index-mapping sketch follows this list).
- Updated latent condition: This propagates the first-frame feature tensor through time according to the trajectories, constructing a spatiotemporally aligned latent map dictating object/scene motion.
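The following is a minimal PyTorch sketch of the latent projection and first-frame feature propagation described above, under assumed tensor shapes and compression factors; the function name and the exact rounding and collision handling are illustrative assumptions rather than the released implementation.

```python
import torch

def propagate_first_frame_features(z_first, tracks, f_s, f_t):
    """Scatter first-frame latent features along dense trajectories.

    z_first: (C, H_lat, W_lat) latent features of the first frame.
    tracks:  (T, N, 2) pixel-space (x, y) trajectories as a float tensor.
    f_s, f_t: assumed spatial and temporal VAE compression factors.
    Returns a (T_lat, C, H_lat, W_lat) motion-conditioned latent map.
    """
    C, H_lat, W_lat = z_first.shape
    T, N, _ = tracks.shape

    # Project trajectories into latent coordinates (temporal + spatial downsampling).
    lat_tracks = tracks[::f_t] / f_s                              # (T_lat, N, 2)
    T_lat = lat_tracks.shape[0]
    xs = lat_tracks[..., 0].round().long().clamp(0, W_lat - 1)
    ys = lat_tracks[..., 1].round().long().clamp(0, H_lat - 1)

    # Each track carries the first-frame feature found at its starting cell.
    src = z_first[:, ys[0], xs[0]]                                # (C, N)

    z_cond = torch.zeros(T_lat, C, H_lat, W_lat, dtype=z_first.dtype)
    z_cond[0] = z_first
    for n in range(1, T_lat):
        # Shuffle track order so that, when several tracks land on the same
        # latent cell, the surviving feature is effectively chosen at random.
        perm = torch.randperm(N)
        z_cond[n, :, ys[n, perm], xs[n, perm]] = src[:, perm]
    return z_cond
```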
This process is architecture-agnostic, requiring no ControlNet or motion-encoder, and retains full compatibility with modern DiT/U-Net backbones. Notably, this sidesteps the complexity and scalability bottlenecks of systems relying on dedicated conditional modules or fused multi-modal handlers (Chu et al., 9 Dec 2025).
3. Integration and Inference in Large-Scale I2V Diffusion Models
Wan-Move inserts the motion-conditioned latent as a new image condition tensor within a pretrained I2V backbone (exemplified by Wan-I2V-14B):
- No architecture change: The propagated latent replaces the standard first-frame feature in the denoising diffusion model. The condition is concatenated along the channel dimension to the latent noise variable before each diffusion update.
- Classifier-free guidance: Sampling leverages conditional/unconditional scores with adjustable scale, maintaining performance benefits of state-of-the-art diffusion approaches.
- Efficiency and scalability: Motion conditioning occurs entirely as latent tensor manipulation, compatible with FSDP and large mini-batch scheduling. The backbone can be fine-tuned scalably under this paradigm (~2M videos, rapid convergence at ~30K steps) (Chu et al., 9 Dec 2025).
Pseudocode for the main inference loop is outlined as follows (see (Chu et al., 9 Dec 2025) for full algorithmic details):
Input: first frame I, trajectories {tau_i}, text prompt, CLIP/text features
1. Z_image = VAE_Encode(concat[I, zeros_T])
2. For each tau_i:
       (h, w)(n) = M(tau_i)(n)            # project tau_i into latent coordinates
       For n = 1 .. T/f_t:
           Propagate: Z_image[n, h(n), w(n), :] = Z_image[0, h(0), w(0), :]
3. Sample x_T ~ N(0, I)
4. For t = T .. 1:
       v_uncond = v_theta(x_t, t, Z_image, z_global)
       v_cond   = v_theta(x_t, t, Z_image, z_global, z_text)
       x_{t-1}  = sampler_step(x_t, v_uncond + w * (v_cond - v_uncond))
5. Decode the final latent with the VAE
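As a concrete complement to steps 3–4, the sketch below shows one classifier-free-guided update in PyTorch with the motion-conditioned latent concatenated along the channel dimension (per Section 3); the velocity-network signature, Euler step, and step-size convention are assumptions for illustration, not the released implementation.

```python
import torch

def cfg_sampler_step(v_theta, x_t, t, dt, z_image, z_global, z_text, w):
    """One classifier-free-guided Euler update of a velocity-prediction model.

    x_t:     (B, C, T, H, W) current noisy video latent.
    z_image: (B, C_cond, T, H, W) motion-conditioned latent from Section 2.
    w:       guidance scale.
    """
    # Channel-wise concatenation of the motion condition with the noisy latent.
    x_in = torch.cat([x_t, z_image], dim=1)

    v_uncond = v_theta(x_in, t, z_global)           # drop the text condition
    v_cond = v_theta(x_in, t, z_global, z_text)     # full condition
    v = v_uncond + w * (v_cond - v_uncond)          # classifier-free guidance

    # Euler update; the sign/step convention depends on the flow parameterization.
    return x_t - dt * v
```

In practice this step would be iterated over the full timestep schedule and followed by VAE decoding, as in step 5 above.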
4. Evaluation: MoveBench Benchmark and Quantitative Performance
The MoveBench benchmark is introduced for rigorous, large-scale motion-controllable video evaluation:
- Composition: 1,018 videos (5s duration, 81 frames, 480p), 54 diverse content categories, single/multi-object scenarios, annotated with dense point tracks and segmentation masks.
- Metrics: Video/image fidelity (FID, FVD, PSNR, SSIM), motion accuracy (end-point error, EPE; a minimal EPE computation is sketched after this list), and user preference (two-alternative forced-choice, 2AFC, studies) (Chu et al., 9 Dec 2025).
- Results: Wan-Move achieves FID 12.2, FVD 83.5, PSNR 17.8, SSIM 0.64, and EPE 2.6 on single-object cases. Ablations reveal further gains from denser trajectories at inference (up to EPE 1.1, PSNR 21.9, SSIM 0.79 at 1024-point density).
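For reference, end-point error (EPE) is the mean Euclidean distance between predicted and ground-truth track positions; the sketch below is a generic implementation under assumed array shapes, not MoveBench's evaluation script.

```python
import numpy as np

def end_point_error(pred_tracks, gt_tracks, valid=None):
    """Mean end-point error (EPE) between predicted and ground-truth tracks.

    pred_tracks, gt_tracks: arrays of shape (T, N, 2) with (x, y) positions.
    valid: optional (T, N) boolean mask of visible/annotated points.
    """
    err = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)   # (T, N) distances
    if valid is not None:
        return float(err[valid].mean())
    return float(err.mean())
```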
Comparative user studies (2AFC) show nearly unanimous preference for Wan-Move over strong baselines such as LeviTor in both motion accuracy (98%) and visual quality (98.8%), and on-par performance against high-end proprietary tools (Kling 1.5 Pro) (Chu et al., 9 Dec 2025). MoveBench's scale and annotation fidelity enable robust, reproducible evaluation, setting a new standard for the community.
5. Ablation Studies and Comparative Analysis
Extensive ablations clarify the significance of Wan-Move's design choices:
| Strategy | FID (↓) | FVD (↓) | EPE (↓) | PSNR (↑) | SSIM (↑) | Added latency |
|---|---|---|---|---|---|---|
| Pixel-level replication | — | — | 3.7 | 15.3 | 0.56 | — |
| Random latent embedding | — | — | 2.7 | 16.1 | 0.59 | — |
| Latent-feature replication (ours) | 12.2 | 83.5 | 2.6 | 17.8 | 0.64 | +3s |
| ControlNet injection | 12.4 | — | 2.5 | — | — | +225s |
- Latent-feature propagation delivers optimal accuracy and efficiency.
- Dense trajectory usage at inference monotonically improves EPE and spatiotemporal quality.
- Direct concatenation is computationally efficient versus adapters like ControlNet.
- Using too many tracks during training can cause a mismatch with sparser downstream control signals, whereas extreme densification at inference remains beneficial.
- Backbone/data-scale ablations confirm consistent superiority to MagicMotion, Tora, and other contemporary methods.
- Robustness is observed in extreme or out-of-domain scenarios, where Wan-Move retains up to 40% lower error than competing methods (Chu et al., 9 Dec 2025).
6. Applications and Extensions
Wan-Move's latent trajectory guidance paradigm supports a broad set of controllable video synthesis operations:
- Single-/multi-object dragging: Direct region-of-interest manipulation through trajectory specification.
- Camera motion: Pan/dolly effects by specifying global trajectory fields (a pan/dolly sketch follows this list).
- 3D object rotation: Achieved by propagating first-frame features via depth-to-3D-to-2D projected tracks.
- Motion transfer: Application of extracted trajectories from source to arbitrary reference image frames enables motion-style transfer.
- Compositional editing: Multiple motion conditions can be superimposed, permitting complex scene or style replays (Chu et al., 9 Dec 2025).
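To illustrate the camera-motion case referenced above, the following is a minimal NumPy sketch that builds a global trajectory field combining a pan (uniform shift) with an approximate dolly (scaling about the frame center); the interface and parameter names are illustrative assumptions.

```python
import numpy as np

def camera_trajectories(height, width, grid_size, num_frames,
                        pan_per_frame=(0.0, 0.0), zoom_per_frame=0.0):
    """Global trajectory field emulating camera motion.

    pan_per_frame:  (dx, dy) uniform pixel shift applied every frame (pan).
    zoom_per_frame: fractional scale change per frame about the frame center
                    (positive values approximate a dolly-in).
    Returns (num_frames, grid_size**2, 2) pixel-space trajectories.
    """
    ys = np.linspace(0, height - 1, grid_size)
    xs = np.linspace(0, width - 1, grid_size)
    gx, gy = np.meshgrid(xs, ys)
    base = np.stack([gx.ravel(), gy.ravel()], axis=-1)           # (N, 2)
    center = np.array([(width - 1) / 2.0, (height - 1) / 2.0])

    frames = []
    for t in range(num_frames):
        scale = (1.0 + zoom_per_frame) ** t
        pts = center + (base - center) * scale                   # dolly / zoom
        pts = pts + np.array(pan_per_frame) * t                  # pan
        frames.append(pts)
    tracks = np.stack(frames).astype(np.float32)                 # (T, N, 2)
    tracks[..., 0] = np.clip(tracks[..., 0], 0, width - 1)
    tracks[..., 1] = np.clip(tracks[..., 1], 0, height - 1)
    return tracks
```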
This suggests broad utility for animation, visual effects, scientific visualization, and content creation, with direct user-editable motion fields and seamless integration atop scalable video diffusion models. The architecture's plug-and-play nature—with no need for motion encoders or fusion blocks—implies high extensibility to new backbones or modalities.
7. Position within the Motion-Controllable Video Generation Landscape
Wan-Move's introduction marks a shift in controllable video generation:
- Relative to mask-based and optical flow methods: Unlike mask guidance (Feng et al., 24 Mar 2025), explicit inpainting (Hu et al., 2023), or flow-based priors (Lei et al., 16 Nov 2024), Wan-Move achieves state-of-the-art controllability and fidelity with a minimal and generalizable procedural change at the latent condition level.
- Advantages over token/disentanglement strategies: Token-level approaches such as TokenMotion (Li et al., 11 Apr 2025) yield fine-grained control but often require elaborate separate encoding and fusion pipelines; Wan-Move's direct manipulation in latent space achieves similar or better efficacy with simpler integration.
- In contrast to reward or reinforcement mechanisms: No reinforcement- or reward-based objective or module is present in the Wan-Move pipeline; all supervision derives from a standard flow-matching loss on the motion-conditioned backbone (a minimal sketch of this loss follows below).
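For context, a standard (rectified) flow-matching objective regresses the model's predicted velocity onto the straight-line velocity between data latents and noise; the PyTorch sketch below is a generic formulation under an assumed conditioning interface, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_theta, x0, z_image, z_global, z_text):
    """Generic rectified flow-matching loss.

    x0: clean video latents, shape (B, C, T, H, W).
    The motion-conditioned latent z_image and the other conditions are passed
    through to the velocity network v_theta (assumed signature).
    """
    b = x0.shape[0]
    noise = torch.randn_like(x0)
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1, 1)     # uniform timesteps

    x_t = (1.0 - t) * x0 + t * noise                            # linear interpolation path
    target_v = noise - x0                                       # straight-line velocity

    x_in = torch.cat([x_t, z_image], dim=1)                     # channel-wise conditioning
    pred_v = v_theta(x_in, t.flatten(), z_global, z_text)
    return F.mse_loss(pred_v, target_v)
```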
Interpreted broadly, Wan-Move establishes latent-feature propagation along dense, user-editable trajectories as a foundational method for precise, efficient video motion control (Chu et al., 9 Dec 2025).