Wan: Open and Advanced Large-Scale Video Generative Models (2503.20314v2)

Published 26 Mar 2025 in cs.CV

Abstract: This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.

Wan: Open and Advanced Large-Scale Video Generative Models introduces Wan, a comprehensive and open suite of video foundation models. Built on the diffusion transformer paradigm, Wan pushes the boundaries of video generation through innovations in spatio-temporal VAE design, scalable pre-training strategies, large-scale data curation, and automated evaluation. The work emphasizes practical implementation, efficiency, and openness to foster community growth.

The key features of Wan are:

  • Leading Performance: The 14B model, trained on billions of images and videos, demonstrates strong scaling properties and outperforms existing open-source and commercial solutions on multiple benchmarks.
  • Comprehensiveness: Wan offers two models (1.3B and 14B parameters) balancing efficiency and effectiveness, supporting up to eight downstream tasks including image-to-video, instruction-guided video editing, and personalized video generation. It is also the first video generation model capable of producing visual text in both Chinese and English.
  • Consumer-Grade Efficiency: The 1.3B model requires only 8.19 GB VRAM, making it accessible on consumer GPUs, while maintaining competitive performance for text-to-video tasks.
  • Openness: The entire Wan series, including code and models, is open-sourced to advance the field.

Data Processing Pipeline

High-quality, diverse, and large-scale data is crucial. The data construction pipeline involves:

  1. Pre-training Data: Curated from internal and public sources and processed through a four-step cleaning process (a minimal filtering sketch appears after this list):
    • Fundamental Dimensions: Filtering based on text coverage, aesthetic score (LAION-5B classifier), NSFW score, watermarks, black borders, overexposure, synthetic images, blur, duration, and resolution.
    • Visual Quality: Clustering data (e.g., into 100 clusters) and scoring using an expert assessment model trained on manual annotations to select high-quality data while preserving distribution.
    • Motion Quality: Classifying videos into six tiers (optimal, medium, static, camera-driven, low-quality, shaky) based on motion analysis (amplitude, smoothness, jitter). Lower sampling priority or exclusion is applied to static, camera-driven, and low-quality/shaky videos.
    • Visual Text Data: Synthesizing text-containing images (Chinese/English on white background) and collecting real-world text images. OCR models recognize text, which is then used by a multimodal LLM (Qwen2-VL) to generate dense descriptions incorporating precise text content. This integrates both synthetic and real data for robust visual text generation.
  2. Post-training Data: Focuses on improving visual fidelity and motion dynamics with high-quality data.
    • Image Processing: Selecting top-quality images based on scores, composition, and details, manually collecting curated images to ensure diversity and fill missing concepts.
    • Video Processing: Filtering top-ranked videos based on visual and motion quality, selecting millions of videos with simple and complex movements, ensuring category balance (12 major categories).
  3. Dense Video Caption: To improve prompt adherence (inspired by DALL-E 3), an internal caption model generates dense captions for each image and video.
    • Open Source Dataset: Collected widely used vision-language datasets (captioning, visual QA, text instructions).
    • In-house Dataset: Curated for specific tasks like recognizing celebrities/landmarks/characters (using LLMs and CLIP-style models), object counting (using LLM and Grounding DINO), OCR-augmented captions (using OCR results as prior), camera angle/motion prediction (annotating videos and training expert models), fine-grained categories, relational understanding, re-captioning, editing instruction captions, and group image descriptions. Human-annotated data is used in the final stage.
    • Model Design: A LLaVA-style architecture (Liu et al., 2023 ) with a ViT encoder, a 2-layer MLP projector, and a Qwen LLM (Wang et al., 5 Sep 2024 ). Supports dynamic high resolution for images and slow-fast encoding for videos to reduce computation (see the frame-sampling sketch after this list).
    • Training: Three stages: first train only the MLP projector (ViT and LLM frozen), then unfreeze all parameters, and finally fine-tune end-to-end on high-quality data.
    • Evaluation: Automated caption evaluation pipeline using CAPability (Liu et al., 19 Feb 2025 ) focusing on ten dimensions, comparing against Google Gemini 1.5 Pro.
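
The fundamental-dimension filtering and motion-quality tiering described above can be pictured as a single scoring-and-routing pass over per-clip metadata. The sketch below is a minimal illustration of that idea, not the paper's implementation; the thresholds, field names, and the ClipRecord structure are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical per-clip metadata; the real pipeline attaches many more signals
# (text coverage, black borders, overexposure, synthetic-image flags, ...).
@dataclass
class ClipRecord:
    aesthetic: float   # LAION-style aesthetic score, higher is better
    nsfw: float        # NSFW probability, lower is better
    watermark: float   # watermark probability
    duration_s: float  # clip duration in seconds
    height: int        # vertical resolution
    motion_tier: str   # "optimal" | "medium" | "static" | "camera-driven" | "low-quality" | "shaky"

def route_clip(c: ClipRecord) -> Optional[float]:
    """Return a sampling weight for the clip, or None to drop it.

    Thresholds are illustrative placeholders, not values from the paper.
    """
    # Fundamental-dimension filters: drop clearly unusable material first.
    if c.nsfw > 0.5 or c.watermark > 0.5:
        return None
    if c.duration_s < 2.0 or c.height < 256 or c.aesthetic < 4.0:
        return None
    # Motion-quality tiering: exclude low-quality/shaky clips and
    # down-weight static or purely camera-driven footage.
    if c.motion_tier in ("low-quality", "shaky"):
        return None
    return {"optimal": 1.0, "medium": 0.7, "static": 0.2, "camera-driven": 0.3}.get(c.motion_tier, 0.5)

def build_sampling_pool(records: List[ClipRecord]) -> List[Tuple[ClipRecord, float]]:
    """Keep surviving clips together with their sampling weights."""
    return [(r, w) for r in records if (w := route_clip(r)) is not None]
```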
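
The caption model's slow-fast video encoding can be thought of as sampling a sparse set of high-detail ("slow") frames plus a denser set of coarsely encoded ("fast") frames. The snippet below sketches only that frame-sampling step under assumed parameters; it is not the architecture from the report.

```python
import numpy as np

def slow_fast_indices(num_frames: int, slow_k: int = 8, fast_k: int = 32):
    """Pick frame indices for a slow pathway (few frames, many tokens each)
    and a fast pathway (many frames, aggressively pooled tokens).

    slow_k and fast_k are illustrative defaults, not values from the paper.
    """
    slow = np.linspace(0, num_frames - 1, num=min(slow_k, num_frames)).round().astype(int)
    fast = np.linspace(0, num_frames - 1, num=min(fast_k, num_frames)).round().astype(int)
    # The fast pathway skips frames already covered at full detail by the slow one.
    fast = np.setdiff1d(fast, slow)
    return slow.tolist(), fast.tolist()

# Example: a 120-frame clip yields 8 detailed frames plus a larger set of coarse frames.
slow, fast = slow_fast_indices(120)
print(len(slow), len(fast))
```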

Model Design and Acceleration

Wan is based on the Diffusion Transformer (DiT) (Shen et al., 2023 ) architecture, consisting of Wan-VAE, a diffusion transformer backbone, and a text encoder.

  • Spatio-temporal Variational Autoencoder (Wan-VAE):
    • Model Design: A novel 3D causal VAE architecture (Fig. 4) compresses video spatio-temporal dimensions by 4×8×8. Uses RMSNorm (Bennett et al., 2019 ) for temporal causality and halves input channels in spatial upsampling for memory reduction. Compact size (127M parameters).
    • Training: Three stages: Train 2D image VAE, inflate to 3D causal VAE and train on low-res/few-frame videos (L1, KL, LPIPS loss), fine-tune on high-quality videos with 3D GAN loss (Gao et al., 2020 ).
    • Efficient Inference: A feature cache mechanism for causal convolution and temporal downsampling supports arbitrarily long videos, processed chunk by chunk (Fig. 5). Frame-level feature caches from preceding chunks are maintained to ensure continuity (see the chunked-decoding sketch after this list).
    • Evaluation: Quantitative evaluation (PSNR, frames per second) against SOTA video VAEs (Fig. 6) shows competitive performance and efficiency (2.5x faster reconstruction than HunyuanVideo (Kong et al., 3 Dec 2024 )). Qualitative results (Fig. 7) show superior detail preservation and sharpness.
  • Video Diffusion Transformer:
    • Diffusion Transformer: Uses a patchifying module (3D convolution), transformer blocks (Fig. 8), and an unpatchifying module. Cross-attention embeds text conditions. A shared MLP processes time embeddings and predicts modulation parameters, reducing parameter count by ~25% while improving performance (see the shared-modulation sketch after this list).
    • Text Encoder: Uses umT5 (Chung et al., 2023 ) due to strong multilingual encoding (Chinese/English), ability to understand visual text, better composition than unidirectional models, and faster convergence.
  • Model Training: Leverages the flow matching framework (Lipman et al., 2022 , Agarwal et al., 22 Mar 2024 ) for a unified denoising process.
    • Training Objective: Predicts the velocity $v_t = x_1 - x_0$ given $x_t = t x_1 + (1-t) x_0$ and text condition $c_{txt}$, minimizing the MSE loss $\mathcal{L} = \mathbb{E}\,||u(x_t, c_{txt}, t; \theta) - v_t||^2$ (see the training-step sketch after this list).
    • Image Pre-training: Initializes 14B model with low-resolution (256px) text-to-image pre-training to establish cross-modal alignment and structure fidelity.
    • Image-video Joint Training: Staged training with progressively increasing resolutions (256px, 480px, 720px) and fixed 5-second video duration.
    • Pre-training Configurations: bf16-mixed precision, AdamW optimizer (Loshchilov et al., 2017 ), dynamic learning rate.
  • Post-training: Fine-tunes the pre-trained checkpoint on high-quality post-training data at 480px and 720px resolutions.
  • Model Scaling and Training Efficiency:
    • Workload Analysis: The attention mechanism is the primary bottleneck: its computational cost grows quadratically with sequence length while its memory grows linearly. Activation storage becomes significant for long sequences.
    • Parallelism Strategy: DP + FSDP (Zhao et al., 2023 ) for VAE and Text Encoder. DiT uses a combination of DP and 2D Context Parallelism (CP) (Jacobs et al., 2023 , Liu et al., 2023 ) (Fig. 9), combining Ring Attention and Ulysses for efficient sharding along the sequence dimension and minimizing communication overhead. A strategy switching mechanism avoids redundant computation when VAE/Text Encoder outputs are fed to DiT.
    • Memory Optimization: Prioritizes activation offloading (Hendrycks et al., 2016 ) and combines it with Gradient Checkpointing (GC) (Chen et al., 2016 ) for layers with high memory-to-computation ratios.
    • Cluster Reliability: Utilizes Alibaba Cloud's scheduling, slow machine detection, and self-healing for high stability.
  • Inference: Aims to minimize latency.
    • Parallel Strategy: Uses CP and FSDP for model sharding and latency reduction (Fig. 10 shows near-linear speedup).
    • Diffusion Cache: Leverages attention and CFG similarities across sampling steps (similar to DiTFastAttn (Lv et al., 25 Oct 2024 ) and FasterCache (Lv et al., 25 Oct 2024 )) to reduce computation. Caches attention results and reuses conditional DiT outputs for unconditional parts with residual compensation. Improves inference performance by 1.62x.
    • Quantization: FP8 GEMM delivers 2x the throughput of BF16 and a 1.13x speedup in DiT. 8-bit FlashAttention (an optimized FA3-FP8 kernel with mixed Int8/FP8 computation and FP32 accumulation) boosts efficiency by >1.27x on Hopper GPUs. Int8 and TensorRT quantization are explored for consumer-level devices.
  • Prompt Alignment: Aligns user prompts with training captions.
    • Augments training data with diverse captions.
    • Rewrites user prompts using LLMs (e.g., Qwen2.5-Plus (Wang et al., 5 Sep 2024 )) to add details, incorporate motion attributes, and structure them like post-training captions (Table 1).
  • Benchmarks: Proposes Wan-Bench for automated, comprehensive, and human-aligned evaluation.
    • Three core dimensions: Dynamic Quality (Large motion generation, Human artifacts, Physical plausibility & smoothness, Pixel-level stability, ID consistency), Image Quality (Comprehensive image quality, Scene generation quality, Stylization), and Instruction Following (Single/Multiple objects, Spatial positions, Camera control, Action instruction following).
    • Uses traditional detectors and MLLMs (Qwen2-VL) for scoring.
    • A human feedback guided weighting strategy aligns dimension scores with user preferences (Section 4.6.1).
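
The Wan-VAE feature cache can be pictured as decoding a long latent sequence chunk by chunk while carrying a short temporal tail of features forward, so causal convolutions see the same context they would in a single pass. The toy sketch below demonstrates that caching pattern with one causal temporal convolution; module names, sizes, and the single-layer setup are assumptions, not the released code.

```python
import torch
import torch.nn as nn
from typing import Optional

class CausalTemporalConv(nn.Module):
    """Causal convolution over time with a frame-level feature cache, showing
    how chunk-wise decoding can reproduce full-sequence decoding exactly."""

    def __init__(self, channels: int = 8, kernel_t: int = 3):
        super().__init__()
        self.pad = kernel_t - 1
        self.conv = nn.Conv1d(channels, channels, kernel_t)  # no padding; handled manually

    def forward(self, x: torch.Tensor, cache: Optional[torch.Tensor] = None):
        # x: (batch, channels, time). Prepend cached frames (zeros for the first
        # chunk) so the convolution only looks backwards in time.
        if cache is None:
            cache = x.new_zeros(x.shape[0], x.shape[1], self.pad)
        padded = torch.cat([cache, x], dim=-1)
        y = self.conv(padded)
        new_cache = padded[..., -self.pad:]  # last frames feed the next chunk
        return y, new_cache

torch.manual_seed(0)
layer = CausalTemporalConv()
x = torch.randn(1, 8, 16)

# Full-sequence decoding vs. two chunked passes that reuse the cache.
full, _ = layer(x)
y1, cache = layer(x[..., :8])
y2, _ = layer(x[..., 8:], cache)
print(torch.allclose(full, torch.cat([y1, y2], dim=-1)))  # True
```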
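
The shared time-embedding MLP mentioned above can be read as a single small network, reused by every DiT block, that maps the timestep embedding to six modulation vectors (shift/scale/gate for self-attention and for the FFN) instead of duplicating that MLP per block. A minimal sketch; the dimensions and the per-block learnable offset are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SharedModulation(nn.Module):
    """One MLP shared across all blocks predicts six modulation vectors;
    each block keeps only a lightweight learnable offset (an assumption here)."""

    def __init__(self, dim: int = 256, num_blocks: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        self.block_offsets = nn.Parameter(torch.zeros(num_blocks, 6 * dim))

    def forward(self, t_emb: torch.Tensor, block_idx: int):
        # t_emb: (batch, dim) -> six (batch, dim) modulation tensors.
        mod = self.mlp(t_emb) + self.block_offsets[block_idx]
        return mod.chunk(6, dim=-1)

mod = SharedModulation()
shift_a, scale_a, gate_a, shift_f, scale_f, gate_f = mod(torch.randn(2, 256), block_idx=0)
print(shift_a.shape)  # torch.Size([2, 256])
# Inside a block: x = x + gate_a.unsqueeze(1) * attn((1 + scale_a.unsqueeze(1)) * norm(x) + shift_a.unsqueeze(1))
```

Sharing one modulation network rather than predicting all parameters separately in every block is where the ~25% parameter reduction cited above comes from.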
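
The flow-matching objective above translates directly into a short training step: sample a timestep, interpolate between noise x_0 and data x_1, and regress the network output onto the constant velocity x_1 - x_0. A minimal sketch, assuming a generic model(x_t, text_emb, t) callable rather than Wan's actual interfaces:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """One training step of L = E ||u(x_t, c_txt, t) - (x1 - x0)||^2.

    `model(x_t, text_emb, t)` is a placeholder signature, not Wan's API.
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                       # noise endpoint of the path
    t = torch.rand(b, device=x1.device)             # one timestep per sample
    t_ = t.view(b, *([1] * (x1.dim() - 1)))         # broadcast over latent dims
    x_t = t_ * x1 + (1.0 - t_) * x0                 # x_t = t*x1 + (1-t)*x0
    v_target = x1 - x0                              # velocity along the straight path
    v_pred = model(x_t, text_emb, t)
    return F.mse_loss(v_pred, v_target)

# Toy usage with a dummy network that ignores its conditioning.
dummy = lambda x, c, t: torch.zeros_like(x)
print(flow_matching_loss(dummy, torch.randn(2, 4, 8, 8), torch.randn(2, 16)).item())
```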

Evaluation

  • Metrics and Results: Wan is compared with commercial (Kling [2406.xxxx], Hailuo [2409.xxxx], Sora [2024], Runway [2406.xxxx], Vidu (Bao et al., 7 May 2024 )) and open-source (Mochi [2024], CogVideoX [2025], Hunyuan (Kong et al., 3 Dec 2024 )) models.
    • Quantitative: Wan-Bench evaluation on 1,035 samples shows Wan outperforming competitors on weighted scores (Table 2); a toy score-aggregation sketch appears after this list.
    • Qualitative: Wan generates diverse, high-quality videos with complex motions, physical interactions, artistic styles, cinematic visuals, and accurate bilingual text (Fig. 1, 3, 12).
    • Human Evaluation: Pairwise comparisons on >700 tasks annotated by >20 individuals show Wan 14B excels across visual quality, motion quality, matching, and overall ranking (Table 3).
    • Public Leaderboard: Wan 14B achieves state-of-the-art on the VBench leaderboard [2023], outperforming Sora [2024] and Hailuo [2409.xxxx]. Wan 1.3B (83.96%) also surpasses several larger commercial and open-source models (Table 4).
  • Ablation Study: Experiments on the 1.3B version.
    • Adaptive normalization: Fully shared AdaLN (Clark et al., 2018 ) is more parameter-efficient and performs better than non-shared or partially shared configurations, supporting focus on model depth (Fig. 13).
    • Text encoder: umT5 shows superior text embedding performance compared to Qwen2.5-7B-Instruct, GLM-4-9B, and Qwen-VL-7B (Bai et al., 2023 ) (Fig. 14, Table 6).
    • Autoencoder: The proposed VAE achieves lower FID scores than VAE-D (diffusion loss based) on image generation (Table 5).
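
The weighted Wan-Bench scores referenced above amount to aggregating per-dimension metric scores with weights aligned to human preferences. The toy sketch below shows only that aggregation step; the dimension names and weights are placeholders, not the benchmark's actual values.

```python
# Per-dimension scores in [0, 1] for one model (placeholder numbers).
scores = {
    "large_motion": 0.71,
    "physical_plausibility": 0.66,
    "image_quality": 0.82,
    "instruction_following": 0.78,
}

# Weights fit from human preference data in the real benchmark; the values
# here are purely illustrative and sum to 1 for a normalized score.
weights = {
    "large_motion": 0.30,
    "physical_plausibility": 0.25,
    "image_quality": 0.20,
    "instruction_following": 0.25,
}

weighted_score = sum(weights[k] * scores[k] for k in scores)
print(round(weighted_score, 4))  # 0.737
```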

Extended Applications

Wan's foundational capabilities are extended to several downstream tasks:

  • Image-to-Video Generation (I2V): Synthesizes video from a static image and a text prompt. The model extends the T2V framework by concatenating a conditional latent, a frame mask, and global image context (from a CLIP image encoder (Hatamizadeh et al., 2021 )) (Fig. 15); a conditioning sketch appears after this list. The masking mechanism supports various tasks (I2V, video continuation, first-and-last-frame transformation, interpolation). The dataset is filtered for first-frame-to-video consistency. SFT incorporates an image encoder for global context. Human evaluation shows favorable performance compared to SOTA (Table 7, Fig. 16, 17).
  • Unified Video Editing: Builds on the VACE (Jiang et al., 10 Mar 2025 ) framework for controllable generation and editing. VCU unifies diverse inputs (text, video frames, masks). Concept decoupling separates modifiable ($F_c$) and preserved ($F_k$) pixels based on masks (Fig. 18). Supports full fine-tuning or Context Adapter Tuning (pluggable Res-Tuning (Wu et al., 2023 )). Data construction involves shot slicing, instance-level analysis (RAM (Zhang et al., 2023 ), Grounding DINO (Liu et al., 2023 ), SAM2 [2025]), and task-specific tailoring. Exhibits high video quality, temporal consistency, and supports task combinations (Fig. 19, 20, 21).
  • Text-to-Image Generation (T2I): Joint training on extensive image data results in exceptional T2I performance (Fig. 22), showcasing cross-modal knowledge transfer.
  • Video Personalization: Generates videos maintaining identity from a reference image. Conditions generation directly on the VAE latent of the segmented face image (extracted from paired video) concatenated with extended frames and masks (Fig. 23). Training randomly drops face images to support 0-K references. Dataset includes filtered human videos and automatically synthesized faces (Instant-ID (Wang et al., 15 Jan 2024 )). Achieves competitive ID-fidelity (ArcFace similarity) compared to commercial competitors (Table 8, Fig. 24).
  • Camera Motion Controllability: Guides video generation using camera trajectories. Processes extrinsic/intrinsic parameters into fine-grained per-pixel positions (Plücker coordinates) and encodes them using a Camera Pose Encoder (convolutional modules). A Camera Pose Adapter integrates the features into DiT blocks via adaptive normalization (Fig. 25). Uses VGG-SfM [2024] to extract trajectories from training videos. Shows effective camera motion guidance (Fig. 26); a Plücker-embedding sketch appears after this list.
  • Real-time Video Generation: Adapts the fixed-length T2V model to a real-time streaming pipeline. Streamer uses a sliding temporal window and denoising queue for infinite-length generation (Fig. 27, 28). Continuity is ensured by reintroducing cached tokens. Integrates with Latent Consistency Models (LCM/VideoLCM (Luo et al., 2023 , Wang et al., 2023 )) for 10-20x acceleration (8-16 FPS). Quantization (int8, TensorRT [2023]) optimizes for consumer GPUs.
  • Audio Generation (V2A): Synthesizes synchronized soundtracks (ambient sound, music) for videos. Uses a DiT based on flow-matching. A 1D-VAE compresses raw waveforms to preserve temporal information (Fig. 29). CLIP extracts visual embeddings, temporally aligned with audio features. umT5 encodes text. Multimodal fusion via summation. Data is filtered to remove speech/vocals, supplemented with dense video + audio captions (ambient, music style) generated by Qwen2-audio (Chu et al., 15 Jul 2024 ). Random caption masking encourages learning from visual cues. Generates stereo audio (44.1 kHz, max 12s). Shows superior long-term consistency, cleaner audio, and better rhythm synthesis than MMAudio (Cheng et al., 19 Dec 2024 ) (Fig. 30). Currently limited in generating human vocal sounds.
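
The I2V conditioning described above can be pictured as channel-wise concatenation of the noisy video latent, a binary frame mask, and the VAE latent of the conditioning frame(s) zero-padded over time. The sketch below assembles such a tensor with assumed shapes and channel ordering; it illustrates the masking idea rather than Wan's exact layout.

```python
import torch

def build_i2v_condition(noisy_latent: torch.Tensor, first_frame_latent: torch.Tensor) -> torch.Tensor:
    """noisy_latent: (B, C, T, H, W); first_frame_latent: (B, C, 1, H, W).

    Returns the DiT input: noisy latent, a mask marking which latent frames are
    given, and the conditioning latent padded with zeros elsewhere.
    """
    B, C, T, H, W = noisy_latent.shape
    cond = torch.zeros_like(noisy_latent)
    cond[:, :, :1] = first_frame_latent                   # only the first frame is provided
    mask = torch.zeros(B, 1, T, H, W, device=noisy_latent.device)
    mask[:, :, :1] = 1.0                                   # 1 = conditioning frame, 0 = frame to generate
    return torch.cat([noisy_latent, mask, cond], dim=1)    # (B, 2C + 1, T, H, W)

x = build_i2v_condition(torch.randn(1, 16, 5, 8, 8), torch.randn(1, 16, 1, 8, 8))
print(x.shape)  # torch.Size([1, 33, 5, 8, 8])
```

Marking different frames in the mask (the last frame, both endpoints, or interleaved frames) yields the continuation, first-and-last-frame, and interpolation variants mentioned above.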
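
The camera-pose conditioning converts per-frame extrinsics and intrinsics into a per-pixel Plücker embedding: each pixel's viewing ray is represented by its unit direction d and moment o × d, giving a 6-channel map the Camera Pose Encoder can consume. A minimal sketch under assumed conventions (pinhole intrinsics, camera-to-world extrinsics, pixel-center offsets):

```python
import torch

def plucker_embedding(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world extrinsics.

    Returns a (6, H, W) map of (o x d, d) per pixel: the Plücker coordinates
    of each viewing ray. Conventions here are assumptions for illustration.
    """
    ys, xs = torch.meshgrid(torch.arange(H) + 0.5, torch.arange(W) + 0.5, indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)         # (H, W, 3) homogeneous pixel coords
    dirs_cam = pix @ torch.linalg.inv(K).T                           # back-project to camera-frame rays
    dirs_world = dirs_cam @ c2w[:3, :3].T                            # rotate rays into the world frame
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)  # unit ray directions d
    origin = c2w[:3, 3].expand_as(dirs_world)                        # camera center o for every pixel
    moment = torch.cross(origin, dirs_world, dim=-1)                 # o x d
    return torch.cat([moment, dirs_world], dim=-1).permute(2, 0, 1)  # (6, H, W)

K = torch.tensor([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
emb = plucker_embedding(K, torch.eye(4), H=64, W=64)
print(emb.shape)  # torch.Size([6, 64, 64])
```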

Limitations and Conclusion

  • Limitations: Challenges remain in preserving fine-grained details during large motion; the high computational cost of 14B-model inference (~30 minutes on a single GPU without optimization) hinders accessibility; and domain-specific expertise is still limited.
  • Conclusion: Wan sets a new benchmark, demonstrating advancements in motion amplitude, instruction following, and visual text generation. Detailed insights into architecture, data, training, evaluation, and applications are provided. The 1.3B model offers consumer-grade efficiency. Open-sourcing aims to foster community development and address limitations. Future work will focus on scaling data and models for improved fidelity in complex scenarios and broader accessibility.