Wan 2.2: SOTA Video Diffusion Model
- Wan 2.2 is a state-of-the-art video diffusion model suite that integrates a spatio-temporal VAE and a diffusion transformer to enable robust generative video modeling.
- It offers two key variants—Wan-1.3B for resource-efficient usage and Wan-14B for maximal fidelity—supporting diverse applications like video editing and real-time streaming.
- The model leverages rigorous data curation, large-scale GPU training, and innovative inference techniques to achieve competitive FVD scores and enhanced video quality.
Wan 2.2 is a large-scale, open-source video diffusion model suite developed as part of the Wan project, advancing the field of generative video modeling. Building upon the diffusion transformer paradigm, Wan 2.2 integrates architectural innovations in spatio-temporal encoding, scalable pretraining strategies, model efficiency, and flexible downstream applications. It supports both research and production use through two principal variants: Wan-1.3B (efficient, resource-conscious) and Wan-14B (maximum fidelity and scale). The suite is open-sourced, with all models and training scripts available for academic and commercial exploration (Wan et al., 26 Mar 2025).
1. Core Model Architecture
Spatio-Temporal VAE ("Wan-VAE")
Wan-VAE is a critical backbone component, compressing high-resolution videos into a low-dimensional latent space for subsequent diffusion-based modeling. The compression pipeline is as follows:
- Input: a raw video of $1+T$ frames in pixel space, shape $[3,\,1+T,\,H,\,W]$.
- Output: a latent of shape $[16,\,1+T/4,\,H/8,\,W/8]$, i.e., 4× temporal and 8×8 spatial compression into 16 channels.
- Causality: 3D causal convolutions prevent temporal leakage from future frames to past frames (a padding sketch follows this list).
- Normalization: RMSNorm is used instead of GroupNorm, enabling feature-caching and memory efficiency during inference.
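A minimal sketch of the causal-padding idea behind these 3D convolutions, assuming PyTorch; the layer below is illustrative and not the released Wan-VAE implementation (the real encoder additionally applies strided downsampling to reach the 4×8×8 compression):

```python
import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    """3D convolution that pads only the 'past' side of the time axis,
    so an output frame never depends on future input frames."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1                                  # all temporal padding on the past side
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)   # symmetric spatial padding
        self.conv = nn.Conv3d(in_ch, out_ch, kernel)

    def forward(self, x):  # x: [B, C, T, H, W]
        # F.pad on a 5D tensor takes (W_left, W_right, H_top, H_bottom, T_front, T_back)
        x = nn.functional.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

# Causality check: a 17-frame clip keeps its temporal length and never "sees" future frames.
video = torch.randn(1, 3, 17, 256, 256)
print(CausalConv3d(3, 16)(video).shape)   # torch.Size([1, 16, 17, 256, 256])
```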
The VAE is trained in three phases:
- 2D-VAE pretraining on images (pixel reconstruction, KL, and LPIPS losses).
- 3D inflation and training on short (5-frame, 128×128) videos.
- Resolution-varied fine-tuning with GAN loss.
The overall training objective combines these terms: $\mathcal{L} = \mathcal{L}_{\text{recon}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \lambda_{\text{LPIPS}}\,\mathcal{L}_{\text{LPIPS}}$, where the weights $\lambda_{\text{KL}}$ and $\lambda_{\text{LPIPS}}$ balance latent regularization and perceptual quality against reconstruction fidelity.
Diffusion Transformer Backbone ("DiT")
Within the latent space, conditional diffusion operates via a transformer-based network:
- Forward perturbation: $x_t = (1 - t)\,x_0 + t\,\epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$ and $t \in [0, 1]$.
- Target velocity: $v_t = \frac{dx_t}{dt} = \epsilon - x_0$.
- The model predicts a velocity field $u_\theta(x_t, t, c)$, trained with the denoising loss $\mathcal{L} = \mathbb{E}\big[\lVert u_\theta(x_t, t, c) - v_t \rVert^2\big]$ (a training-step sketch follows this list).
- Patchify: 3D convolution splits latent into tokens; flatten for transformer processing.
- Transformer design: blockwise self-attention, cross-attention to text (umT5 5.3B, 512 tokens), two-layer MLPs, and AdaLN time-modulation whose parameters are shared across blocks to reduce the total parameter count.
- Time schedule: uses flow matching (rectified flows) instead of classical DDPM noise schedules.
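A hedged sketch of one training step implied by the equations above, assuming a PyTorch DiT whose call signature `model(x_t, t, text_emb)` is an illustrative placeholder rather than the released interface:

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x0, text_emb):
    """One rectified-flow training step on clean latents x0: [B, C, T', H', W']."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)              # t ~ U(0, 1)
    eps = torch.randn_like(x0)                       # Gaussian noise endpoint
    t_ = t.view(b, 1, 1, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * eps                 # linear interpolation between data and noise
    v_target = eps - x0                              # constant velocity along the straight path
    v_pred = model(x_t, t, text_emb)                 # DiT predicts the velocity
    return F.mse_loss(v_pred, v_target)              # denoising (velocity-matching) loss
```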
Model sizes and configurations:
| Variant | # Transformer blocks | Hidden dim | # Heads | VAE size | Total params |
|---|---|---|---|---|---|
| Wan-1.3B | 30 | 1,536 | 12 | 127M | 1.3B |
| Wan-14B | 40 | 5,120 | 40 | 127M | 14B |
Feature-cache inference enables memory-efficient video encoding/decoding by chunking and tracking convolutional state.
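A minimal sketch of this feature-cache pattern for chunked decoding; `decode_chunk` and its cache layout are hypothetical stand-ins for the real decoder, which carries trailing causal-convolution activations from one chunk to the next:

```python
import torch

@torch.no_grad()
def decode_in_chunks(vae_decoder, latents, chunk_len=4):
    """Decode a long latent video [B, C, T', H', W'] in temporal chunks,
    reusing the causal-conv state (cache) left behind by the previous chunk."""
    frames, cache = [], None
    for start in range(0, latents.shape[2], chunk_len):
        chunk = latents[:, :, start:start + chunk_len]
        pixels, cache = vae_decoder.decode_chunk(chunk, cache)   # hypothetical interface
        frames.append(pixels)
    return torch.cat(frames, dim=2)   # reassemble decoded frames along the time axis
```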
2. Data Curation, Scaling Laws, and Pretraining
Wan 2.2 achieves state-of-the-art results through empirical scaling and rigorous data curation:
- Scaling law: FVD follows a power law in model size $N$ and dataset size $D$, i.e., $\mathrm{FVD}(N, D) \propto N^{-\alpha} D^{-\beta}$ with empirically fitted exponents $\alpha, \beta > 0$ (a curve-fit sketch follows this list).
- Training set: a large-scale corpus of images and videos totaling ~1 trillion tokens.
- Data curation:
- OCR/text coverage, NSFW filter, de-duplication, blur and exposure assessment.
- Clustering and expert scoring for visual and motion quality.
- Tiers assigned for camera motion quality.
- Visual-text linking via synthesized captions (Chinese/English), OCR, and Qwen2-VL model generation.
- Dense captioning based on LLaVA-style ViT+Qwen LLM.
- Progressive pretraining stages (increasing resolution, adding video):
- Stage I: 256px images — T2I only.
- Stage II: 256px images, 192px videos, 16 FPS, 5s.
- Stage III: 480px images and videos.
- Stage IV: 720px images and videos (final pretrain).
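Returning to the scaling-law bullet above, a rough illustration of how the power-law exponents can be fitted by log-linear least squares; the (N, D, FVD) triples below are invented placeholders, not the paper's measurements:

```python
import numpy as np

# Hypothetical (model size N, dataset size D, measured FVD) triples -- illustrative only.
N = np.array([1.3e9, 5.0e9, 1.4e10])
D = np.array([2.0e11, 5.0e11, 1.0e12])
fvd = np.array([180.0, 120.0, 90.0])

# Fit log FVD = c - alpha*log N - beta*log D, equivalent to FVD ∝ N^-alpha * D^-beta.
A = np.stack([np.ones_like(N), -np.log(N), -np.log(D)], axis=1)
(c, alpha, beta), *_ = np.linalg.lstsq(A, np.log(fvd), rcond=None)
print(f"alpha={alpha:.3f}, beta={beta:.3f}")   # fitted power-law exponents
```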
3. Training Infrastructure and Efficiency
Wan 2.2 provides training and deployment infrastructure optimized for both large-scale clusters and consumer hardware:
- GPU clusters: A100/H100.
- Peak VRAM: Wan-1.3B requires 8.19 GB (batch size 1, mixed precision); naively storing Wan-14B's activations would require ∼8 TB, which is avoided through activation offload and gradient checkpointing (a sketch follows this list).
- Key hyperparameters:
- Sampling: 50 steps per video, unified via flow-matching scheme.
- Batch size: 1,536 (global, mixed image and video).
- Inference: optimization via 2D context parallelism (scalable to 128 GPUs), diffusion caching (1.62× speedup), FP8 GEMM and int8 FlashAttention (95% MFU), and memory-aware chunk processing. Consumer deployment uses the 1.3B model with FP8/int8 via TorchAO/TensorRT.
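A hedged sketch of the memory-saving recipe (activation/gradient checkpointing plus FSDP sharding) referenced in the VRAM bullet; `dit_model.blocks` is a placeholder attribute and the snippet assumes `torch.distributed` has already been initialized:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    """Recompute a transformer block's activations in the backward pass
    instead of storing them, trading extra compute for memory."""
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x, *args):
        return checkpoint(self.block, x, *args, use_reentrant=False)

def shard_dit(dit_model):
    """Wrap each DiT block with checkpointing, then shard parameters,
    gradients, and optimizer state across data-parallel ranks with FSDP."""
    dit_model.blocks = torch.nn.ModuleList(CheckpointedBlock(b) for b in dit_model.blocks)
    return FSDP(dit_model, use_orig_params=True)
```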
4. Evaluation Metrics and Empirical Results
Evaluation uses standard and custom metrics:
- FVD (Fréchet Video Distance): sample quality, as per Unterthiner et al. (2018).
- KID (Kernel Inception Distance): unbiased MMD in Inception space.
- CLIPSIM: text–video alignment via the cosine similarity of CLIP embeddings, averaged over sampled frames: $\mathrm{CLIPSIM} = \frac{1}{T}\sum_{t=1}^{T}\cos\!\big(E_{\text{text}},\,E_{\text{frame}_t}\big)$.
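A minimal CLIPSIM sketch using the Hugging Face CLIP model; the frame-sampling and averaging protocol here is an assumption and may differ from the benchmark's exact setup:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clipsim(frames, prompt, model_name="openai/clip-vit-base-patch32"):
    """Average cosine similarity between the prompt and each sampled video frame (PIL images)."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)                          # returns L2-normalized embeddings
    return (out.image_embeds @ out.text_embeds.T).mean().item()
```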
Benchmarking (Table 1, Table 2 in (Wan et al., 26 Mar 2025)):
| Model | Wan-Bench weighted score ↑ |
|---|---|
| Wan-14B | 0.724 |
| Sora (OpenAI) | 0.700 |
| CN-TopA | 0.693 |
| HunyuanVideo | 0.673 |
| Mochi | 0.639 |
VBench (public leaderboard):
| Model | Quality | Semantic | Total |
|---|---|---|---|
| Wan-14B | 86.67% | 84.44% | 86.22% |
| OpenAI Sora | 85.51% | 79.35% | 84.28% |
| Wan-1.3B | 84.92% | 80.10% | 83.96% |
Qualitative examples (Fig. 2, Fig. 17) demonstrate text rendering, cinematic styles, and high-motion sequences.
5. Downstream Tasks and Modularity
Wan 2.2 supports multiple downstream applications and is architected for modular extension:
- Image-to-Video (Wan-I2V): the first frame and a conditioning mask are concatenated in latent space and passed through the DiT (see the sketch after this list).
- Video Editing (VACE): Uses a "Video Condition Unit" with pixelwise concept decoupling and context adapters.
- Personalization: $K$ reference face latents with all-ones masks are prepended to the latent sequence, enabling robust video inpainting.
- Camera Motion Control: Camera extrinsics/intrinsics are encoded as Plücker coordinates and injected via PixelUnshuffle and adaptive-norm scaling.
- Real-Time Streaming (Streamer+LCM): Sliding window denoising with latent consistency distillation (8 FPS on 8×A100, 20 FPS on single 4090).
- Video-to-Audio: 1D-VAE on waveforms, DiT fusing CLIP frames & umT5 text to produce mel-scale reconstructions.
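A hedged sketch of the Wan-I2V conditioning path described above; tensor layout, the mask convention, and channel-wise concatenation are illustrative assumptions rather than the released implementation:

```python
import torch

def build_i2v_condition(vae_encode, first_frame, noisy_latents):
    """Concatenate the encoded first frame and a temporal mask with the noisy latents
    along the channel axis to form the DiT input for image-to-video generation.
    noisy_latents: [B, C, T', H', W']; first_frame: [B, 3, 1, H, W] (same resolution)."""
    b, c, t, h, w = noisy_latents.shape
    ref = vae_encode(first_frame)                                        # [B, C, 1, H', W']
    ref = torch.cat([ref, ref.new_zeros(b, c, t - 1, h, w)], dim=2)      # pad to full length T'
    mask = noisy_latents.new_zeros(b, 1, t, h, w)
    mask[:, :, 0] = 1.0                                                  # 1 marks the conditioned frame
    return torch.cat([noisy_latents, ref, mask], dim=1)                  # [B, 2C+1, T', H', W']
```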
6. Open-Source Release and Research Usage
Wan 2.2 is distributed with a comprehensive set of resources:
- Codebase and pretrained models for both 1.3B and 14B variants: https://github.com/Wan-Video/Wan2.1.
- APIs for text2video, image2video, video_edit, personal_video, camera_motion, and real_time_stream.
- Documentation recommends freezing the VAE and text encoder and tuning only the DiT/adapters for small-scale tasks (see the sketch after this list).
- Large-scale training is facilitated via FSDP and 2D context parallelism.
- Fine-tuning and deployment best practices include prompt rewriting with Qwen2.5 and quantization via TorchAO/TensorRT.
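A minimal sketch of that recommended fine-tuning setup; attribute names such as `pipeline.vae`, `pipeline.text_encoder`, and `pipeline.dit` are placeholders, not the repository's actual module names:

```python
import torch

def prepare_for_finetuning(pipeline, lr=1e-5):
    """Freeze the VAE and text encoder; train only the DiT (or adapter) weights."""
    for module in (pipeline.vae, pipeline.text_encoder):
        module.requires_grad_(False)   # no gradients for frozen components
        module.eval()                  # keep frozen modules in eval mode
    trainable = [p for p in pipeline.dit.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=0.01)
```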
Wan 2.2 thus provides a state-of-the-art, large-scale, modular, and efficient platform for video generation and editing, setting empirical benchmarks for both performance and scalability in the open video diffusion literature (Wan et al., 26 Mar 2025).