Wan2.1 Model: Video Diffusion Transformers

Updated 14 February 2026
  • The Wan2.1 model is a family of large-scale, transformer-based latent diffusion architectures designed for high-resolution video generation, editing, and super-resolution.
  • It employs multi-head spatio-temporal self-attention over patchified video representations, integrating frozen text encoders and external modules like ControlNet for improved detail and consistency.
  • Adaptation via LoRA modules and proxy-based step skipping enables efficient cinematic domain transfer, emotion-aligned synthesis, and accelerated sampling with minimal quality loss.

The Wan2.1 model family comprises large-scale, transformer-based latent diffusion architectures for high-resolution video generation, editing, and super-resolution. Developed as latent-space Video Diffusion Transformers (DiTs), Wan2.1 models are structured for scalable text-to-video, image-to-video, and video-to-video tasks, featuring multi-head attention over patchified spatio-temporal video representations, frozen text encoders, and adaptive temporal conditioning mechanisms. Model variants are denoted according to parameter count, with Wan2.1-1.3B (1.3 billion parameters) and Wan2.1-14B (14 billion parameters) serving as canonical instances. This platform underpins advanced tasks such as cinematic domain adaptation, affective video synthesis, and 4K video super-resolution, and acts as a reference target for acceleration frameworks such as GalaxyDiT.

1. Core Architecture and Model Variants

Wan2.1 implements a transformer-based DiT backbone comprising a sequence of $L$ blocks ($L \approx 40$–$50$), each with a hidden dimension $d \in \{2048, 4096+\}$ and $H \approx 16$–$32$ attention heads per layer (Song et al., 3 Dec 2025). The model operates on noised video latents $x_t \in \mathbb{R}^{TP^2 \times d}$ (where $T$ is the number of frames and $P^2$ the number of spatial patches per frame), which are patchified from the outputs of a causal 3D VAE encoder (Zhao et al., 25 Jul 2025). The backbone features:

  • Multi-head spatio-temporal self-attention and cross-attention to prompt encodings $c$ in every block.
  • Conditioning on continuous or learned diffusion timesteps $t$ via AdaLayerNorm (AdaLN).
  • Output at each step: a noise estimate $\epsilon_\theta(x_t, c)$, used to iteratively denoise $x_t$ following the DDPM/SDE paradigm (a minimal block sketch follows this list).
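
The following PyTorch sketch illustrates one such block. It is illustrative only, not the released Wan2.1 code: the class name, dimensions, and modulation layout are assumptions, but it shows the three ingredients named above (spatio-temporal self-attention, cross-attention to frozen text-encoder embeddings, and AdaLN timestep conditioning).

```python
import torch
import torch.nn as nn

class WanDiTBlock(nn.Module):
    """Minimal, illustrative DiT block: self-attention over flattened
    spatio-temporal tokens, cross-attention to text embeddings, and
    AdaLN modulation from the timestep embedding."""
    def __init__(self, d=2048, heads=16, d_text=4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(d, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(d, heads, kdim=d_text, vdim=d_text, batch_first=True)
        self.norm3 = nn.LayerNorm(d, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        # AdaLN: timestep embedding -> shift/scale/gate for the modulated sublayers
        self.adaln = nn.Sequential(nn.SiLU(), nn.Linear(d, 6 * d))

    def forward(self, x, text_emb, t_emb):
        # x: (B, T*P^2, d) noised latent tokens; text_emb: (B, K, d_text); t_emb: (B, d)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaln(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.self_attn(h, h, h, need_weights=False)[0]
        x = x + self.cross_attn(self.norm2(x), text_emb, text_emb, need_weights=False)[0]
        h = self.norm3(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x
```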

A frozen large text encoder (e.g., T5-based or Qwen-based) provides prompt embeddings. For video super-resolution and structure-preserving generation, Wan2.1 is often integrated with external modules such as ControlNet CPC (Consistency-Preserved ControlNet), which injects per-frame structural features at each block via residual scaling, with fusion formula:

$$\mathbf{X}_i^{\mathrm{main}} \leftarrow \mathbf{X}_i^{\mathrm{main}} + \gamma\,\mathbf{F}_{\lfloor i/r\rfloor}^{\mathrm{CPC}}$$

where $r$ is a block-index ratio and $\gamma$ is learned (Zhao et al., 25 Jul 2025).
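
A minimal sketch of this fusion rule, assuming the ControlNet branch exposes a list of per-block features (the names `cpc_feats`, `gamma`, and `r` are illustrative):

```python
import torch

def inject_cpc(x_main: torch.Tensor, cpc_feats, gamma: torch.Tensor, i: int, r: int) -> torch.Tensor:
    """Residual CPC fusion as in the formula above: main-branch block i adds
    ControlNet feature floor(i / r), scaled by the learned coefficient gamma."""
    return x_main + gamma * cpc_feats[i // r]
```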

Major model variants differ by scale:

| Variant | Parameters | Hidden Dim. | Use-case Example |
|----------------|------------|-------------|---------------------------------|
| Wan2.1-1.3B | 1.3B | ~2,048 | GalaxyDiT testbed, EmoVid |
| Wan2.1-14B | 14B | ≥4,096 | Cinematic I2V, GalaxyDiT, VSR |

2. Diffusion Process and Conditioning

The forward latent diffusion process employed in Wan2.1 follows:

$$q(z_t \mid z_0) = \mathcal{N}\left(z_t ; \alpha_t z_0, \sigma_t^2 I \right), \qquad z_t = \alpha_t z_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

The model learns to invert this process iteratively, using a transformer-based denoiser parameterized as:

$$\mathbf{v}_\theta(z_t, t) \approx \epsilon - z_0$$

The network’s reverse denoising step is formulated as:

$$\hat{z}_{t-1} = \frac{1}{\alpha_t}\left(z_t - \sigma_t\,\mathbf{v}_\theta\right) + \sigma_{t-1}\,\epsilon'$$

where $\epsilon'$ is fresh Gaussian noise drawn at each step.
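
The two formulas above translate directly into code. The following is a simplified sketch (function names are illustrative, and production samplers follow a full DDPM/SDE noise schedule rather than a single step):

```python
import torch

def forward_diffuse(z0: torch.Tensor, alpha_t: float, sigma_t: float):
    """Forward noising q(z_t | z_0) from the formula above (sketch)."""
    eps = torch.randn_like(z0)
    return alpha_t * z0 + sigma_t * eps, eps

def reverse_step(z_t: torch.Tensor, v_pred: torch.Tensor,
                 alpha_t: float, sigma_t: float, sigma_prev: float) -> torch.Tensor:
    """One reverse denoising step following the update rule in the text."""
    return (z_t - sigma_t * v_pred) / alpha_t + sigma_prev * torch.randn_like(z_t)
```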

Text-conditioned generation is achieved via cross-attention blocks, fusing embeddings $c \in \mathbb{R}^{K \times d_c}$ at every layer. In tasks requiring emotion or domain adaptation, this embedding can be augmented with explicit tokens (e.g., an emotion word appended to the caption for affective generation) or adapted via Low-Rank Adaptation (LoRA) modules for new domains (Qiu et al., 14 Nov 2025, Akarsu et al., 31 Oct 2025).
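
As a small illustration of the prompt-level conditioning described above (the exact token format is an assumption, not the papers' specification):

```python
def build_prompt(caption, emotion=None):
    """Append an explicit emotion word to the caption for affective generation
    (illustrative format)."""
    return f"{caption}, {emotion}" if emotion else caption

# Example: build_prompt("a child opens a gift by the window", "joyful")
```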

3. Adaptation and Fine-Tuning Methodologies

Wan2.1 supports specialized generative capabilities via lightweight adaptation:

  • Emotion Conditioning: EmoVid demonstrates that appending an explicit emotion word to the text prompt and applying LoRA modules (rank=32) to cross-attention projections enables emotion-aligned video synthesis without architectural modification (Qiu et al., 14 Nov 2025). No new classifier or adversarial losses are introduced; adaptation is driven by the standard diffusion reconstruction objective.
  • Cinematic Domain Transfer: For cinematic scene synthesis from limited data, LoRA adapters (rank=8, $\alpha=16$) are inserted into cross-attention layers in select ViT encoder and temporal transformer decoder blocks. Adaptation is performed over the visual style (stage 1), followed by inference-time temporal expansion of stylistic keyframes into video sequences (stage 2) (Akarsu et al., 31 Oct 2025). A minimal LoRA wrapper is sketched after this list.
  • Super-Resolution Integration: RealisVSR integrates Wan2.1 with ControlNet CPC, injecting per-frame LR features for spatio-temporal consistency, and introduces specialized (wavelet + HOG) high-frequency losses for sharper texture recovery (Zhao et al., 25 Jul 2025).
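
A minimal LoRA wrapper consistent with the adapters described above might look as follows. This is a sketch: the class name and initialization are illustrative, while the reported hyperparameters are rank=32 for EmoVid and rank=8, α=16 for cinematic transfer.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze a base projection and add a trainable low-rank update,
    as applied to cross-attention projections in the adapted models."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity update to the base model
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```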

Adaptation is typically driven by the standard diffusion loss:

$$\mathcal{L}_{\mathrm{diffusion}} = \mathbb{E}_{x_0, t, \epsilon} \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|_2^2$$

with optional additional objectives (e.g., temporal consistency loss, total high-frequency rectified loss for VSR).
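
A sketch of this training objective, assuming the denoiser is called as `model(z_t, t, cond)` and that `alphas`/`sigmas` are 1-D schedule tensors on the same device (names are illustrative):

```python
import torch
import torch.nn.functional as F

def diffusion_training_loss(model, z0, cond, alphas, sigmas):
    """Epsilon-prediction diffusion loss from the objective above (sketch)."""
    t = torch.randint(0, alphas.numel(), (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    shape = (-1,) + (1,) * (z0.dim() - 1)            # broadcast over latent dims
    z_t = alphas[t].view(shape) * z0 + sigmas[t].view(shape) * eps
    return F.mse_loss(model(z_t, t, cond), eps)
```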

4. Acceleration and Sampling Optimization

GalaxyDiT presents a training-free acceleration framework tailored to Wan2.1, structured around proxy-based step reuse and strict classifier-free guidance (CFG) alignment (Song et al., 3 Dec 2025). Standard CFG computes the guided noise estimate as:

$$\mathrm{EG}(x_t, c) = (1 + \gamma)\,\epsilon_\theta(x_t, c) - \gamma\,\epsilon_\theta(x_t, \varnothing)$$

Evaluating both the conditional and unconditional branches doubles the compute at each diffusion step.
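
A minimal sketch of the guided estimate (illustrative function signature):

```python
def guided_noise_estimate(model, z_t, t, cond, null_cond, gamma: float):
    """Classifier-free guidance as written above: one conditional and one
    unconditional forward pass per step, hence the doubled cost."""
    eps_c = model(z_t, t, cond)
    eps_u = model(z_t, t, null_cond)
    return (1 + gamma) * eps_c - gamma * eps_u
```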

  • Proxy-Based Step Skipping: GalaxyDiT computes a lightweight proxy (e.g., the cross-attention output of the first block) and accumulates a reuse metric per step. The reuse decision is driven by Spearman's $\rho$ correlation with an oracle importance metric, selecting the proxy yielding maximal rank correlation for each model variant (e.g., cross-attention-out for Wan2.1-1.3B with $\rho \approx 0.89$).
  • Guidance-Aligned Reuse: Both conditional and unconditional passes are co-reused if the reuse metric is below threshold and the step is past the initial 20% of iterations, preventing CFG drift artifacts; an illustrative reuse rule is sketched below.
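
An illustrative reuse rule consistent with this description (a sketch under assumed names and thresholds, not the GalaxyDiT implementation):

```python
import torch

def maybe_reuse(step, total_steps, proxy_now, proxy_prev, cached_eps, threshold):
    """Skip the full forward pass when the lightweight proxy has changed little
    and the step is past the first 20% of iterations. Both CFG branches share
    this decision to avoid guidance drift."""
    warmup_done = step > 0.2 * total_steps
    drift = (proxy_now - proxy_prev).abs().mean() / (proxy_prev.abs().mean() + 1e-8)
    if warmup_done and drift < threshold and cached_eps is not None:
        return cached_eps  # reuse cached conditional + unconditional outputs
    return None            # caller runs the full model and refreshes the cache
```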

This strategy yields up to $2.37\times$ sampling speedup (Wan2.1-14B) with sub-1% fidelity drops on VBench-2.0. PSNR gains surpass prior step-skipping approaches (TeaCache) by $5$–$12$ dB, with notable improvements also in LPIPS and SSIM.

5. Application Domains and Benchmarking

Wan2.1’s architecture proves foundational across multiple advanced video generation scenarios:

  • Emotion-Centric Video Generation: Fine-tuned Wan2.1 on the EmoVid dataset yields significant gains in emotion alignment (EA-8cls: +4–5% absolute), with FVD and CLIPScore improvements in both text-to-video (T2V) and image-to-video (I2V) tasks (Qiu et al., 14 Nov 2025). Appended emotion tokens in prompts are sufficient for alignment without auxiliary modules.
  • Cinematic Scene Synthesis: LoRA-based adaptation enables few-shot learning of historical and cinematic visual domains, preserving spatial style and temporal coherence, with nearly $2\times$ parallelized inference speedup and minimal perceptual quality loss (LPIPS delta < 0.002) (Akarsu et al., 31 Oct 2025).
  • 4K Video Super-Resolution: In RealisVSR, Wan2.1 with CPC, HR-Loss, and text conditioning achieves detail recovery on ultra-high-resolution 4K benchmarks, outperforming prior GAN- and diffusion-based VSR baselines in both quantitative (FVD, PSNR, LPIPS) and qualitative measures (Zhao et al., 25 Jul 2025).

6. Implementation, Training, and Evaluation Details

Key implementation strategies and empirical results for Wan2.1 deployments:

  • Optimization: AdamW optimizer with learning rates $1 \times 10^{-4}$ (Qiu et al., 14 Nov 2025) or $3 \times 10^{-5}$ (Akarsu et al., 31 Oct 2025), weight decay $\sim 0.01$, batch size $1$, $3$+ epochs with early stopping, bf16 precision, and activation checkpointing as standard practice.
  • Hardware: A single ($1\times$) Nvidia H20 or A100 (40–80 GB) is shown sufficient for LoRA-based adaptation within hours.
  • Regularization: Reliance on LoRA adaptation limits overfitting, with no adversarial, classifier, or emotion-specific regularizers in reported deployments.
  • Evaluation Metrics: Fréchet Video Distance (FVD), CLIP-SIM, LPIPS, SSIM, and domain-specific metrics such as EA-2cls/EA-8cls (emotion accuracy) and VBench-2.0 for prompt/video fidelity (Qiu et al., 14 Nov 2025, Akarsu et al., 31 Oct 2025, Song et al., 3 Dec 2025).
  • Software Ecosystem: PyTorch, DeepSpeed, DiffSynth Studio, OpenCLIP for rapid prototyping and deployment.
  • Pipeline Engineering: Inference parallelized via temporal sharding and Fully Sharded Data Parallelism (FSDP), with custom overlap blending at shard boundaries to ensure temporal continuity (Akarsu et al., 31 Oct 2025); a blending sketch follows this list.
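
A sketch of overlap blending at temporal shard boundaries, assuming adjacent shards of shape (T, C, H, W) share `overlap` frames that are linearly cross-faded (an assumed implementation, not the authors' code):

```python
import torch

def blend_shards(shards, overlap: int) -> torch.Tensor:
    """Concatenate temporally sharded clips, cross-fading the shared frames
    at each boundary to preserve temporal continuity."""
    out = shards[0]
    w = torch.linspace(0.0, 1.0, overlap, device=out.device, dtype=out.dtype).view(-1, 1, 1, 1)
    for nxt in shards[1:]:
        blended = (1 - w) * out[-overlap:] + w * nxt[:overlap]
        out = torch.cat([out[:-overlap], blended, nxt[overlap:]], dim=0)
    return out
```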

7. Summary of Benchmarks and Reported Results

Wan2.1 achieves state-of-the-art status as a backbone for a wide spectrum of video generative tasks.

| Task/Metric | Baseline | Wan2.1/Adapted Variant | Quantified Outcome | Reference |
|---|---|---|---|---|
| VBench-2.0 fidelity (%) | 58.36 (Wan2.1-14B base) | 57.64 (GalaxyDiT-fast, $2.37\times$) | $-0.72\%$ delta, $+12$ dB PSNR, $+27\%$ SSIM | (Song et al., 3 Dec 2025) |
| Emotion (EA-8cls, T2V) | 44.16 (before) | 48.33 (after EmoVid fine-tune) | $+4.2\%$ absolute | (Qiu et al., 14 Nov 2025) |
| I2V LPIPS | 0.0325 (base) | 0.0324 (after EmoVid fine-tune) | Minimal impact on perceptual similarity | (Qiu et al., 14 Nov 2025) |
| Cinematic LPIPS | 0.142 | <0.002 degradation after sharded inference | $+1.2$ mean expert-user rating (p < 0.05) | (Akarsu et al., 31 Oct 2025) |

The Wan2.1 model family demonstrates high scalability, modularity for conditional adaptation, and competitive efficiency when paired with modern acceleration techniques. Its transformer-based block design and latent video diffusion framework form a technical foundation for high-fidelity, flexible, and rapid video generative modeling across diverse domains.
