Wan-VAE: Spatio-Temporal VAE Framework
- Spatio-temporal Variational Autoencoders, including Wan-VAE, are generative models designed to capture both spatial and temporal dependencies in video and image time series.
- Wan-VAE uses a U-Net architecture with 3D causal convolutions and RMSNorm for efficient latent compression and robust unsupervised representation learning.
- The model achieves high PSNR and faster inference, making it effective for video compression, representation learning, and integration with diffusion transformer frameworks.
Spatio-temporal Variational Autoencoders (VAEs) encompass architectures designed to model correlated high-dimensional data with both spatial and temporal structure, particularly relevant for video and image time series. This entry highlights two major frameworks at the forefront of the field: the tensor-variate Gaussian Process Prior VAE (tvGP-VAE) (Campbell et al., 2020), and the application within the Wan video foundation model (Wan-VAE) (Wan et al., 26 Mar 2025). Both formulations enable deep generative modeling, unsupervised representation learning, and efficient video compression by explicitly learning dependencies across space and time under principled probabilistic formalisms.
1. Generative Modeling Frameworks
tvGP-VAE formulates the generative process for spatio-temporal data as a latent variable model with a structured prior distribution on a lower-dimensional tensor $\mathcal{Z} \in \mathbb{R}^{k_w \times k_h \times k_t \times C}$, where $k_w < W$, $k_h < H$, $k_t < T$ are the latent sizes along the width, height, and time modes of the input, and $C$ denotes the number of latent channels (Campbell et al., 2020). The prior adopts a tensor-normal (tensor-GP) distribution:

$$p(\mathcal{Z}) = \prod_{c=1}^{C} \mathcal{TN}\!\left(\mathcal{Z}_c;\ \mathbf{0},\ \Sigma^{(w)}_c,\ \Sigma^{(h)}_c,\ \Sigma^{(t)}_c\right),$$

where each $\Sigma^{(m)}_c$ incorporates Kronecker-factorized covariances parameterized via kernel functions $k_m(\cdot,\cdot;\theta_m)$ with hyperparameters $\theta_m$.
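To make the prior concrete, the sketch below draws one latent channel from a zero-mean tensor-normal with Kronecker-factored RBF covariances over its three modes. It is a minimal illustration assuming integer index positions and arbitrary kernel settings; the names and sizes are not taken from the tvGP-VAE code.

```python
import numpy as np

def rbf_cov(n, sigma2=1.0, lengthscale=2.0):
    """Exponentiated-quadratic covariance over integer index positions 0..n-1."""
    idx = np.arange(n, dtype=float)[:, None]
    return sigma2 * np.exp(-(idx - idx.T) ** 2 / (2.0 * lengthscale ** 2))

def sample_tensor_normal_prior(w, h, t, jitter=1e-6, seed=None):
    """Draw Z_c ~ TN(0, K_w, K_h, K_t), i.e. vec(Z_c) ~ N(0, K_w ⊗ K_h ⊗ K_t)."""
    rng = np.random.default_rng(seed)
    L_w, L_h, L_t = (np.linalg.cholesky(rbf_cov(n) + jitter * np.eye(n))
                     for n in (w, h, t))
    eps = rng.standard_normal((w, h, t))            # i.i.d. standard-normal noise
    # Multiplying each mode by its Cholesky factor correlates that axis.
    z = np.einsum('ia,abc->ibc', L_w, eps)
    z = np.einsum('jb,ibc->ijc', L_h, z)
    z = np.einsum('kc,ijc->ijk', L_t, z)
    return z

z_c = sample_tensor_normal_prior(w=8, h=8, t=16)    # one 8x8x16 latent channel
```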
Wan-VAE operates on video tensors $X \in \mathbb{R}^{(1+T) \times H \times W \times 3}$, compressing them to latent codes $z \in \mathbb{R}^{(1+T/4) \times (H/8) \times (W/8) \times C}$ using an encoder and reconstructing with a symmetric decoder, yielding spatio-temporal compression by factors of 4 (time) and 8 (space) (Wan et al., 26 Mar 2025). The architecture realizes a U-Net with 3D causal convolutions and residual connections, ensuring strict temporal causality and spatial fidelity.
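As a quick sanity check on these compression factors, a hypothetical helper (the latent channel count is an assumed placeholder) maps an input clip shape to the implied latent grid:

```python
def wan_vae_latent_shape(frames, height, width, latent_channels=16):
    """Latent grid implied by Wan-VAE's stated compression: x4 in time, x8 in space,
    with the leading frame downsampled only spatially (1 + T frames -> 1 + T/4 codes).
    `latent_channels` is an illustrative assumption, not a value from the report."""
    t = frames - 1                      # frames beyond the leading still frame
    assert t % 4 == 0 and height % 8 == 0 and width % 8 == 0
    return (1 + t // 4, height // 8, width // 8, latent_channels)

print(wan_vae_latent_shape(frames=81, height=480, width=832))   # (21, 60, 104, 16)
```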
2. Encoder and Posterior Distribution
In tvGP-VAE, amortized variational inference employs an encoder network yielding parameters for the approximate posterior $q_\phi(\mathcal{Z} \mid \mathcal{X})$, where each latent channel follows a tensor-normal distribution:

$$q_\phi(\mathcal{Z} \mid \mathcal{X}) = \prod_{c=1}^{C} \mathcal{TN}\!\left(\mathcal{Z}_c;\ M_c,\ \hat\Sigma^{(w)}_c,\ \hat\Sigma^{(h)}_c,\ \hat\Sigma^{(t)}_c\right),$$

with the mean tensor $M_c$ produced in factorized form and the covariances parameterized via sparse bidiagonal precision matrices for computational efficiency.
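The bidiagonal parameterization is what keeps sampling and KL evaluation cheap: if the posterior precision along one mode is $\Lambda = BB^{\top}$ with $B$ lower-bidiagonal, a sample requires only one $O(n)$ triangular solve. The snippet below illustrates this for a single mode under that assumption; the per-mode assembly into the full tensor-normal posterior and the encoder heads producing the band entries are omitted.

```python
import numpy as np
from scipy.linalg import solve_triangular

def sample_bidiag_precision(mu, diag, offdiag, seed=None):
    """Sample x ~ N(mu, Lambda^{-1}) where Lambda = B @ B.T and B is lower-bidiagonal
    (main diagonal `diag`, first sub-diagonal `offdiag`): x = mu + B^{-T} eps."""
    rng = np.random.default_rng(seed)
    B = np.diag(diag) + np.diag(offdiag, k=-1)        # dense here only for clarity
    eps = rng.standard_normal(len(mu))
    return mu + solve_triangular(B.T, eps, lower=False)

x = sample_bidiag_precision(mu=np.zeros(5),
                            diag=np.full(5, 2.0),
                            offdiag=np.full(4, -0.5))
```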
Wan-VAE encodes video in “chunks” of four frames using 3D causal convolutional downsampling blocks and RMSNorm (replacing GroupNorm) to preserve temporal causality and support memory-efficient inference. The encoder outputs a sequence of latent tensors, capturing coarse-grained motion and high-level appearance within each chunk. The first frame is downsampled only spatially, to better accommodate still-image statistics.
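The two ingredients named above can be sketched in PyTorch as follows: a temporally causal 3D convolution that places all temporal padding on the past side, and an RMS normalization over channels that avoids clip-level statistics. These are illustrative modules under those assumptions, not Wan's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal in time: inputs are padded only with past frames,
    so the output at time t never sees frames later than t; spatial padding stays symmetric."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3), stride=(1, 1, 1)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1                                  # all temporal padding is "past"
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)   # (W_left, W_right, H_top, H_bottom)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, stride=stride)

    def forward(self, x):                                       # x: (B, C, T, H, W)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))       # last pair pads the time axis
        return self.conv(x)

class ChannelRMSNorm(nn.Module):
    """RMSNorm over the channel axis of a (B, C, T, H, W) tensor. Each spatio-temporal
    position is normalized independently, so chunked inference yields the same
    activations as a full-clip pass (GroupNorm would mix statistics across time)."""
    def __init__(self, ch, eps=1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(1, ch, 1, 1, 1))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=1, keepdim=True).add(self.eps).sqrt()
        return self.gain * x / rms
```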
3. Covariance Structures and Kernel Choices
A distinguishing feature of tvGP-VAE is its explicit modeling of multi-mode correlation via Kronecker-separable covariances in both prior and posterior. For latent channel $\mathcal{Z}_c$, the covariance over $\operatorname{vec}(\mathcal{Z}_c)$ factorizes as

$$\Sigma_c = \Sigma^{(w)}_c \otimes \Sigma^{(h)}_c \otimes \Sigma^{(t)}_c .$$

The covariance entries adopt the exponentiated-quadratic (RBF) kernel:

$$k_m(x, x') = \sigma_m^2 \exp\!\left(-\frac{(x - x')^2}{2\,\ell_m^2}\right),$$

where $\sigma_m^2$ and $\ell_m$ respectively control the marginal variance and correlation length-scale of mode $m \in \{w, h, t\}$.
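The practical payoff of Kronecker separability is economy: a full covariance over a vectorized latent channel scales quadratically in $w\,h\,t$, whereas the three per-mode factors scale quadratically only in each mode size. A back-of-the-envelope comparison with arbitrarily chosen latent sizes:

```python
import numpy as np

w, h, t = 8, 8, 16                                            # illustrative latent-mode sizes
Sigma_w, Sigma_h, Sigma_t = (np.eye(n) for n in (w, h, t))    # stand-ins for kernel matrices
Sigma_full = np.kron(np.kron(Sigma_w, Sigma_h), Sigma_t)      # covariance over vec(Z_c)

print(Sigma_full.shape)             # (1024, 1024)  -> 1,048,576 entries
print(w**2 + h**2 + t**2)           # 384 entries across the three Kronecker factors
```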
Wan-VAE, while not employing explicit GP priors, leverages stacked causal convolutions in time and space to implicitly learn spatio-temporal dependencies through deep network weights. The architecture ensures that each latent code represents temporally coherent features without future information leakage.
4. Optimization Objectives and Training Protocols
tvGP-VAE optimizes the standard ELBO:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(\mathcal{Z} \mid \mathcal{X})}\!\left[\log p_\theta(\mathcal{X} \mid \mathcal{Z})\right] - \mathrm{KL}\!\left(q_\phi(\mathcal{Z} \mid \mathcal{X}) \,\|\, p(\mathcal{Z})\right),$$

with closed-form KL divergence terms derived for tensor-normal distributions via Kronecker factors and the bidiagonal posterior precision. Adam (batch size 50) with early stopping on validation ELBO is used for learning (Campbell et al., 2020).
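The closed-form KL rests on standard Kronecker identities that reduce trace and log-determinant computations over the full covariance to per-mode quantities; a sketch of the identities involved (the paper's derivation additionally exploits the bidiagonal posterior precision):

$$\operatorname{tr}\!\left(\Sigma_0^{-1}\Sigma_1\right) = \prod_{m \in \{w,h,t\}} \operatorname{tr}\!\left((\Sigma_0^{(m)})^{-1}\Sigma_1^{(m)}\right), \qquad \log\det\Big(\bigotimes_m \Sigma^{(m)}\Big) = \sum_m \frac{d}{d_m}\,\log\det \Sigma^{(m)}, \quad d = \prod_m d_m .$$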
Wan-VAE’s learning consists of a three-stage process: training a 2D image VAE, inflating it to 3D, and fine-tuning with a GAN loss to balance spatial fidelity and temporal smoothness. The VAE loss combines reconstruction, KL, and adversarial terms:

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}},$$

with the weighting coefficients set as reported in (Wan et al., 26 Mar 2025). The causal convolutions enforce temporal continuity, while the GAN loss in stage 3 enhances high-frequency details.
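A compact sketch of such a stage-3 objective, assuming an L1 reconstruction term, a diagonal-Gaussian KL, and a non-saturating generator loss; the term composition and weights are placeholders rather than the report's settings:

```python
import torch
import torch.nn.functional as F

def vae_gan_loss(x, x_hat, mu, logvar, d_fake_logits, w_kl=1e-6, w_gan=0.1):
    """Reconstruction + KL + adversarial objective of the kind used to fine-tune a
    video VAE against a GAN critic; weight values are illustrative placeholders."""
    rec = F.l1_loss(x_hat, x)                                         # spatial fidelity
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())     # latent regularizer
    gan = F.softplus(-d_fake_logits).mean()                           # non-saturating generator term
    return rec + w_kl * kl + w_gan * gan
```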
5. Spatio-Temporal Inference, Causality, and Implementation
tvGP-VAE utilizes amortized inference and the reparameterization trick for latent tensor sampling, drawing independent standard-Gaussian noise and propagating through sparse factor matrices. Both encoder and decoder are built from spatial and temporal convolutional layers, with posterior parameters extracted via small fully-connected heads. All kernel hyperparameters and network weights undergo joint gradient-based optimization.
Wan-VAE adopts chunked feature-cache inference, enabling efficient processing of very long videos by sequentially handling “chunks” of four frames; continuity across chunk boundaries is maintained by caching intermediate features, with RMSNorm keeping per-chunk statistics consistent. Data augmentation (random crops, flips, color jitter), extensive ablation (e.g., Table 4.2: T2I FID increases from 40.55 to 41.16 when replacing Wan-VAE with a diffusion-VAE), and empirical benchmarks reinforce model robustness and speed. Wan-VAE achieves a PSNR of 33 dB on 25 fps video while operating 2.5× faster than prior state-of-the-art video VAEs (Wan et al., 26 Mar 2025).
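The caching idea can be illustrated on a single temporally causal convolution: instead of zero-padding every chunk in time, keep the last few raw input frames of the previous chunk and prepend them, so chunk-wise outputs agree with a full-clip pass. Wan-VAE caches intermediate features at every layer; the single-layer PyTorch sketch below, with illustrative names, only shows the mechanism.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def streaming_causal_conv(conv3d, clip_chunks, space_pad):
    """Chunked inference for a stride-1, temporally causal nn.Conv3d (no built-in padding).
    `space_pad` is the symmetric spatial padding (W_left, W_right, H_top, H_bottom)."""
    kt = conv3d.kernel_size[0]
    assert kt > 1, "a cache is only needed when the temporal kernel looks into the past"
    cache, outputs = None, []
    for x in clip_chunks:                                        # each x: (B, C, t_chunk, H, W)
        if cache is None:                                        # first chunk: zero-pad the past
            inp = F.pad(x, space_pad + (kt - 1, 0))
        else:                                                    # later chunks: use real past frames
            inp = F.pad(torch.cat([cache, x], dim=2), space_pad + (0, 0))
        outputs.append(conv3d(inp))
        cache = x[:, :, -(kt - 1):]                              # carry the last kt-1 input frames
    return torch.cat(outputs, dim=2)
```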
6. Integration with Diffusion Transformers and Downstream Applications
Within the Wan ecosystem, the latent code extracted by Wan-VAE is “patchified” into sequence tokens via a $(1,2,2)$ 3D convolution, flattened into a 1-D token sequence of length $(1 + T/4)\cdot(H/16)\cdot(W/16)$, and fed to a video-DiT diffusion backbone. Flow-matching diffusion is employed, interpolating between Gaussian noise $x_0$ and the clean latent $x_1$:

$$x_t = t\,x_1 + (1 - t)\,x_0, \qquad v_t = x_1 - x_0, \qquad x_0 \sim \mathcal{N}(0, I),$$

and the corresponding DiT block minimizes:

$$\mathcal{L} = \mathbb{E}\!\left[\big\|\, u_\theta(x_t, t, c_{\text{text}}) - v_t \,\big\|^2\right],$$

where $c_{\text{text}}$ denotes umT5 text embeddings. This structure underpins downstream tasks such as image-to-video, instruction-guided video editing, and personalized video generation, spanning eight task domains at multiple model scales (1.3B and 14B parameters). All models and source code are open-source (Wan et al., 26 Mar 2025).
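A minimal training-step sketch of the flow-matching objective written above; `dit` stands in for the video-DiT backbone and all argument names are illustrative:

```python
import torch

def flow_matching_loss(dit, z1, t, text_emb):
    """Interpolate Gaussian noise toward the clean latent z1 along a straight path
    and regress the constant velocity (z1 - z0) with the DiT network."""
    z0 = torch.randn_like(z1)                        # noise sample x_0 ~ N(0, I)
    t_ = t.view(-1, *([1] * (z1.dim() - 1)))         # broadcast timestep over latent dims
    x_t = t_ * z1 + (1.0 - t_) * z0                  # x_t = t*x_1 + (1-t)*x_0
    v_target = z1 - z0                               # ground-truth velocity v_t
    v_pred = dit(x_t, t, text_emb)
    return torch.mean((v_pred - v_target) ** 2)
```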
7. Comparative Evaluation and Impact
Both tvGP-VAE and Wan-VAE represent significant advances in spatio-temporal generative modeling. tvGP-VAE demonstrates the importance of latent correlation structures for reconstruction quality in high-dimensional sequence data (Campbell et al., 2020). Wan-VAE, as a lightweight (127M parameters) front-end for large-scale diffusion transformers, delivers high PSNR (33 dB) and reduced inference latency (2.5× speed-up), validated by superior reconstruction metrics and ablation studies (Wan et al., 26 Mar 2025). A plausible implication is that architectural innovations such as causal convolutions, Kronecker-separable kernel parameterizations, and chunked caching are critical for scalable, efficient, and high-fidelity video generation.
Summary Table: Core Attributes
| Model | Latent Structure | Spatio-temporal Compression | Key Technical Innovations |
|---|---|---|---|
| tvGP-VAE | Tensor-normal GP prior, Kronecker cov | Explicit kernel modeling of width, height, time | Closed-form KL; sparse bidiagonal precision |
| Wan-VAE | Latent z via 3D causal conv U-Net | Temporal (×4), Spatial (×8) | RMSNorm; chunked feature-cache |
Spatio-temporal VAEs are central in advancing unsupervised video representation learning and generation, enabling accurate modeling of dependencies, resource-efficient deployment, and robust integration with diffusion transformer frameworks for state-of-the-art applications in image and video synthesis.