Wan-VAE: Spatio-Temporal VAE Framework

Updated 19 November 2025
  • Spatio-temporal Variational Autoencoders, including Wan-VAE, are generative models designed to capture both spatial and temporal dependencies in video and image time series.
  • Wan-VAE uses a U-Net architecture with 3D causal convolutions and RMSNorm for efficient latent compression and robust unsupervised representation learning.
  • The model achieves high reconstruction quality (PSNR of approximately 33 dB) and roughly 2.5× faster inference than prior state-of-the-art video VAEs, making it effective for video compression, representation learning, and integration with diffusion transformer frameworks.

Spatio-temporal Variational Autoencoders (VAEs) encompass architectures designed to model correlated high-dimensional data with both spatial and temporal structure, particularly relevant for video and image time series. This entry highlights two major frameworks at the forefront of the field: the tensor-variate Gaussian Process Prior VAE (tvGP-VAE) (Campbell et al., 2020), and the application within the Wan video foundation model (Wan-VAE) (Wan et al., 26 Mar 2025). Both formulations enable deep generative modeling, unsupervised representation learning, and efficient video compression by explicitly learning dependencies across space and time under principled probabilistic formalisms.

1. Generative Modeling Frameworks

tvGP-VAE formulates the generative process for spatio-temporal data $X \in \mathbb{R}^{C \times W \times H \times T}$ as a latent variable model with structured prior distributions on a lower-dimensional tensor $Z \in \mathbb{R}^{K \times W' \times H' \times T'}$, where $K \ll C$, $W' < W$, $H' < H$, and $T' < T$ (Campbell et al., 2020). The prior $p_\theta(Z)$ adopts a tensor-normal (tensor-GP) distribution:

$$p_\theta(Z) = \prod_{k=1}^K p_\theta(Z_k), \quad Z_k \in \mathbb{R}^{W' \times H' \times T'}$$

where each $p_\theta(Z_k) = \mathcal{TN}(Z_k; 0, \Omega_k^{(1)}, \Omega_k^{(2)}, \Omega_k^{(3)})$ incorporates Kronecker-factorized covariances $\Omega_k^{(m)}$ parameterized via kernel functions $\kappa_k^{(m)}$ with hyperparameters $\lambda_k^{(m)}$.

Wan-VAE operates on video tensors $x \in \mathbb{R}^{(1+T) \times H \times W \times 3}$, compressing them to latent codes $z \in \mathbb{R}^{C \times (1+T/4) \times (H/8) \times (W/8)}$ using an encoder and reconstructing with a symmetric decoder, yielding spatio-temporal compression by factors of 4 (time) and 8 (space) (Wan et al., 26 Mar 2025). The architecture realizes a U-Net with 3D causal convolutions and residual connections, ensuring strict temporal causality and spatial fidelity.
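
For concreteness, a minimal sketch of the shape bookkeeping implied by these compression factors; the function name and the latent channel count are illustrative placeholders, not values fixed by the paper:

```python
def wan_vae_latent_shape(T: int, H: int, W: int, C_latent: int = 16):
    """Latent shape for a (1+T) x H x W x 3 video under Wan-VAE's
    4x temporal / 8x spatial compression. C_latent is illustrative."""
    assert T % 4 == 0 and H % 8 == 0 and W % 8 == 0
    return (C_latent, 1 + T // 4, H // 8, W // 8)

# e.g. a 1 + 80 frame, 480 x 832 clip -> (16, 21, 60, 104) latent tensor
print(wan_vae_latent_shape(T=80, H=480, W=832))
```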

2. Encoder and Posterior Distribution

In tvGP-VAE, amortized variational inference employs an encoder network $f_\phi(X)$ yielding parameters for the approximate posterior $q_\phi(Z \mid X) = \prod_{k=1}^K q_\phi(Z_k \mid X)$, where each $q_\phi(Z_k \mid X)$ is a tensor-normal distribution:

$$q_\phi(Z_k \mid X) = \mathcal{TN}\big(Z_k; M_k(X), \Sigma_k^{(1)}(X), \Sigma_k^{(2)}(X), \Sigma_k^{(3)}(X)\big)$$

with mean tensor $M_k(X)$ factorized as $m_k^{(1)}(X) \circ m_k^{(2)}(X) \circ m_k^{(3)}(X)$ and covariances $\Sigma_k^{(m)}(X)$ parameterized via sparse bidiagonal precision matrices for computational efficiency.
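
As an illustration of the rank-1 factorized mean, the following NumPy sketch assembles $M_k$ from three per-mode vectors via an outer product; the mode sizes and names are hypothetical:

```python
import numpy as np

def factorized_mean(m1, m2, m3):
    """Rank-1 mean tensor M_k = m1 o m2 o m3 (outer product over the
    width', height', time' modes), as used for the tvGP-VAE posterior mean."""
    return np.einsum('w,h,t->wht', m1, m2, m3)

Wp, Hp, Tp = 8, 8, 5                      # latent mode sizes (illustrative)
M_k = factorized_mean(np.random.randn(Wp),
                      np.random.randn(Hp),
                      np.random.randn(Tp))
print(M_k.shape)                          # (8, 8, 5)
```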

Wan-VAE encodes video in “chunks” of four frames using 3D causal convolutional downsampling blocks and RMSNorm (replacing GroupNorm) to preserve temporal causality and support memory-efficient inference. The encoder outputs a sequence of latent tensors, capturing coarse-grained motion and high-level appearance within each chunk. The first frame is downsampled only spatially to better accommodate still-image statistics.
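
A minimal PyTorch sketch of a temporally causal 3D convolution, the basic building block described above, which pads only on the past side of the time axis; the class and hyperparameters are illustrative rather than the actual Wan-VAE implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that pads only past frames in time, so the output at
    frame t never depends on frames > t (spatial padding stays symmetric)."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3), stride=(1, 1, 1)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1                      # causal: all padding in the past
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, stride=stride)

    def forward(self, x):                           # x: (B, C, T, H, W)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))   # pad order: W, H, T
        return self.conv(x)

video = torch.randn(1, 3, 5, 64, 64)                # 1 + 4 frames
print(CausalConv3d(3, 16)(video).shape)             # torch.Size([1, 16, 5, 64, 64])
```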

3. Covariance Structures and Kernel Choices

A distinguishing feature of tvGP-VAE is its explicit modeling of multi-mode correlation via Kronecker-separable covariances in both prior and posterior. For latent $Z_k$:

$$\operatorname{Cov}(\operatorname{vec}(Z_k)) = \Sigma_k^{(3)} \otimes \Sigma_k^{(2)} \otimes \Sigma_k^{(1)} \ \text{(posterior)}, \quad \Omega_k^{(3)} \otimes \Omega_k^{(2)} \otimes \Omega_k^{(1)} \ \text{(prior)}$$

The covariance entries adopt the exponentiated-quadratic (RBF) kernel:

$$\kappa_k^{(m)}(i, i') = \big(\sigma_k^{(m)}\big)^2 \exp\left[-\frac{(i - i')^2}{2\,[\ell_k^{(m)}]^2}\right]$$

where $\sigma_k^{(m)}$ and $\ell_k^{(m)}$ respectively control the marginal variance and the correlation length-scale.
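
The per-mode covariance matrices can be assembled directly from this kernel and combined into the Kronecker-separable covariance above; the following NumPy sketch uses placeholder grid sizes and hyperparameters:

```python
import numpy as np

def rbf_kernel_matrix(n, sigma=1.0, lengthscale=2.0):
    """Exponentiated-quadratic kernel k(i, i') evaluated on an integer grid 0..n-1."""
    idx = np.arange(n)
    d2 = (idx[:, None] - idx[None, :]) ** 2
    return sigma**2 * np.exp(-d2 / (2.0 * lengthscale**2))

# One covariance per latent mode: width', height', time'
omega_w = rbf_kernel_matrix(8)
omega_h = rbf_kernel_matrix(8)
omega_t = rbf_kernel_matrix(5, lengthscale=3.0)

# Kronecker-separable covariance of vec(Z_k): Omega^(3) ⊗ Omega^(2) ⊗ Omega^(1)
cov_vec_zk = np.kron(np.kron(omega_t, omega_h), omega_w)
print(cov_vec_zk.shape)   # (320, 320) for an 8 x 8 x 5 latent
```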

Wan-VAE, while not employing explicit GP priors, leverages stacked causal convolutions in time and space to implicitly learn spatio-temporal dependencies through deep network weights. The architecture ensures that each latent code represents temporally coherent features without future information leakage.

4. Optimization Objectives and Training Protocols

tvGP-VAE optimizes the standard ELBO:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(Z \mid X)}[\log p_\theta(X \mid Z)] - \mathrm{KL}\big(q_\phi(Z \mid X)\,\Vert\,p(Z)\big)$$

with closed-form KL divergence terms derived for tensor-normal distributions via Kronecker factors and bidiagonal posterior precision. Adam (learning rate $1 \times 10^{-3}$, batch size 50) with early stopping on validation ELBO is used for learning (Campbell et al., 2020).
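
As a simplified illustration of the objective, the sketch below uses a Gaussian reconstruction term and the familiar diagonal-Gaussian KL in place of the paper's closed-form tensor-normal KL over Kronecker factors:

```python
import torch

def negative_elbo(x, x_recon, mu, logvar):
    """Negative ELBO with a Gaussian reconstruction term and the standard
    closed-form KL to a unit Gaussian. This is a simplified stand-in: the
    tvGP-VAE instead uses tensor-normal KL terms over Kronecker factors."""
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # settings from the paper
```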

Wan-VAE’s learning consists of a three-stage process: training a 2D image VAE, inflating to 3D, and fine-tuning with a GAN loss to balance spatial fidelity and temporal smoothness. The VAE loss is:

$$\mathcal{L}_{\rm VAE} = \lambda_{\ell_1}\,\mathbb{E}_{q_\phi(z|x)}[\|x - \hat{x}\|_1] + \lambda_{\rm perc}\,\mathbb{E}_{q_\phi(z|x)}[\mathrm{LPIPS}(x, \hat{x})] + \lambda_{\rm KL}\,\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$

with $\lambda_{\ell_1} = 3$, $\lambda_{\rm perc} = 3$, $\lambda_{\rm KL} = 3 \times 10^{-6}$ (Wan et al., 26 Mar 2025). The causal convolutions enforce continuity, while the GAN loss in stage 3 enhances high-frequency details.
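
A hedged sketch of how these three terms could be combined in practice; it assumes the community `lpips` package for the perceptual term and a diagonal-Gaussian KL, which may differ in detail from the paper's implementation:

```python
import torch
import lpips  # pip install lpips; VGG-based perceptual distance

perceptual = lpips.LPIPS(net='vgg')
LAMBDA_L1, LAMBDA_PERC, LAMBDA_KL = 3.0, 3.0, 3e-6   # weights reported in the paper

def wan_vae_loss(x, x_hat, mu, logvar):
    """L1 + LPIPS + KL objective used before the stage-3 GAN fine-tuning.
    x, x_hat: (B, T, 3, H, W) videos in [-1, 1] (shape convention assumed here)."""
    l1 = (x - x_hat).abs().mean()
    # LPIPS expects (N, 3, H, W); frames are folded into the batch dimension
    perc = perceptual(x.flatten(0, 1), x_hat.flatten(0, 1)).mean()
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return LAMBDA_L1 * l1 + LAMBDA_PERC * perc + LAMBDA_KL * kl
```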

5. Spatio-Temporal Inference, Causality, and Implementation

tvGP-VAE uses amortized inference and the reparameterization trick for latent tensor sampling, drawing independent standard-Gaussian noise and propagating it through sparse factor matrices. Both encoder and decoder are built from spatial and temporal convolutional layers, with posterior parameters extracted via small fully-connected heads. All kernel hyperparameters and network weights undergo joint gradient-based optimization.
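
The mode-wise reparameterization can be sketched as follows in NumPy, with dense Cholesky factors standing in for the sparse bidiagonal parameterization described above; all names and sizes are illustrative:

```python
import numpy as np

def sample_tensor_normal(M, sigma_w, sigma_h, sigma_t, rng=np.random):
    """Draw Z_k ~ TN(M, Sigma_w, Sigma_h, Sigma_t) by propagating standard
    Gaussian noise through per-mode Cholesky factors (reparameterization trick)."""
    Lw, Lh, Lt = (np.linalg.cholesky(S) for S in (sigma_w, sigma_h, sigma_t))
    eps = rng.standard_normal(M.shape)              # noise tensor of shape (W', H', T')
    # Multiply each mode of eps by its factor: Z = M + eps x1 Lw x2 Lh x3 Lt
    z = np.einsum('ab,bht->aht', Lw, eps)
    z = np.einsum('cd,wdt->wct', Lh, z)
    z = np.einsum('ef,whf->whe', Lt, z)
    return M + z

Wp, Hp, Tp = 4, 4, 3
Z = sample_tensor_normal(np.zeros((Wp, Hp, Tp)), np.eye(Wp), np.eye(Hp), np.eye(Tp))
print(Z.shape)   # (4, 4, 3)
```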

Wan-VAE adopts chunked feature-cache inference, enabling efficient processing of very long videos by sequentially handling “chunks” of four frames, with continuity across chunk boundaries maintained by the cached causal-convolution features. Data augmentation (random crops, flips, color jitter), extensive ablation (e.g., Table 4.2: T2I FID increases from 40.55 to 41.16 when Wan-VAE is replaced with a diffusion-VAE), and empirical benchmarks reinforce model robustness and speed. Wan-VAE achieves a PSNR of approximately 33 dB at $720 \times 720$ px (25 fps) while operating $2.5\times$ faster than the prior state of the art (Wan et al., 26 Mar 2025).
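
A schematic sketch of the chunked inference loop; the encoder interface and cache handling here are simplifications of the actual feature-cache mechanism, not the released implementation:

```python
import torch

@torch.no_grad()
def encode_long_video(encoder, video, chunk=4):
    """Encode a (B, C, 1+T, H, W) video chunk-by-chunk. `encoder` is assumed
    to take (frames, cache) and return (latents, cache); this interface is
    illustrative of how boundary features are carried between chunks."""
    first, rest = video[:, :, :1], video[:, :, 1:]
    cache = None
    latents, cache = encoder(first, cache)          # first frame: spatial-only path
    outs = [latents]
    for start in range(0, rest.shape[2], chunk):
        frames = rest[:, :, start:start + chunk]
        latents, cache = encoder(frames, cache)     # reuse cached boundary features
        outs.append(latents)
    return torch.cat(outs, dim=2)                   # concatenate along time
```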

6. Integration with Diffusion Transformers and Downstream Applications

Within the Wan ecosystem, the latent code $z$ extracted by Wan-VAE is “patchified” into sequence tokens via a $(1, 2, 2)$ 3D convolution, flattened to $x \in \mathbb{R}^{B \times L \times D}$, and fed to a video-DiT diffusion backbone. Training follows a flow-matching formulation:

$$x_t = t\,x_1 + (1-t)\,x_0, \qquad v_t = x_1 - x_0$$

and the corresponding DiT block minimizes:

$$\mathbb{E}_{x_0, x_1, t}\,\big\| u(x_t, c_{\rm txt}, t; \theta) - v_t \big\|^2$$

where $c_{\rm txt}$ denotes umT5 text embeddings. This structure underpins downstream tasks such as image-to-video generation, instruction-guided video editing, and personalized video generation, spanning eight task domains at multiple model scales (1.3B and 14B parameters). All models and source code are open-source (Wan et al., 26 Mar 2025).
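
A compact sketch of a flow-matching training step on the patchified latents; the model call signature, text-conditioning tensor, and shapes are placeholders rather than the Wan training code:

```python
import torch

def flow_matching_loss(model, x1, c_txt):
    """Rectified-flow style objective: interpolate between noise x0 and data x1,
    and regress the model output onto the constant velocity v_t = x1 - x0."""
    x0 = torch.randn_like(x1)                         # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)     # one timestep per sample
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))          # broadcast over token dims
    x_t = t_ * x1 + (1.0 - t_) * x0
    v_t = x1 - x0
    pred = model(x_t, c_txt, t)                       # DiT predicts the velocity
    return torch.mean((pred - v_t) ** 2)
```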

7. Comparative Evaluation and Impact

Both tvGP-VAE and Wan-VAE represent significant advances in spatio-temporal generative modeling. tvGP-VAE demonstrates the importance of latent correlation structures for reconstruction quality in high-dimensional sequence data (Campbell et al., 2020). Wan-VAE, as a lightweight (127M-parameter) front-end for large-scale diffusion transformers, delivers high PSNR (approximately 33 dB) and reduced inference latency ($2.5\times$ speed-up), validated by superior reconstruction metrics and ablation studies (Wan et al., 26 Mar 2025). A plausible implication is that architectural innovations such as causal convolutions, Kronecker-separable kernel parameterizations, and chunked caching are critical for scalable, efficient, and high-fidelity video generation.

Summary Table: Core Attributes

| Model | Latent Structure | Spatio-temporal Compression | Key Technical Innovations |
|---|---|---|---|
| tvGP-VAE | Tensor-normal GP prior, Kronecker covariances | Explicit kernel modeling of width, height, time | Closed-form KL; sparse bidiagonal precision |
| Wan-VAE | Latent $z$ via 3D causal conv U-Net | Temporal (×4), spatial (×8) | RMSNorm; chunked feature cache |

Spatio-temporal VAEs are central to advancing unsupervised video representation learning and generation, enabling accurate modeling of spatio-temporal dependencies, resource-efficient deployment, and robust integration with diffusion transformer frameworks for state-of-the-art image and video synthesis.
