
Tensor-variate GP Prior VAE

Updated 4 March 2026
  • The paper introduces a tensor-variate GP prior VAE that captures multi-dimensional latent correlations using Kronecker-structured covariances.
  • It employs mode-specific kernel functions and a generalized reparameterization for efficient variational inference, reducing computational overhead.
  • Experimental results on spatiotemporal datasets show enhanced spatial texture fidelity and improved temporal motion modeling.

The tensor-variate Gaussian Process Prior Variational Autoencoder (tvGP-VAE) is a structured extension of the variational autoencoder (VAE) framework that replaces independent, univariate latent priors and posteriors with tensor-variate Gaussian process (GP) distributions. This approach enables explicit modeling of latent correlation structures—such as spatial or temporal dependencies—via mode-specific kernel functions, allowing for the principled capture of multi-dimensional correlations in high-dimensional, structured data such as spatiotemporal image sequences (Campbell et al., 2020).

1. Motivation and Theoretical Foundations

Standard VAE implementations employ an independent standard Gaussian prior on the latent space, $p(Z) = \prod_{k=1}^K \mathcal{N}(z_k; 0, 1)$, and a mean-field Gaussian variational posterior, $q(Z \mid X) = \prod_{k=1}^K \mathcal{N}(z_k; \mu_k(X), \sigma_k^2(X))$. While computationally efficient, this independence assumption neglects the inherent correlations often present in latent factors, particularly in structured data with dependencies across multiple axes (e.g., spatial or temporal). This can force the VAE's decoder to model these correlations explicitly or induce posterior collapse, resulting in latent representations that inadequately capture structured variability (Campbell et al., 2020).
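
To make the contrast concrete, here is a minimal NumPy sketch of this mean-field baseline: the standard reparameterization and the closed-form KL against the $\mathcal{N}(0, I)$ prior. All names and values are illustrative, not from the paper; the point is that nothing in either expression couples different latent factors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mean-field posterior parameters for K independent latent scalars,
# as an encoder would produce them (values here are illustrative).
K = 8
mu = rng.normal(size=K)          # mu_k(X)
log_var = rng.normal(size=K)     # log sigma_k^2(X)

# Standard reparameterization: z_k = mu_k + sigma_k * eps_k.
eps = rng.standard_normal(K)
z = mu + np.exp(0.5 * log_var) * eps

# Closed-form KL( q(Z|X) || p(Z) ) against the N(0, I) prior.
# No term couples different k: cross-factor correlation is ignored
# by construction.
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(z.shape, kl)
```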

2. Tensor-variate Gaussian Process Prior Formulation

The core contribution of tvGP-VAE lies in adopting a tensor-GP prior for the latent variable. Specifically, each latent factor $Z_k \in \mathbb{R}^{D_1 \times \cdots \times D_M}$ is modeled as a tensor-valued Gaussian with a Kronecker-structured covariance across $M$ modes:

$$p(Z_k) = \mathcal{TN}\left(Z_k;\, 0,\, \{\Omega_k^{(m)}\}_{m=1}^M\right),$$

where $\Omega_k^{(m)} \in \mathbb{R}^{D_m \times D_m}$ is the mode-$m$ covariance, and $\mathcal{TN}$ denotes a tensor-normal distribution. By the vectorization-Kronecker property,

$$\operatorname{vec}(Z_k) \sim \mathcal{N}\left(0,\, \Omega_k^{(M)} \otimes \cdots \otimes \Omega_k^{(1)}\right).$$

Kernel functions (e.g., squared exponential, Matérn) parameterize the mode-wise covariance matrices, with hyperparameters such as the lengthscale $\ell$ controlling the range of correlation in each mode (Campbell et al., 2020).
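
As a concrete illustration of this construction, the sketch below builds squared-exponential covariance matrices for two modes and forms the implied covariance of $\operatorname{vec}(Z_k)$ via the Kronecker product. The mode sizes, lengthscales, and jitter term are illustrative assumptions, not values from the paper; in practice the full Kronecker matrix would never be materialized.

```python
import numpy as np

def rbf_kernel(n, lengthscale):
    """Squared-exponential covariance over n evenly spaced index points,
    with a small jitter term for numerical stability."""
    t = np.arange(n, dtype=float)
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2) + 1e-6 * np.eye(n)

# Two modes (e.g., a spatial axis and a temporal axis), each with its
# own kernel lengthscale.
D = (4, 6)
Omega = [rbf_kernel(D[0], lengthscale=1.5),   # Omega^{(1)}, spatial mode
         rbf_kernel(D[1], lengthscale=3.0)]   # Omega^{(2)}, temporal mode

# Covariance of vec(Z_k) implied by the tensor-normal prior:
# Omega^{(2)} kron Omega^{(1)}.
Sigma_vec = np.kron(Omega[1], Omega[0])
assert Sigma_vec.shape == (D[0] * D[1], D[0] * D[1])
```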

3. Variational Inference and Posterior Structure

The inference model also adopts a tensor-GP form for each $Z_k$, yielding a variational posterior:

$$q(Z_k \mid X) = \mathcal{TN}\left(Z_k;\, \mathcal{M}_k(X),\, \{\Sigma_k^{(m)}(X)\}_{m=1}^M\right),$$

where $\mathcal{M}_k(X)$ is the mean tensor and $\Sigma_k^{(m)}(X)$ are the mode-wise covariance factors. Under vectorization,

$$\operatorname{vec}(Z_k) \sim \mathcal{N}\left(\operatorname{vec}(\mathcal{M}_k),\, \Sigma_k^{(M)} \otimes \cdots \otimes \Sigma_k^{(1)}\right).$$

Sampling from this posterior is achieved via a generalized reparameterization:

$$Z_k = \mathcal{M}_k + \mathcal{E} \times_1 L_k^{(1)} \times_2 \cdots \times_M L_k^{(M)},$$

where $\mathcal{E}$ is a tensor of i.i.d. standard normals, and $L_k^{(m)}$ is the Cholesky factor of the mode-$m$ covariance $\Sigma_k^{(m)}$ (Campbell et al., 2020).
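
A minimal NumPy sketch of this generalized reparameterization follows. The helper name `mode_product` is hypothetical, and the SPD matrices are illustrative placeholders for encoder outputs; for a 2-mode tensor the update reduces to $Z_k = \mathcal{M}_k + L_k^{(1)} \mathcal{E}\, L_k^{(2)\top}$.

```python
import numpy as np

def mode_product(T, A, mode):
    """Mode-`mode` product T x_mode A: contracts axis `mode` of tensor T
    against the columns of matrix A."""
    T = np.moveaxis(T, mode, 0)
    out = np.tensordot(A, T, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(0)
D = (4, 6)

# Mode-wise covariance factors (illustrative SPD matrices standing in
# for encoder outputs) and their Cholesky factors.
Sigma = [np.eye(d) + 0.3 * np.ones((d, d)) for d in D]
L = [np.linalg.cholesky(S) for S in Sigma]

M = rng.normal(size=D)          # posterior mean tensor M_k(X)
E = rng.standard_normal(D)      # i.i.d. standard normal tensor

# Z_k = M_k + E x_1 L^{(1)} x_2 L^{(2)}; equivalent to sampling
# vec(Z_k) ~ N(vec(M_k), Sigma^{(2)} kron Sigma^{(1)}).
Z = M + mode_product(mode_product(E, L[0], 0), L[1], 1)
```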

4. Learning Objective and Computational Considerations

The evidence lower bound (ELBO) for tvGP-VAE is expressed as:

$$\mathcal{L} = \mathbb{E}_{q(Z|X)}\!\left[\log p(X|Z)\right] - \sum_{k=1}^K D_{\mathrm{KL}}\!\left[\,q(Z_k|X) \,\|\, p(Z_k)\,\right],$$

where the KL divergence—between Kronecker-Gaussian prior and posterior—admits a closed form:

$$\mathrm{KL}_k = \frac{1}{2}\left[\,\prod_{m=1}^M \operatorname{tr}\!\left(\Omega_k^{(m)\,-1}\Sigma_k^{(m)}\right) + \operatorname{vec}(\mathcal{M}_k)^\top\!\left(\bigotimes_{m=M}^{1}\Omega_k^{(m)\,-1}\right)\operatorname{vec}(\mathcal{M}_k) + \sum_{m=1}^M \frac{D'}{D_m}\left(\log\left|\Omega_k^{(m)}\right| - \log\left|\Sigma_k^{(m)}\right|\right) - D'\,\right],$$

with $D' = \prod_{m=1}^M D_m$. The Kronecker separability dramatically reduces computational overhead, allowing determinants and traces to be decomposed into mode-wise operations (Campbell et al., 2020).
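
A sketch of this mode-wise KL evaluation under the stated Kronecker assumptions is given below. The helper names (`mode_solve`, `kronecker_gaussian_kl`) are hypothetical; the $D'/D_m$ weights on the log-determinants come from the Kronecker identity $|\bigotimes_m A^{(m)}| = \prod_m |A^{(m)}|^{D'/D_m}$.

```python
import numpy as np

def mode_solve(T, A, mode):
    """Apply A^{-1} along axis `mode` of tensor T (a mode-wise solve)."""
    T0 = np.moveaxis(T, mode, 0)
    X = np.linalg.solve(A, T0.reshape(T0.shape[0], -1)).reshape(T0.shape)
    return np.moveaxis(X, 0, mode)

def kronecker_gaussian_kl(M, Sigma, Omega):
    """KL( N(vec(M), kron(Sigma)) || N(0, kron(Omega)) ), computed with
    per-mode traces, solves, and slogdets; the full D' x D' Kronecker
    covariance is never materialized."""
    D = M.shape
    D_total = int(np.prod(D))

    # Trace of a Kronecker product factorizes into a product of traces.
    trace_term = np.prod([np.trace(np.linalg.solve(Om, Sg))
                          for Om, Sg in zip(Omega, Sigma)])

    # vec(M)^T (kron Omega^{-1}) vec(M) via successive mode-wise solves.
    Q = M
    for m, Om in enumerate(Omega):
        Q = mode_solve(Q, Om, m)
    quad_term = float(np.sum(M * Q))

    # |kron_m A^{(m)}| = prod_m |A^{(m)}|^{D'/D_m}, hence the weights.
    logdet_term = sum((D_total / d) * (np.linalg.slogdet(Om)[1]
                                       - np.linalg.slogdet(Sg)[1])
                      for d, Om, Sg in zip(D, Omega, Sigma))

    return 0.5 * (trace_term + quad_term + logdet_term - D_total)

# Example usage with illustrative 2-mode factors:
rng = np.random.default_rng(0)
D = (4, 6)
M = rng.normal(size=D)
Sigma = [np.eye(d) * 0.5 for d in D]
Omega = [np.eye(d) for d in D]
print(kronecker_gaussian_kl(M, Sigma, Omega))
```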

5. Encoder-Decoder Architecture

The encoder network $f_\phi(X)$ produces, for each $k$ and each mode $m$, a mean vector $m_k^{(m)}$ and Cholesky precision factors for $\Sigma_k^{(m)}$. The architectural backbone consists of convolutional operations: 2D convolutions downsample spatially per time step, followed by temporal 1D convolutions. Outputs are linearly projected into mode-wise latent parameters. The decoder $g_\theta(Z)$ aggregates all $Z_k$ along a channel dimension and applies sequences of transpose convolutions (spatial, then temporal) to reconstruct the input $X$. The reconstruction distribution $p(X|Z)$ can be Bernoulli or Gaussian, parameterized by $g_\theta$ (Campbell et al., 2020).
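
The following PyTorch sketch illustrates the encoder backbone described above: 2D spatial convolutions per frame, a temporal 1D convolution, then a linear projection to mode-wise latent parameters. The class name, layer widths, and pooling choice are assumptions for illustration, not the paper's exact architecture; a decoder would mirror this with transpose convolutions.

```python
import torch
import torch.nn as nn

class TvGPVAEEncoder(nn.Module):
    """Illustrative encoder sketch for inputs X of shape
    (batch, time, channels, height, width)."""
    def __init__(self, in_ch=3, hidden=32, latent_params=64):
        super().__init__()
        # 2D convolutions downsample each frame spatially.
        self.spatial = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        # 1D convolution mixes information across the time axis.
        self.temporal = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        # Linear head emits the mode-wise latent parameters (means and
        # Cholesky factor entries for each Sigma_k^{(m)}).
        self.head = nn.LazyLinear(latent_params)

    def forward(self, x):
        b, t = x.shape[:2]
        h = self.spatial(x.flatten(0, 1))        # (b*t, hidden, h', w')
        h = h.mean(dim=(2, 3)).view(b, t, -1)    # pool spatial dims
        h = self.temporal(h.transpose(1, 2))     # (b, hidden, t)
        return self.head(h.flatten(1))           # mode-wise parameters

enc = TvGPVAEEncoder()
params = enc(torch.randn(2, 8, 3, 64, 64))       # (batch=2, time=8) -> (2, 64)
```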

6. Training Algorithm and Computational Strategies

Training proceeds as follows: (1) encoding inputs to latent GP parameters, (2) sampling via the generalized reparameterization, (3) decoding to reconstruct inputs, (4) evaluating the reconstruction and closed-form KL losses, and (5) optimizing their sum via stochastic backpropagation. Crucial computational efficiencies stem from: (a) AR(1)-structured precision matrices for mode-wise covariances; (b) low-rank, Kronecker-separable representations for both means and covariances; (c) avoidance of full covariance operations over the entire latent tensor. These strategies ensure that computational cost scales linearly or quadratically with the per-mode dimensions $D_m$ rather than with the full latent size $D' = \prod_m D_m$ (Campbell et al., 2020).
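
Point (a) can be made concrete: for a stationary AR(1) process, the precision matrix is tridiagonal and admits a bidiagonal factor, so storage, solves, and log-determinants along that mode are all $O(D_m)$. The sketch below (function name hypothetical) constructs such a factor and uses it to draw one sample.

```python
import numpy as np

def ar1_whitening_factor(n, rho, sigma=1.0):
    """Lower-bidiagonal G with G^T G equal to the precision matrix of a
    stationary AR(1) process; storage and solves are O(n).
    Row 0 whitens z_0 ~ N(0, sigma^2/(1-rho^2)); row t encodes
    z_t - rho*z_{t-1} ~ N(0, sigma^2)."""
    G = np.zeros((n, n))
    G[0, 0] = np.sqrt(1.0 - rho**2) / sigma
    idx = np.arange(1, n)
    G[idx, idx] = 1.0 / sigma
    G[idx, idx - 1] = -rho / sigma
    return G

# Sampling with this precision: cov(G^{-1} eps) = (G^T G)^{-1}, and
# log|precision| = 2 * sum(log(diag(G))) comes for free.
n, rho = 50, 0.9
G = ar1_whitening_factor(n, rho)
eps = np.random.default_rng(0).standard_normal(n)
z = np.linalg.solve(G, eps)    # one draw with AR(1) covariance
```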

7. Experimental Results and Empirical Analysis

tvGP-VAE was evaluated on the Sprites dataset (animated image sequences), comparing a standard (0th-order) VAE against temporal (1st-order), spatial (2nd-order), and spatio-temporal (3rd-order) tvGP-VAE variants. Negative log-likelihoods (NLL; lower is better) were reported:

| Model | NLL (mean ± std) |
|---|---|
| Standard VAE | 90,616 ± 3.8 |
| Temporal tvGP-VAE | 90,578 ± 6.2 |
| Spatial tvGP-VAE | 90,587 ± 11.3 |
| Spatio-temporal tvGP-VAE | 90,684 ± 15.7 |

Spatial kernels produced reconstructions with sharper texture fidelity, while temporal kernels more accurately modeled motion. The fully spatio-temporal variant yielded increased uncertainty in regions corresponding to movement, illustrating a trade-off between latent structure complexity and model certainty (Campbell et al., 2020).

The Kronecker-structured tensor-GP design in tvGP-VAE enables explicit inductive biases—by choosing which data dimensions receive kernel structure (e.g., only spatial, only temporal, or both). This flexibility allows practitioners to balance prior expressivity against overparameterization and posterior variance. Possible extensions include non-stationary or deep kernels, non-Gaussian likelihoods, non-Kronecker cross-mode kernels, and application to higher-order tensor data (e.g., 4D fMRI). A primary limitation is the increase in hyperparameters (kernels per mode per factor), which can complicate tuning and posterior calibration. Sparse or AR(1) precision structures keep computation tractable, but may restrict modeling of long-range dependencies (Campbell et al., 2020).

Closely related methodologies include models that deploy GP priors for latent spaces but operate with a less structured or factorized kernel design, such as the factorized GP-VAE (FGP-VAE) (Jazbec et al., 2020) and multi-modal GP-VAEs for medical imaging scenarios (Hamghalam et al., 2021). The distinguishing feature of tvGP-VAE is the explicit Kronecker tensor structure for GP priors over multi-dimensional modes—enabling efficient, scalable modeling of structured latent correlations within the VAE paradigm.
