3D Causal Variational Autoencoder

Updated 23 June 2026
  • The model introduces a generative framework that leverages temporal structure and interventions to disentangle independent causal factors in high-dimensional 3D visual data.
  • It employs an autoencoder combined with a normalizing flow and a dynamic Bayesian network to effectively factorize and infer vector-valued latent variables.
  • The approach offers provable identifiability guarantees and demonstrates strong empirical performance on challenging 3D scenes and interventional domains.

A 3D Causal Variational Autoencoder (3D CausalVAE) is a generative framework designed for learning causal representations from sequential high-dimensional visual data, such as rendered image sequences, where the underlying latent causal factors may be both scalar and multidimensional (e.g., 3D positions and 3D rotations). The approach leverages the temporal structure of the data and a record of interventions to identify and disentangle the independent underlying causes—extending previous identifiability results to the setting of vector-valued causal factors. The CITRIS ("Causal Identifiability from Temporal Intervened Sequences") framework exemplifies a 3D CausalVAE by providing theoretical guarantees, a flexible neural implementation, and empirical validation on challenging 3D scenes (Lippe et al., 2022).

1. Generative Model Structure

The model assumes a latent dynamical system comprising $K$ causal factors $C^t = (C^t_1, \ldots, C^t_K)$, where each $C^t_i \in \mathbb{R}^{M_i}$ can be multidimensional (enabling, for example, vector-valued 3D rotations). Observed data $X^t \in \mathbb{R}^N$ at each timestep $t$ are generated by a bijective observation function $h$ that takes the causal factors and observation noise $E_o^t$ as input: $X^t = h(C^t, E_o^t)$.

Interventions are specified by a binary vector $I^t \in \{0,1\}^K$ (with $I^t_i = 1$ if factor $C^t_i$ has been intervened upon), and a latent regime variable $R^t$ may confound the intervention targets $I^t$. The process is modeled as a Dynamic Bayesian Network, with parentage stipulated so that each $C^t_i$ depends on a subset of $C^{t-1}$ and its own intervention target $I^t_i$; $C^t$ and $E_o^t$ generate $X^t$.
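
The generative process described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the dataset generator used in the paper: the factors are scalar, the transition is a hypothetical linear map `A`, an intervention simply resamples the factor, and the observation function is omitted.

```python
import numpy as np

def simulate_sequence(T=5, K=3, p_intervene=0.2, seed=0):
    """Toy temporal process with K scalar factors and known binary targets."""
    rng = np.random.default_rng(seed)
    # Hypothetical linear dependence of each factor on the previous step.
    A = 0.5 * np.eye(K) + 0.1 * rng.standard_normal((K, K))
    C = np.zeros((T, K))             # causal factors per timestep
    I = np.zeros((T, K), dtype=int)  # intervention targets per timestep
    C[0] = rng.standard_normal(K)
    for t in range(1, T):
        I[t] = rng.random(K) < p_intervene                    # sample binary targets
        natural = A @ C[t - 1] + 0.1 * rng.standard_normal(K)  # depends on step t-1 only
        resampled = rng.standard_normal(K)                     # intervention replaces the mechanism
        C[t] = np.where(I[t] == 1, resampled, natural)
    return C, I

C, I = simulate_sequence()
print(C.shape, I.shape)
```

Because every factor at step `t` is computed from step `t-1` alone, the sketch has no instantaneous effects, which matches the Dynamic Bayesian Network structure described in the text.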

The joint density over $T$ steps is factorized as:

$$p(X^{1:T}, C^{1:T} \mid I^{2:T}) = p(C^1) \prod_{t=2}^{T} \prod_{i=1}^{K} p(C^t_i \mid C^{t-1}, I^t_i) \prod_{t=1}^{T} p(X^t \mid C^t)$$

By exploiting the invertibility of $h$ (with learned inverse $g_\theta$), the model induces a decoder/likelihood and transition prior in the latent space $z^t = g_\theta(X^t) \in \mathbb{R}^M$. The one-step conditional likelihood is:

$$p_{\phi,\theta}(X^{t+1} \mid X^t, I^{t+1}) = \left|\det \frac{\partial g_\theta(X^{t+1})}{\partial X^{t+1}}\right| \, p_\phi(z^{t+1} \mid z^t, I^{t+1})$$

The transition prior further factorizes over "blocks" of latents assigned to each causal factor and a "junk" block:

$$p_\phi(z^{t+1} \mid z^t, I^{t+1}) = \prod_{i=0}^{K} p_\phi\left(z^{t+1}_{\psi_i} \mid z^t, I^{t+1}_i\right)$$

with block assignments $\psi: \{1, \ldots, M\} \to \{0, \ldots, K\}$, where $z_{\psi_i}$ collects the latent dimensions assigned to block $i$, and $I^{t+1}_0 = 0$ by convention.
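
The block factorization of the transition prior can be sketched numerically: the log-prior is a sum of per-block log-densities, where each block gathers the latent dimensions assigned to it. The snippet below is only an illustration. It assumes a diagonal Gaussian per latent dimension, and the `mean` and `std` arrays stand in for what a conditional network would predict from the previous latent and the intervention target; the assignment array `psi` is a hypothetical example.

```python
import numpy as np

def gaussian_logpdf(z, mean, std):
    """Elementwise log-density of a univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi) + 2 * np.log(std) + ((z - mean) / std) ** 2)

def blockwise_log_prior(z_next, mean, std, psi, K):
    """Per-block log-densities; block i sums the dims j with psi[j] == i."""
    per_dim = gaussian_logpdf(z_next, mean, std)
    return np.array([per_dim[psi == i].sum() for i in range(K + 1)])

psi = np.array([0, 1, 1, 2])              # hypothetical assignment of M=4 latents to blocks 0..2
z_next = np.array([0.1, -0.3, 0.5, 1.2])  # latent at the next timestep
mean = np.zeros(4)                        # placeholder for predicted prior means
std = np.ones(4)                          # placeholder for predicted prior scales

per_block = blockwise_log_prior(z_next, mean, std, psi, K=2)
total_log_prior = per_block.sum()         # log of the product over blocks
print(per_block, total_log_prior)
```

Summing the per-block terms gives the log of the product over blocks, so block 0 (the "junk" block) contributes in exactly the same way as the causal-factor blocks.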

2. Inference, Variational Posterior, and Normalizing Flow

The variational posterior $q_\theta(z^t \mid X^t)$ is factored as a product of independent per-latent Gaussian distributions:

$$q_\theta(z^t \mid X^t) = \prod_{j=1}^{M} \mathcal{N}\left(z^t_j;\ \mu_j(X^t),\ \sigma_j^2(X^t)\right)$$
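
A factorized Gaussian posterior of this kind is typically sampled with the reparameterization trick. The following is a small sketch of that standard technique, assuming the encoder outputs a mean and a log standard deviation per latent dimension; the function names are illustrative and not taken from the CITRIS code.

```python
import numpy as np

def sample_posterior(mu, log_sigma, rng):
    """Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def log_q(z, mu, log_sigma):
    """Log-density of the factorized Gaussian, summed over latent dimensions."""
    per_dim = -0.5 * np.log(2 * np.pi) - log_sigma \
              - 0.5 * ((z - mu) / np.exp(log_sigma)) ** 2
    return per_dim.sum(axis=-1)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])   # hypothetical encoder means
log_sigma = np.zeros(3)           # hypothetical encoder log-scales (sigma = 1)
z = sample_posterior(mu, log_sigma, rng)
print(z, log_q(z, mu, log_sigma))
```

Because the posterior factorizes over dimensions, its log-density is a plain sum of univariate terms, which is what makes the per-block KL terms cheap to evaluate.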

For enhanced expressivity and disentanglement, the AE+NF (Autoencoder + Normalizing Flow) extension is used. Here, an autoencoder (encoder $e$, decoder $d$) is first trained without constraints on its latent space; after training, these components are frozen and an invertible normalizing flow $f$ maps the autoencoder's embeddings $u^t = e(X^t)$ to the latent variables $z^t = f(u^t)$. The completed approximate posterior is defined as:

$$z^t = f(u^t), \qquad u^t \sim q(u^t \mid X^t)$$

with a change-of-variables correction:

$$\log q(z^t \mid X^t) = \log q(u^t \mid X^t) - \sum_{\ell=1}^{L} \log \left|\det J_{f_\ell}\right|$$

where $f = f_L \circ \cdots \circ f_1$, $J_{f_\ell}$ is the Jacobian of layer $f_\ell$ evaluated at its input, and each $f_\ell$ is a coupling layer, using MAF/affine autoregressive transformations and interleaved normalization and invertible 1x1 convolutions (inspired by Glow).
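
To make the change-of-variables correction concrete, here is a sketch of a single affine coupling layer with its log-determinant and exact inverse. It is a simplification: the flow described above also uses autoregressive transformations, normalization layers and invertible 1x1 convolutions, none of which appear here, and the weight matrices `w_s` and `w_t` are hypothetical stand-ins for small neural networks.

```python
import numpy as np

def coupling_forward(u, w_s, w_t):
    """Affine coupling: the first half passes through, the second is scaled and shifted."""
    d = u.shape[-1] // 2
    u1, u2 = u[..., :d], u[..., d:]
    s = np.tanh(u1 @ w_s)           # log-scale, bounded for numerical stability
    t = u1 @ w_t                    # shift
    z2 = u2 * np.exp(s) + t
    log_det = s.sum(axis=-1)        # log|det dz/du| of the triangular Jacobian
    return np.concatenate([u1, z2], axis=-1), log_det

def coupling_inverse(z, w_s, w_t):
    """Exact inverse: recompute s and t from the untouched half."""
    d = z.shape[-1] // 2
    z1, z2 = z[..., :d], z[..., d:]
    s = np.tanh(z1 @ w_s)
    t = z1 @ w_t
    return np.concatenate([z1, (z2 - t) * np.exp(-s)], axis=-1)

rng = np.random.default_rng(0)
w_s = 0.1 * rng.standard_normal((2, 2))
w_t = rng.standard_normal((2, 2))
u = rng.standard_normal((3, 4))     # three hypothetical 4-dimensional embeddings
z, log_det = coupling_forward(u, w_s, w_t)
u_rec = coupling_inverse(z, w_s, w_t)
print(np.abs(u - u_rec).max(), log_det)
```

Given the log-density of an embedding, the log-density of the transformed latent is obtained by subtracting `log_det`; stacking several such layers adds their log-determinants.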

3. Learning Objective and Block Assignment

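Learning is performed by maximizing a variational lower bound (ELBO) on the conditional log-likelihood for each transition $(X^t, X^{t+1})$:

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\theta(z^{t+1} \mid X^{t+1})}\left[\log p_\theta(X^{t+1} \mid z^{t+1})\right] - \mathbb{E}_{q_\theta(z^t \mid X^t)}\left[\sum_{i=0}^{K} D_{\mathrm{KL}}\left(q_\theta(z^{t+1}_{\psi_i} \mid X^{t+1}) \,\|\, p_\phi(z^{t+1}_{\psi_i} \mid z^t, I^{t+1}_i)\right)\right]$$

Here $z_{\psi_i}$ denotes the latents assigned to block $i$. The first term drives reconstructions, while the KL divergences align each causal-factor block (and the nuisance block) to the respective transition prior; the objective additionally encourages nuisance information to concentrate in block 0.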
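
The reconstruction term and the per-block KL terms described above can be combined as in the following sketch. It assumes diagonal Gaussians for both posterior and prior (so the KL has a closed form) and treats the reconstruction log-likelihood as a given number; the arrays and the assignment `psi` are hypothetical values chosen for illustration.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sig_q, mu_p, sig_p):
    """Elementwise KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) )."""
    return (np.log(sig_p / sig_q)
            + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sig_p ** 2)
            - 0.5)

def negative_elbo(recon_log_lik, mu_q, sig_q, mu_p, sig_p, psi, K):
    """Loss = -(reconstruction log-likelihood - sum of per-block KL terms)."""
    kl = kl_diag_gaussians(mu_q, sig_q, mu_p, sig_p)
    kl_blocks = np.array([kl[psi == i].sum() for i in range(K + 1)])
    return -(recon_log_lik - kl_blocks.sum()), kl_blocks

psi = np.array([0, 1, 1, 2])            # hypothetical latent-to-block assignment
mu_q = np.array([0.0, 1.0, 0.0, 0.0])   # posterior means
sig_q = np.ones(4)                      # posterior scales
mu_p = np.zeros(4)                      # prior means (placeholders for network outputs)
sig_p = np.ones(4)                      # prior scales
loss, kl_blocks = negative_elbo(-2.0, mu_q, sig_q, mu_p, sig_p, psi, K=2)
print(loss, kl_blocks)
```

Reporting the KL per block, rather than only the total, makes it easy to see which block is paying for a mismatch with its transition prior.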

A target-classification (TC) loss further encourages each block $z^{t+1}_{\psi_i}$ to be selectively informative about its intervention target $I^{t+1}_i$ and invariant to others. This is implemented by a learned classifier that predicts $I^{t+1}$ from $(z^t, z^{t+1}_{\psi_i})$, with gradients back-propagated selectively.

Block assignments $\psi(j)$ for latent dimension $j$ are parameterized by a categorical variable over $\{0, \ldots, K\}$, implemented with Gumbel-Softmax for sampling during training, and argmax assignment at test time.
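
The assignment mechanism can be sketched as follows: each latent dimension has a row of logits over the blocks, a relaxed sample is drawn with Gumbel-Softmax during training, and the argmax is used at test time. This is a generic NumPy illustration of the technique; the temperature `tau` and the random logits are assumptions, not values from the paper.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def hard_assignment(logits):
    """Test-time assignment: pick the most likely block per latent dimension."""
    return logits.argmax(axis=-1)

rng = np.random.default_rng(0)
M, K = 6, 2                                  # 6 latents, blocks 0..2
logits = rng.standard_normal((M, K + 1))     # hypothetical learned logits
soft = gumbel_softmax(logits, tau=1.0, rng=rng)
psi = hard_assignment(logits)
print(soft.round(2), psi)
```

Lower temperatures push the relaxed samples toward one-hot vectors, at the cost of higher-variance gradients.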

4. Identifiability Result

Suppose:

  • the observation function $h$ is invertible,
  • the latent process is stationary, first-order Markov, with no instantaneous effects,
  • interventions $I^t$ are known, non-deterministic, and not always joint,
  • the learned model and the latent dimension $M$ are sufficiently expressive.

Then, maximizing the conditional likelihood $p_{\phi,\theta}(X^{t+1} \mid X^t, I^{t+1})$ subject to maximizing entropy in block 0 provably recovers for each $i$ the minimal causal variable of $C_i$ (the component of $C_i$ which responds to intervention $I_i$), up to blockwise invertible transformations.

All equivalent maximizers correspond to assignments that only rearrange intervention-dependent components among blocks, but the entropy penalty on block 0 enforces a unique assignment. Identifiability is thus assured for multidimensional, intervention-targeted, temporal causal factors under the specified assumptions. Two factors that are always, or never, intervened upon jointly are not separable within this framework.
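
The conditions on the intervention targets can be checked on a recorded dataset before training. The sketch below is a simple diagnostic written for illustration, not part of CITRIS: it flags factors whose target never changes (never or always intervened) and pairs of factors whose targets coincide at every timestep, which the framework cannot separate.

```python
import numpy as np

def constant_targets(I):
    """Indices of factors that are never, or always, intervened upon."""
    return [i for i in range(I.shape[1]) if I[:, i].min() == I[:, i].max()]

def inseparable_pairs(I):
    """Pairs (i, j) whose intervention targets are identical at every timestep."""
    K = I.shape[1]
    return [(i, j) for i in range(K) for j in range(i + 1, K)
            if np.array_equal(I[:, i], I[:, j])]

# Hypothetical intervention record: rows are timesteps, columns are factors.
I = np.array([[1, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 1, 0, 0]])
print(constant_targets(I), inseparable_pairs(I))
```

In this example factors 0 and 1 are always intervened together and factor 3 is never intervened, so only factor 2 satisfies the requirements on its own.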

5. Neural Architecture and Training Procedure

The 3D CausalVAE instantiation in CITRIS employs:

  • Encoder $q_\theta(z^t \mid X^t)$: 4 strided conv layers (stride 2) with 64 channels, BatchNorm+SiLU, and a final conv, then flattened to yield $\mu_j$, $\sigma_j$ for each latent $z_j$ via parallel linear heads.
  • Decoder $p_\theta(X^t \mid z^t)$: Linear layer followed by reshaping and 4 upsampling stages, each succeeded by a residual block (2 conv layers with BatchNorm+SiLU). Output via a final conv and Tanh activation.
  • Transition Prior $p_\phi$: Autoregressive MADE network over the $M$ latent dimensions, conditioned on $z^t$ and $I^{t+1}$, predicting Gaussian mean and scale per block.
  • Normalizing Flow $f$: 4–6 affine/MAF coupling layers, interleaved with ActNorm and invertible $1 \times 1$ convs.
  • Assignment $\psi$: Latent-to-block mapping via Gumbel-Softmax over the $M$ latents and $K+1$ blocks.
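
As a quick sanity check on the encoder description, the spatial size after the four stride-2 convolutions can be computed with the usual output-size formula. The input resolution (64), kernel size (3) and padding (1) used below are assumptions made for the sake of the example; only the stride, the number of layers and the channel count come from the list above.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 64                  # assumed input resolution
sizes = [size]
for _ in range(4):         # four stride-2 conv layers, as in the encoder description
    size = conv_out(size, kernel=3, stride=2, padding=1)  # kernel and padding are assumptions
    sizes.append(size)

flat_features = 64 * size * size   # 64 channels times the final spatial grid
print(sizes, flat_features)
```

Under these assumptions each layer halves the resolution, and the flattened feature vector fed to the linear heads has `flat_features` entries.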

Training uses Adam with batch size 512 for 600–1000 epochs on the 3D data.

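One training step proceeds as follows: encode a pair of consecutive frames $(X^t, X^{t+1})$, sample the block assignment with Gumbel-Softmax, evaluate the reconstruction term and the per-block KL terms of the ELBO, add the target-classification loss, and update the parameters with Adam.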

6. Empirical Evaluation on 3D Scene Sequences

CITRIS is evaluated on the Temporal-Causal3DIdent dataset:

  • Seven causal factors: object 3D position $(x, y, z)$, object rotations, spotlight rotation, object/spotlight/background hues, and object shape.
  • Interventions: each factor can be intervened upon, in which case its value is randomly reset.
  • Train/test split: 250k training, 10k test frames.

Metrics include blockwise $R^2$ correlation with the ground-truth factors (both diagonal, $R^2$-diag, and separated, $R^2$-sep), Spearman correlation, and "triplet evaluation", which combines blocks from different sequences in latent space and measures recovery fidelity via a specialized CNN encoder.
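
A blockwise correlation metric of this kind can be sketched as a matrix of $R^2$ scores, one per (latent block, ground-truth factor) pair. The version below is an illustrative simplification: it fits a linear least-squares predictor in-sample and summarizes the matrix with a diagonal mean and an off-diagonal maximum per factor. The exact regressor and the exact definitions of the diagonal and separated scores used in the paper may differ, so both should be read as assumptions.

```python
import numpy as np

def r2_score(pred, target):
    """Coefficient of determination."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def block_factor_r2(z, factors, psi, K):
    """R^2 of a linear fit from each latent block (1..K) to each scalar factor."""
    n = z.shape[0]
    R = np.zeros((K, factors.shape[1]))
    for i in range(1, K + 1):
        X = np.hstack([z[:, psi == i], np.ones((n, 1))])   # block latents plus a bias column
        for k in range(factors.shape[1]):
            w, *_ = np.linalg.lstsq(X, factors[:, k], rcond=None)
            R[i - 1, k] = r2_score(X @ w, factors[:, k])
    return R

def diag_and_sep(R):
    """Mean diagonal score and mean of the largest off-diagonal score per factor."""
    off = R - np.diag(np.diag(R))       # zero out the diagonal
    return np.mean(np.diag(R)), np.mean(off.max(axis=0))

rng = np.random.default_rng(0)
factors = rng.standard_normal((500, 2))                  # two hypothetical scalar factors
z = np.column_stack([rng.standard_normal(500),           # block 0: nuisance
                     2 * factors[:, 0] + 1,              # block 1 encodes factor 0
                     -factors[:, 1]])                    # block 2 encodes factor 1
psi = np.array([0, 1, 2])
R = block_factor_r2(z, factors, psi, K=2)
diag, sep = diag_and_sep(R)
print(R.round(3), diag, sep)
```

A well-disentangled representation shows a diagonal score near one and a separated score near zero, since each block then predicts its own factor and nothing else.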

Key findings:

Model It∈{0,1}KI^t \in \{0,1\}^K9 Iit=1I^t_i = 10 Triplet Error
CITRIS-VAE Iit=1I^t_i = 11 0.9+ — —
CITRIS-NF Iit=1I^t_i = 12 Iit=1I^t_i = 13 Iit=1I^t_i = 14 .04
SlowVAE (base.) — — Entangled
iVAE* (base.) — — —
  • CITRIS-NF outperforms prior approaches on the Teapot dataset in both $R^2$ and triplet error, including SlowVAE (which entangles factors) and iVAE* (which fails on correlated multidimensional factors like hue and rotation).

On the Interventional Pong domain, CITRIS disentangles five intervened factors (ball position/velocity, paddle positions) and attributes the non-intervened 'score' to the nuisance block.

7. Generalization and Theoretical-Limitation Analysis

The AE+flow variant allows the autoencoder to be pretrained on heterogeneous observational sources (e.g., blending simulated and real data), with the flow then adapted using synthetic interventional data only. Empirical results demonstrate zero-shot generalization to unseen object shapes with a moderate triplet error. Performance drops slightly for position and rotation when unseen shape categories present different default axes, but minimal variables are still isolated.

Identifiability requires that for every causal factor, there exist both intervened and non-intervened instances. If two factors are always—or never—jointly intervened, they are not separable (Proposition 3.1). The approach fundamentally relies on observing the intervention targets, though not their realized values. Identifiability is defined up to blockwise invertible transforms, and unrestricted rotation/mixing within blocks is not penalized.

References:

CITRIS: Causal Identifiability from Temporal Intervened Sequences (Lippe et al., 2022)
