Slice VAE: 2D to 3D Volumetric Generation
- Slice VAE is a method that decomposes 3D volume generation into independent 2D slice modeling and joint latent space regularization, enabling scalable, high-resolution reconstruction.
- It employs a β-ELBO objective (with β=0.2) to balance reconstruction accuracy and latent regularity, ensuring diverse and anatomically plausible outputs.
- Empirical evaluations using metrics like MMD, MS-SSIM, and RAS demonstrate significant computational savings and improved anatomical consistency over traditional 3D models.
Slice VAE is a family of generative models designed to efficiently capture the distribution of volumetric data, such as 3D brain MRI, by leveraging two-dimensional Variational Autoencoders (2D VAEs) trained on image slices and modeling inter-slice dependencies in latent space. This approach yields high-resolution, coherent reconstructions of large volumetric datasets using tractable 2D architectures, and has notable computational and qualitative advantages over conventional 3D generative models (Volokitin et al., 2020).
1. Core Model Architecture
Slice VAE decomposes the modeling of 3D volumes into two stages: independent learning of slice-wise appearance and joint modeling of between-slice dependencies.
A standard 2D VAE, parameterized by an encoder $q_\phi(z \mid x)$ and a decoder $p_\theta(x \mid z)$, is trained on individual slices $x$ of size $H \times W$ pixels. The encoder network is constructed from convolutional “ResDown” blocks (residual, downsampling, doubling channels), followed by fully connected layers yielding the estimated Gaussian parameters $\mu$ and $\sigma$ for the latent representation $z$. The decoder inverts this architecture with “ResUp” residual upsampling blocks, finally outputting the reconstructed slice $\hat{x}$ via a convolution and either a sigmoid or a linear activation.
During training, each coronal slice $x_t$ is encoded as $z_t \sim q_\phi(z \mid x_t)$ and reconstructed by decoding $z_t$ with $p_\theta(x \mid z_t)$. This modular slice-level parameterization scales to higher 3D resolutions than direct 3D VAEs or GANs.
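To make the encoder's shape bookkeeping concrete, the following minimal sketch tracks feature-map shapes through a stack of ResDown-style blocks. It assumes each block halves both spatial dimensions and doubles the channel count; the function name, the 128 × 128 input size, the 16 starting channels, and the four blocks are all hypothetical values chosen for illustration, not settings taken from the paper.

```python
def resdown_shapes(height, width, channels, n_blocks):
    """Return the (channels, height, width) shape after each hypothetical ResDown block."""
    shapes = [(channels, height, width)]
    for _ in range(n_blocks):
        # Assumption: stride-2 downsampling halves height and width,
        # while the channel count doubles.
        height, width, channels = height // 2, width // 2, channels * 2
        shapes.append((channels, height, width))
    return shapes


# Illustrative values only (not taken from the paper).
shapes = resdown_shapes(height=128, width=128, channels=16, n_blocks=4)
# Size of the flattened feature vector fed to the fully connected layers.
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Under these assumptions, the flattened vector is what the fully connected layers would map to the Gaussian parameters of the latent code.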
2. Variational Objective and Regularization
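The model is optimized with a per-slice β-ELBO variational objective for the encoder $q_\phi(z \mid x)$ and decoder $p_\theta(x \mid z)$: $$\mathcal{L}_\beta(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \beta\, D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right),$$ where $p(z) = \mathcal{N}(0, I)$. To trade off reconstruction fidelity against latent regularization, $\beta = 0.2$ is used.
Explicitly, the loss to be minimized is: $$-\mathcal{L}_\beta(x) = -\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + \frac{\beta}{2} \sum_{j} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right),$$ where $\mu_j$ and $\sigma_j$ are the encoder outputs for latent coordinate $j$. Down-weighting the KL term in this way preserves diversity and sharpness in generated slices while still regularizing the latent space for tractable probabilistic modeling.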
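As a rough illustration of this objective, the sketch below evaluates a negative β-ELBO in NumPy for a batch of slices. It assumes a diagonal Gaussian posterior with a standard normal prior (so the KL term has a closed form) and uses summed squared error as the reconstruction term, which corresponds to a fixed-variance Gaussian likelihood up to constants. The function name and the choice of reconstruction term are assumptions, not details confirmed by the source.

```python
import numpy as np


def beta_elbo_loss(x, x_recon, mu, log_var, beta=0.2):
    """Negative beta-ELBO averaged over a batch of slices.

    x, x_recon:  arrays of shape (batch, H, W)
    mu, log_var: arrays of shape (batch, latent_dim)
    """
    # Reconstruction term (assumption): summed squared error per slice.
    recon = np.sum((x - x_recon) ** 2, axis=(1, 2))
    # Closed-form KL between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    return float(np.mean(recon + beta * kl))
```

With a perfect reconstruction and a posterior equal to the prior (`mu = 0`, `log_var = 0`), both terms vanish and the loss is zero.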
3. Modeling Inter-Slice Structure
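Once the VAE is trained, latent representations are extracted for each slice across all training volumes. For each latent dimension $d$, the latent trajectory across the $T$ slices, $z^{(d)} = (z^{(d)}_1, \dots, z^{(d)}_T)$, is modeled as a Gaussian: $z^{(d)} \sim \mathcal{N}(\mu_d, \Sigma_d)$. The empirical mean $\mu_d$ and covariance $\Sigma_d$ are estimated from all volumes along the stack direction.
A factor of the latent covariance is computed via SVD: for the mean-centered matrix $Z_d$ holding latent dimension $d$ of every slice (one row per training volume, one column per slice), with $Z_d = U_d S_d V_d^\top$, set $A_d$ to $V_d S_d$ scaled by the sample-count normalization so that $A_d A_d^\top = \Sigma_d$. Stacking across latent coordinates forms an overall block-diagonal covariance $\Sigma = \mathrm{diag}(\Sigma_1, \dots, \Sigma_D)$.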
This approach yields an explicit, efficiently estimated latent process capturing realistic anatomical variation and consistency across slices.
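The estimation step can be sketched in NumPy as follows. For each latent dimension, the code computes the mean trajectory over slices and a factor `A` whose product `A @ A.T` equals the sample covariance, obtained from the SVD of the centered data. The array layout, the function name, and the `n - 1` normalization are assumptions made for this sketch.

```python
import numpy as np


def fit_slice_gaussians(latents):
    """Fit one Gaussian over slice positions for each latent dimension.

    latents: array of shape (n_volumes, n_slices, latent_dim), n_volumes >= 2
    Returns:
        means   of shape (latent_dim, n_slices)
        factors of shape (latent_dim, n_slices, rank), where
                factors[d] @ factors[d].T is the sample covariance of dimension d
    """
    n_volumes, n_slices, latent_dim = latents.shape
    means = np.empty((latent_dim, n_slices))
    factors = []
    for d in range(latent_dim):
        Z = latents[:, :, d]                  # (n_volumes, n_slices)
        means[d] = Z.mean(axis=0)
        Zc = Z - means[d]                     # center each slice position
        # Thin SVD: Zc = U @ diag(s) @ Vt
        _, s, Vt = np.linalg.svd(Zc, full_matrices=False)
        # A = V diag(s) / sqrt(n - 1), so A @ A.T = Zc.T @ Zc / (n - 1)
        factors.append(Vt.T * s / np.sqrt(n_volumes - 1))
    return means, np.stack(factors)
```

Working with the factor rather than the full covariance avoids an explicit matrix square root or Cholesky decomposition at sampling time, and it remains valid when there are fewer training volumes than slices (the covariance is then rank-deficient).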
4. Sampling and Volume Generation
Generation of a new 3D volume proceeds by sequential latent sampling:
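- For each latent dimension $d$ and each slice $t$, sample $\epsilon^{(d)}_t \sim \mathcal{N}(0, 1)$ and collect $\epsilon_d = (\epsilon^{(d)}_1, \dots, \epsilon^{(d)}_T)$;
- Compute $z^{(d)} = \mu_d + A_d \epsilon_d$ for all $d$, where $\mu_d$ is the mean latent trajectory and $A_d A_d^\top = \Sigma_d$ is its covariance;
- Form the composite latent codes $z_t = (z^{(1)}_t, \dots, z^{(D)}_t)$ and decode each slice as $\hat{x}_t$ using the decoder $p_\theta(x \mid z_t)$;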
- Stack the slices to reconstruct the coherent 3D volume.
The dependency structure inherited from the inter-slice covariances $\Sigma_d$ ensures that neighboring slices are consistent and realistic, addressing the major challenge of generating plausible volumetric data.
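The sampling steps above can be sketched as follows. The function takes per-dimension mean trajectories and covariance factors, draws correlated latent trajectories, and decodes one slice per position. The `decode` callable stands in for the trained 2D decoder; it, the array layout, and the toy values are hypothetical placeholders.

```python
import numpy as np


def sample_volume(means, factors, decode, rng):
    """Draw one synthetic volume.

    means:   (latent_dim, n_slices) mean trajectory per latent dimension
    factors: (latent_dim, n_slices, rank), factors[d] @ factors[d].T = covariance
    decode:  callable mapping a latent vector (latent_dim,) to a 2D slice (H, W)
    """
    latent_dim, n_slices, rank = factors.shape
    eps = rng.standard_normal((latent_dim, rank))
    # z[d] = mean_d + A_d @ eps_d: one correlated trajectory per latent dimension
    z = means + np.einsum("dtr,dr->dt", factors, eps)
    # Column t of z is the full latent code of slice t
    slices = [decode(z[:, t]) for t in range(n_slices)]
    return np.stack(slices, axis=0)


# Toy usage with a stand-in decoder (a real model would be the trained 2D VAE decoder).
def toy_decode(z_t):
    return np.full((4, 4), z_t.sum())


rng = np.random.default_rng(0)
means = np.zeros((3, 5))
factors = np.tile(np.eye(5), (3, 1, 1))   # identity covariance for illustration
volume = sample_volume(means, factors, toy_decode, rng)
```

Because the trajectories of different latent dimensions are drawn independently, correlation between slices comes entirely from the per-dimension factors.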
5. Evaluation: Realistic Atlas Score (RAS)
The Slice VAE work introduces the Realistic Atlas Score (RAS) to quantify the anatomical realism of generated volumes. For each synthetic volume, a pretrained CNN segmenter yields a segmentation map, which is spatially registered to a canonical real atlas by affine transformation. The Dice similarity coefficient (DSC) is then computed between the warped synthetic segmentation $S$ and the ground-truth atlas segmentation $G$ for each anatomical label $l$, $$\mathrm{DSC}_l(S, G) = \frac{2\,|S_l \cap G_l|}{|S_l| + |G_l|},$$
and the RAS is the DSC averaged over all anatomical labels and test image pairings.
RAS is sensitive to both anatomical plausibility and spatial registration fidelity, providing a more meaningful assessment than pure image similarity metrics for medical applications.
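The final averaging step can be illustrated with a small NumPy sketch. It assumes that segmentation and affine registration have already been carried out, so each synthetic segmentation is an integer label map on the same grid as the atlas. The function names and the handling of labels absent from both maps (skipped via NaN) are assumptions.

```python
import numpy as np


def dice(seg_a, seg_b, label):
    """Dice similarity coefficient for one label between two label maps."""
    a = seg_a == label
    b = seg_b == label
    denom = a.sum() + b.sum()
    if denom == 0:
        return float("nan")   # label absent from both maps; skipped in the average
    return 2.0 * np.logical_and(a, b).sum() / denom


def realistic_atlas_score(warped_segs, atlas_seg, labels):
    """Mean Dice over all labels and all (already registered) synthetic segmentations."""
    scores = [dice(seg, atlas_seg, l) for seg in warped_segs for l in labels]
    return float(np.nanmean(scores))
```

A synthetic volume whose warped segmentation matches the atlas exactly scores 1.0, and the score drops as anatomical structures are missing, misplaced, or misshapen.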
6. Empirical Results and Computational Advantages
Quantitatively, at the lower of the two reported resolutions, Slice VAE achieves an MMD of 19,890 (vs. 64,446 for 3D α-WGAN) and an MS-SSIM of 0.9120 (a diversity measure; lower is more diverse), with an RAS close to the upper bound for real data. At the higher resolution, 3D α-WGAN fails (blocky artifacts, high MMD), while Slice VAE achieves a better MMD of 323,233, an MS-SSIM of 0.8768, and a higher RAS than the GAN baselines. Visual inspection reveals some blurriness (a common VAE artifact) but better preservation of global anatomical structure compared to other high-dimensional generative models.
The computational cost is substantially reduced: with only a 2D VAE and low-dimensional multivariate Gaussians for inter-slice modeling, memory and data requirements are significantly below those of 3D VAEs/GANs trained on full volumetric data. Sampling is explicit and efficient due to the simplified latent structure (Volokitin et al., 2020).
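As a back-of-the-envelope illustration of how small the inter-slice model is, the sketch below counts the free parameters of the per-dimension Gaussians (one mean vector and one symmetric covariance per latent dimension). The latent dimension of 64 and the 128 slices are hypothetical values chosen for the arithmetic, not figures from the paper.

```python
def gaussian_parameter_count(latent_dim, n_slices):
    """Free parameters of latent_dim independent Gaussians over n_slices positions."""
    # Each latent dimension has a mean vector (n_slices entries) and a symmetric
    # covariance matrix with n_slices * (n_slices + 1) / 2 free entries.
    per_dim = n_slices + n_slices * (n_slices + 1) // 2
    return latent_dim * per_dim


# Illustrative values only (not taken from the paper).
n_params = gaussian_parameter_count(latent_dim=64, n_slices=128)
n_voxels = 128 ** 3
```

Under these assumed sizes, the inter-slice model has fewer parameters than a single volume has voxels, and they are estimated in closed form rather than by gradient descent.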
In summary, Slice VAE enables scalable, high-resolution volumetric generative modeling by decoupling slice-wise representation learning from latent sequence modeling, providing an effective and practical solution for medical imaging and related domains where full 3D modeling is computationally prohibitive. The framework further generalizes to related 2D-to-3D reconstruction settings, as seen in methods such as RockGPT’s VQ-VAE module and other conditional "slice-VAE" architectures for tomographic data (Zheng et al., 2021).