
Slice VAE: 2D to 3D Volumetric Generation

Updated 19 November 2025
  • Slice VAE is a method that decomposes 3D volume generation into independent 2D slice modeling and joint latent space regularization, enabling scalable, high-resolution reconstruction.
  • It employs a β-ELBO objective (with β=0.2) to balance reconstruction accuracy and latent regularity, ensuring diverse and anatomically plausible outputs.
  • Empirical evaluations using metrics like MMD, MS-SSIM, and RAS demonstrate significant computational savings and improved anatomical consistency over traditional 3D models.

Slice VAE is a family of generative models designed to efficiently capture the distribution of volumetric data, such as 3D brain MRI, by leveraging two-dimensional Variational Autoencoders (2D VAEs) trained on image slices and modeling inter-slice dependencies in latent space. This approach yields high-resolution, coherent reconstructions of large volumetric datasets using tractable 2D architectures, and has notable computational and qualitative advantages over conventional 3D generative models (Volokitin et al., 2020).

1. Core Model Architecture

Slice VAE decomposes the modeling of 3D volumes into two stages: independent learning of slice-wise appearance and joint modeling of between-slice dependencies.

A standard 2D VAE, parameterized by encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$, is trained on individual slices of dimensions $H \times W$ (e.g., $256 \times 256$ pixels). The encoder network is constructed from convolutional “ResDown” blocks (residual, downsampling, channel-doubling), followed by fully connected layers yielding the estimated Gaussian parameters $\mu(x)$ and $\log \sigma(x)$ for the latent representation $z \in \mathbb{R}^L$. The decoder inverts this architecture with “ResUp” residual upsampling blocks, finally outputting the reconstructed slice via a $1 \times 1$ convolution and either a sigmoid or linear activation.

During training, each coronal slice $x$ is encoded as $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \operatorname{diag}\,\sigma^2(x))$ and reconstructed by decoding $p_\theta(x|z)$. This modular slice-level parameterization scales to higher 3D resolutions than direct 3D VAEs or GANs.
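The slice-level parameterization above can be sketched in a few lines of numpy, with simple linear maps standing in for the convolutional ResDown/ResUp stacks (all sizes and the names `W_enc`/`W_dec` are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 16   # toy slice size (the paper uses 256x256)
L = 8        # latent dimensionality

# Linear stand-ins for the convolutional encoder/decoder stacks.
W_enc = rng.normal(scale=0.01, size=(2 * L, H * W))  # outputs [mu; log_sigma]
W_dec = rng.normal(scale=0.01, size=(H * W, L))

def encode(x):
    """Map a slice to the Gaussian parameters mu(x), log sigma(x)."""
    h = W_enc @ x.ravel()
    return h[:L], h[L:]

def reparameterize(mu, log_sigma):
    """Draw z ~ N(mu, diag sigma^2) via the reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def decode(z):
    """Map a latent code back to a reconstructed slice."""
    return (W_dec @ z).reshape(H, W)

x = rng.standard_normal((H, W))       # one training slice
mu, log_sigma = encode(x)
x_hat = decode(reparameterize(mu, log_sigma))
```

Only the shapes and the Gaussian encode/sample/decode cycle carry over to the real model; in practice the linear maps are replaced by the residual convolutional blocks described above.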

2. Variational Objective and Regularization

The model is optimized with a per-slice β-ELBO variational objective:
$$\mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z)),$$
where $p(z) = \mathcal{N}(0, I)$. To trade off reconstruction fidelity against latent regularization, $\beta = 0.2$ is used.

Explicitly, the loss is:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \mathcal{N}(x; \mathrm{decoder}_\theta(z), \sigma^2 I)\right] - 0.2\, D_{\mathrm{KL}}(\mathcal{N}(z; \mu(x), \sigma^2(x))\,\|\,\mathcal{N}(0, I)).$$
This preserves diversity and sharpness in generated slices while regularizing the latent space for tractable probabilistic modeling.
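As a sanity check on the objective, the following numpy sketch evaluates the β-ELBO for one slice, using the Gaussian reconstruction term (up to an additive constant) and the closed-form KL divergence between a diagonal Gaussian and the standard normal prior:

```python
import numpy as np

def beta_elbo(x, x_hat, mu, log_sigma, beta=0.2, recon_var=1.0):
    """Per-slice beta-ELBO (up to an additive constant in the likelihood)."""
    # Gaussian log-likelihood of x under N(x_hat, recon_var * I), constants dropped.
    recon = -0.5 * np.sum((x - x_hat) ** 2) / recon_var
    # Closed-form KL( N(mu, diag sigma^2) || N(0, I) ).
    sigma2 = np.exp(2.0 * log_sigma)
    kl = 0.5 * np.sum(sigma2 + mu ** 2 - 1.0 - 2.0 * log_sigma)
    return recon - beta * kl

# A perfect reconstruction with a posterior equal to the prior scores 0.
x = np.zeros(4)
elbo = beta_elbo(x, x, mu=np.zeros(2), log_sigma=np.zeros(2))
```

With `mu = 0` and `log_sigma = 0` the KL term vanishes, so the objective reduces to the (here zero) reconstruction term, matching the formula above.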

3. Modeling Inter-Slice Structure

After VAE training, latent representations $y(t)$ are extracted for each slice $t$ across all training volumes. For each latent dimension $\ell = 1, \dots, L$, the latent trajectory $y_\ell = [y_\ell(1), \ldots, y_\ell(T)]$ is modeled as a Gaussian: $p(y_\ell) = \mathcal{N}(y_\ell \mid \mu_\ell, \Sigma_\ell)$. The empirical mean $\mu_\ell$ and covariance $\Sigma_\ell$ are estimated from all volumes along the stacking direction.

The latent covariance is computed via SVD: for the centered slice–latent matrix $Y_\ell = U_\ell S_\ell V_\ell^*$ (slices along rows, $N$ training volumes along columns), set $W_\ell = U_\ell S_\ell / \sqrt{N}$ so that $\Sigma_\ell = W_\ell W_\ell^\top$ equals the empirical covariance $Y_\ell Y_\ell^\top / N$. Stacking across latent coordinates forms an overall block-diagonal covariance $\Sigma_y$.
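A minimal numpy sketch of this estimation step, for a single latent coordinate on toy data (the trajectory shapes and smoothing are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 32, 40  # T slices per volume, N training volumes

# Trajectories of one latent coordinate: column j holds y_l(1..T) for volume j.
# A cumulative sum along the slice axis gives smooth-ish toy trajectories.
Y = rng.standard_normal((T, N)).cumsum(axis=0)

mu = Y.mean(axis=1, keepdims=True)  # empirical mean trajectory mu_l
Yc = Y - mu                         # center before estimating covariance

# Thin SVD of the centered matrix; scaling the left factor by S / sqrt(N)
# yields W with W W^T equal to the empirical covariance Yc Yc^T / N.
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
W = (U * s) / np.sqrt(N)
Sigma = W @ W.T
```

Truncating `s` to its leading entries before forming `W` would give the corresponding low-rank approximation of $\Sigma_\ell$, which is what makes sampling with these factors cheap.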

This approach yields an explicit, efficiently estimated latent process capturing realistic anatomical variation and consistency across slices.

4. Sampling and Volume Generation

Generation of a new 3D volume proceeds by sequential latent sampling:

  1. For each latent dimension $\ell$ and each slice $t$, sample $z_\ell(t) \sim \mathcal{N}(0, 1)$;
  2. Compute $y_\ell(t) = W_\ell[:, t]^\top z_\ell + \mu_\ell(t)$ for all $t$;
  3. Form the composite latent codes $y(t) = [y_1(t), \ldots, y_L(t)]$ and decode each slice as $\hat{x}(t) = \mathrm{decoder}_\theta(y(t))$;
  4. Stack the slices $\{\hat{x}(1), \ldots, \hat{x}(T)\}$ into a coherent 3D volume.

The dependency structure inherited from $\mu_\ell$ and $\Sigma_\ell$ ensures that neighboring slices are consistent and realistic, addressing the major challenge of generating plausible volumetric data.
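The four steps above can be sketched end-to-end in numpy. Here the per-latent factors `mus`/`Ws` and the linear `W_dec` decoder are random placeholders for the quantities estimated earlier; only the sampling logic reflects the method:

```python
import numpy as np

rng = np.random.default_rng(2)
T, L = 32, 8         # slices per volume, latent dimensionality
H = W_img = 16       # toy slice size
rank = 5             # rank of each low-rank covariance factor

# Assumed precomputed: mean trajectories mu_l (T,) and factors W_l (T, rank)
# with Sigma_l = W_l W_l^T, plus a toy linear decoder standing in for the VAE.
mus = rng.standard_normal((L, T))
Ws = rng.normal(scale=0.3, size=(L, T, rank))
W_dec = rng.normal(scale=0.01, size=(H * W_img, L))

def sample_volume():
    # Steps 1-2: draw z_l ~ N(0, I) and map it through the factor, so that
    # the trajectory y_l has mean mu_l and covariance W_l W_l^T.
    y = np.empty((T, L))
    for l in range(L):
        z = rng.standard_normal(rank)
        y[:, l] = Ws[l] @ z + mus[l]          # y_l(t) for all slices t
    # Steps 3-4: decode each composite code y(t) and stack into a volume.
    return np.stack([(W_dec @ y[t]).reshape(H, W_img) for t in range(T)])

vol = sample_volume()
```

Because each $y_\ell$ is drawn jointly across all $T$ slices, adjacent decoded slices share correlated latent codes, which is what produces the inter-slice coherence described above.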

5. Evaluation: Realistic Atlas Score (RAS)

Slice VAE introduces the Realistic Atlas Score (RAS) to quantify anatomical realism of generated volumes. For each synthetic volume, a pretrained CNN segmenter yields a segmentation map, which is spatially registered to a canonical real atlas by affine transformation. The Dice similarity coefficient (DSC) is then computed between the warped synthetic segmentation and ground truth atlas segmentation,

$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|},$$

and the RAS is the DSC averaged over all anatomical labels and test image pairings.
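The DSC and the label-averaged RAS are straightforward to compute on segmentation masks; a minimal numpy sketch (the function names and the tiny example masks are illustrative, and the affine registration step is assumed to have been applied already):

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two boolean masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

def realistic_atlas_score(seg, atlas, labels):
    """Average per-label DSC between a registered segmentation and the atlas."""
    return float(np.mean([dice(seg == l, atlas == l) for l in labels]))

# Toy 2-label example on a 2x3 grid.
seg   = np.array([[1, 1, 0], [2, 2, 0]])
atlas = np.array([[1, 0, 0], [2, 2, 0]])
ras = realistic_atlas_score(seg, atlas, labels=[1, 2])
# label 1: 2*1/(2+1) = 2/3 ; label 2: 2*2/(2+2) = 1 ; mean = 5/6
```

In the full evaluation this per-volume score is additionally averaged over all test image pairings, as described above.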

RAS is sensitive to both anatomical plausibility and spatial registration fidelity, providing a more meaningful assessment than pure image similarity metrics for medical applications.

6. Empirical Results and Computational Advantages

Quantitatively, at $128^3$ resolution, Slice VAE achieves MMD $= 19{,}890$ (vs. $64{,}446$ for the 3D $\alpha$-WGAN) and MS-SSIM $= 0.9120$ (a diversity measure; lower is more diverse), with RAS close to the upper bound attained by real data. At $256^3$, the 3D $\alpha$-WGAN fails (blocky artifacts, high MMD), while Slice VAE achieves an improved MMD $= 323{,}233$, MS-SSIM $= 0.8768$, and higher RAS than GAN baselines. Visual inspection reveals some blurriness (a common VAE artifact) but better preservation of global anatomical structure compared to other high-dimensional generative models.

The computational cost is substantially reduced: with only a 2D VAE and low-dimensional multivariate Gaussians for inter-slice modeling, memory and data requirements are significantly below those of 3D VAEs/GANs trained on full volumetric data. Sampling is explicit and efficient due to the simplified latent structure (Volokitin et al., 2020).


In summary, Slice VAE enables scalable, high-resolution volumetric generative modeling by decoupling slice-wise representation learning from latent sequence modeling, providing an effective and practical solution for medical imaging and related domains where full 3D modeling is computationally prohibitive. The framework further generalizes to related 2D-to-3D reconstruction settings, as seen in methods such as RockGPT’s VQ-VAE module and other conditional "slice-VAE" architectures for tomographic data (Zheng et al., 2021).
