Multi-Head cVAE for Disentangled Generation

Updated 1 July 2026

Multi-Head cVAE is a generative model that decomposes the latent space into label-relevant and label-irrelevant codes for precise control.
The architecture uses SPADE for spatial (label-specific) information and AdaIN for unsupervised style modulation in image generation.
Quantitative metrics on the 3D-Chair and FaceScrub datasets indicate effective disentanglement, which improves controllability in tasks such as identity swapping.

A Multi-Head Conditional Variational Autoencoder (cVAE) is a generative model architecture designed to decompose the latent representation into orthogonal components: a label-relevant code capturing structured, controllable information, and a label-irrelevant code capturing complementary, unsupervised variation. In the context of "Disentangling the Spatial Structure and Style in Conditional VAE" (Zhang et al., 2019), this is achieved via a dual-headed latent space where one head encodes spatial structure or style associated with class labels, and the other head encodes class-independent factors. Each head is injected into the decoder with dedicated adaptive normalization mechanisms, enabling effective disentanglement of spatial structure and style in image generation.

1. Model Architecture

The multi-head cVAE consists of three major modules: a label-condition mapping network $f(\cdot)$ generating a label-relevant code $z_s$, an encoder producing a label-irrelevant latent code $z_u$, and a decoder conditioned on both $z_s$ and $z_u$ at every upsampling layer.

Label-Condition Mapping ($f(\cdot) \rightarrow z_s$):
- Input: a one-hot label $c \in \{0,1\}^{N_c}$.
- Architecture: a multi-layer perceptron (3–4 fully-connected layers, width ≈ 512).
- Output: embedding $z_s = f(c)$, shaped either as a spatial map $(H/k) \times (W/k) \times 1$ (if $c$ carries spatial information, e.g., pose) or as a vector (for categorical labels). The downsampling factor $k$ and the code size are chosen for a given image resolution, which fixes the shape of $z_s$.
Encoder (producing $z_u$):
- Input: image $x$ (concatenated with label maps if necessary).
- Architecture: strided convolutional blocks downsampling to either a vector (style posterior) or a spatial map (structure posterior).
- Latent outputs: mean $\mu$ and standard deviation $\sigma$ parameterizing the approximate posterior over $z_u$.
Decoder:
- Begins with a learned constant input.
- Each upsampling block receives both $z_s$ (via SPADE) and $z_u$ (via AdaIN) to modulate activations.
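The label-condition mapping can be sketched with NumPy as a small MLP whose output is reshaped into either a spatial map or a vector. This is only an illustration under stated assumptions: the label count, image size, downsampling factor, ReLU activations, and random weights are hypothetical, and the function name `label_mapping` does not come from the paper.

```python
import numpy as np

def label_mapping(c, weights, biases, out_shape):
    """Map a one-hot label c to the label-relevant code z_s with a small MLP."""
    h = c.astype(np.float64)
    for i, (w_mat, b) in enumerate(zip(weights, biases)):
        h = h @ w_mat + b
        if i < len(weights) - 1:       # ReLU on hidden layers only (assumed)
            h = np.maximum(h, 0.0)
    return h.reshape(out_shape)

rng = np.random.default_rng(0)
N_c, width = 10, 512           # hypothetical label count; width as in the text
H, W_img, k = 64, 64, 4        # hypothetical image size and downsampling factor
out_dim = (H // k) * (W_img // k)

dims = [N_c, width, width, out_dim]    # three fully-connected layers
weights = [rng.normal(0.0, 0.02, (a, b)) for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]

c = np.eye(N_c)[3]                     # one-hot label for class 3
z_s_map = label_mapping(c, weights, biases, (H // k, W_img // k, 1))  # spatial code
z_s_vec = label_mapping(c, weights, biases, (out_dim,))               # vector code
print(z_s_map.shape, z_s_vec.shape)    # (16, 16, 1) (256,)
```

The same MLP output serves both cases; only the final reshape decides whether the decoder treats $z_s$ as a spatial map or as a flat vector.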

2. Probabilistic Framework

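For an image $x$ with label $c$, let $z_s$ denote the deterministic, label-relevant code (the label "head"), and $z_u$ denote the stochastic, label-irrelevant code (the uncorrelated "head").

Priors:
- $p(z_u) = \mathcal{N}(0, I)$ (isotropic Gaussian).
- $p(z_s \mid c) = \delta(z_s - f(c))$ (deterministic).
Posteriors:
- $q(z_u \mid x, c) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, with $\mu$ and $\sigma$ output by the encoder
- $q(z_s \mid c) = \delta(z_s - f(c))$
Sampling:
- $\epsilon \sim \mathcal{N}(0, I)$, $z_u = \mu + \sigma \odot \epsilon$ (reparameterization)
- $z_s = f(c)$ (deterministic)
ELBO Objective:

$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(z_u \mid x, c)}\left[\log p(x \mid z_u, z_s)\right] - \mathrm{KL}\left(q(z_u \mid x, c) \,\|\, p(z_u)\right) - \mathrm{KL}\left(q(z_s \mid c) \,\|\, p(z_s \mid c)\right)$

Given $q(z_s \mid c) = p(z_s \mid c)$, the last term vanishes. The likelihood is implemented as an $\ell_1$ or $\ell_2$ image reconstruction loss.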
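The reparameterized sampling and the remaining KL term for the label-irrelevant code can be sketched in plain Python. This assumes a diagonal Gaussian posterior and a standard normal prior, for which the KL divergence has a well-known closed form; the function names and the unweighted sum of the two terms are illustrative choices, not details from the paper.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Draw z_u = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def negative_elbo(x, x_hat, mu, sigma):
    """Squared-error reconstruction plus the KL term on z_u.

    The KL term on z_s is omitted because it is zero for a deterministic code.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + kl_to_standard_normal(mu, sigma)

rng = random.Random(0)
mu, sigma = [0.5, -0.2], [1.0, 0.8]    # hypothetical encoder outputs
z_u = reparameterize(mu, sigma, rng)   # one stochastic sample of the code
loss = negative_elbo([1.0, 0.0], [0.9, 0.1], mu, sigma)
```

Because the noise enters only through `eps`, the sample stays differentiable with respect to `mu` and `sigma` when the same expression is written in an autodiff framework.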

Adversarial Learning:
- Uses a cGAN-style hinge loss to sharpen outputs. A discriminator $D$ distinguishes between real and generated data, including generated cases with permuted labels or randomly sampled $z_u$.
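A hinge-style adversarial objective is commonly written as below. This is the standard GAN hinge formulation, given as a sketch; the text here does not spell out the exact form used in the paper, and the function names are hypothetical. The scores are raw discriminator outputs, and `fake_scores` would cover generated images, including those produced with permuted labels or a randomly sampled label-irrelevant code.

```python
def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def g_hinge_loss(fake_scores):
    """Generator hinge loss: raise the discriminator's scores on generated images."""
    return -sum(fake_scores) / len(fake_scores)

# Confident, correct discriminator: both hinge terms are inactive.
print(d_hinge_loss([2.0, 1.5], [-2.0, -1.0]))   # 0.0
# Undecided discriminator: each term contributes a margin of 1.
print(d_hinge_loss([0.0], [0.0]))               # 2.0
```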

3. Adaptive Normalization in Decoding

At each decoder layer, the spatial and style codes modulate the activations via two normalization modules:

SPADE (label-relevant $z_s$):
- Produces $\gamma_s$, $\beta_s$ with spatial dimensions, via a small convolutional network upsampling $z_s$ to the spatial resolution of the layer.
AdaIN (label-irrelevant $z_u$):
- Produces channel-wise $\gamma_u$, $\beta_u$ using an MLP applied to $z_u$ and broadcast spatially.

Given pre-activation $h$, normalization proceeds:

$h_s = \gamma_s \odot \frac{h - \operatorname{mean}(h)}{\operatorname{std}(h)} + \beta_s, \qquad h_u = \gamma_u \odot \frac{h - \operatorname{mean}(h)}{\operatorname{std}(h)} + \beta_u$

where $\operatorname{mean}(h)$ and $\operatorname{std}(h)$ are the normalization statistics of $h$. The output features $h_s$ and $h_u$ are concatenated along the channel dimension and projected via a convolution to restore channel size.
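The dual modulation can be sketched in NumPy for a single feature map. Several details are assumptions made for illustration: the modulation parameters are passed in directly rather than produced by a convolutional network or MLP, both branches use per-channel statistics of one sample, and the channel projection is written as a per-pixel linear map (equivalent to a 1×1 convolution). All function names are hypothetical.

```python
import numpy as np

def normalize(h, eps=1e-5):
    """Normalize each channel of h (C, H, W) to zero mean and unit variance."""
    mean = h.mean(axis=(1, 2), keepdims=True)
    std = h.std(axis=(1, 2), keepdims=True)
    return (h - mean) / (std + eps)

def spade(h, gamma_map, beta_map):
    """Spatially varying modulation: gamma and beta have shape (C, H, W)."""
    return gamma_map * normalize(h) + beta_map

def adain(h, gamma_vec, beta_vec):
    """Channel-wise modulation: gamma and beta have shape (C,), broadcast over H, W."""
    return gamma_vec[:, None, None] * normalize(h) + beta_vec[:, None, None]

def dual_norm_block(h, gamma_s, beta_s, gamma_u, beta_u, w_proj):
    """Concatenate the SPADE and AdaIN outputs, then project 2C -> C channels."""
    both = np.concatenate([spade(h, gamma_s, beta_s),
                           adain(h, gamma_u, beta_u)], axis=0)   # (2C, H, W)
    return np.einsum("oc,chw->ohw", w_proj, both)                # (C, H, W)

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
h = rng.normal(size=(C, H, W))
out = dual_norm_block(
    h,
    gamma_s=rng.normal(size=(C, H, W)), beta_s=rng.normal(size=(C, H, W)),
    gamma_u=rng.normal(size=(C,)), beta_u=rng.normal(size=(C,)),
    w_proj=rng.normal(size=(C, 2 * C)),
)
print(out.shape)  # (4, 8, 8)
```

The only structural difference between the two branches is the shape of the modulation parameters: a full map per channel for SPADE, one scalar per channel for AdaIN.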

4. Implementation Configurations and Ablations

Key practical choices and architectural variants:

Variant	$z_s$ injection	$z_u$ injection
S1	AdaIN	concat-input
S2	SPADE	concat-input
S3	AdaIN	AdaIN
S4	AdaIN	SPADE
Proposed	SPADE	AdaIN

Both $z_s$ and $z_u$ have 256 dimensions: in the structure-code case, $z_s$ is a single-channel spatial map and $z_u$ is a vector, and vice versa in the style-code case.
Encoder and decoder are built from convolutional blocks whose channel widths change stage by stage.
Default datasets: 3D-Chair and FaceScrub.
Optimizer: Adam; the paper does not fix the learning rate or batch size, so typical settings are used.

5. Quantitative and Qualitative Performance

Performance of the proposed disentangling design is demonstrated via experiments on the 3D-Chair dataset (where the label is the azimuth/viewpoint) and the FaceScrub dataset (where the label is identity). The reported metrics are:

3D-Chair:
- Mutual information (lower is better; lower values indicate improved disentanglement).
- ResNet-50 classification accuracy at the target azimuth.
FaceScrub:
- Identity-classification accuracy.
- Fréchet Inception Distance (FID).

Qualitative results show that reconstructed or generated samples can be steered to a target identity or viewpoint while preserving complementary factors such as style, pose, or expression.

6. Significance and Context

This design cleanly separates label-associated (structured) factors from unsupervised (residual) variation. By employing SPADE and AdaIN at every decoder stage, fed by the label-relevant and label-irrelevant codes respectively, conditional generation becomes modular, controllable, and suited for tasks demanding disentanglement. The approach enables, for example, faithful identity swapping in faces or view manipulation in 3D objects, where label information may or may not carry spatial meaning. This separation of signal pathways, together with the adversarial loss that sharpens outputs, is reported to outperform simpler approaches in both disentanglement metrics and visual fidelity (Zhang et al., 2019).

References

Zhang et al. (2019). Disentangling the Spatial Structure and Style in Conditional VAE.
