
Multi-Head cVAE for Disentangled Generation

Updated 1 July 2026
  • Multi-Head cVAE is a generative model that decomposes the latent space into label-relevant and label-irrelevant codes for precise control.
  • The architecture uses SPADE for spatial (label-specific) information and AdaIN for unsupervised style modulation in image generation.
  • It achieves effective disentanglement, demonstrated by quantitative metrics on datasets like 3D-Chair and FaceScrub, enhancing controllability in tasks such as identity swapping.

A Multi-Head Conditional Variational Autoencoder (cVAE) is a generative model architecture designed to decompose the latent representation into orthogonal components: a label-relevant code capturing structured, controllable information, and a label-irrelevant code capturing complementary, unsupervised variation. In the context of "Disentangling the Spatial Structure and Style in Conditional VAE" (Zhang et al., 2019), this is achieved via a dual-headed latent space where one head encodes spatial structure or style associated with class labels, and the other head encodes class-independent factors. Each head is injected into the decoder with dedicated adaptive normalization mechanisms, enabling effective disentanglement of spatial structure and style in image generation.

1. Model Architecture

The multi-head cVAE consists of three major modules: a label-condition mapping network $f(\cdot)$ generating a label-relevant code $z_s$, an encoder producing a label-irrelevant latent code $z_u$, and a decoder conditioned on both $z_s$ and $z_u$ at every upsampling layer.

  • Label-Condition Mapping ($f(\cdot) \rightarrow z_s$):
    • Input: a one-hot label $c \in \{0,1\}^{N_c}$.
    • Architecture: a multi-layer perceptron (3–4 fully-connected layers, width ≈ 512).
    • Output: embedding $z_s = f(c)$, shaped either as a spatial map $(H/k) \times (W/k) \times 1$ (if $c$ carries spatial information, e.g., pose) or as a vector (for categorical labels). The downsampling factor $k$ and the image size $H \times W$ together fix the shape of $z_s$.
  • Encoder (producing $z_u$):
    • Input: image $x$ (concatenated with label maps if necessary).
    • Architecture: strided convolutional blocks downsampling to either a vector (style posterior) or a spatial map (structure posterior).
    • Latent outputs: mean $\mu$ and standard deviation $\sigma$ parameterizing the Gaussian posterior over $z_u$.
  • Decoder:
    • Begins with a learned constant input.
    • Each upsampling block receives both $z_s$ (via SPADE) and $z_u$ (via AdaIN) to modulate activations.
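The label-condition mapping described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the class count, the ReLU activation, the weight initialization, and the `(16, 16, 1)` spatial shape are all assumptions chosen so that the code has 256 entries, and the helper names are hypothetical.

```python
import numpy as np

def one_hot(label, num_classes):
    """Return a one-hot vector c in {0,1}^{N_c}."""
    c = np.zeros(num_classes, dtype=np.float64)
    c[label] = 1.0
    return c

def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for a fully connected network."""
    return [(rng.normal(0.0, 0.02, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mapping_network(c, params, spatial_shape=None):
    """Map a one-hot label to z_s = f(c); optionally reshape to a spatial map."""
    h = c
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers (assumed activation)
    if spatial_shape is not None:
        h = h.reshape(spatial_shape)
    return h

rng = np.random.default_rng(0)
num_classes = 10                                        # hypothetical N_c
params = init_mlp([num_classes, 512, 512, 256], rng)    # 3 FC layers, hidden width 512
c = one_hot(3, num_classes)
z_s_vec = mapping_network(c, params)                    # vector-shaped code
z_s_map = mapping_network(c, params, (16, 16, 1))       # spatial-map-shaped code (assumed shape)
```

Because $f$ is deterministic, the same label always yields the same code; only the reshape differs between the vector and spatial-map cases.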

2. Probabilistic Framework

Let $z_s$ denote the deterministic, label-relevant code (the label "head"), and $z_u$ denote the stochastic, label-irrelevant code (the uncorrelated "head").

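  • Priors:
    • $p(z_u) = \mathcal{N}(0, I)$ (isotropic Gaussian).
    • $p(z_s \mid c) = \delta(z_s - f(c))$ (deterministic).
  • Posteriors:
    • $q(z_u \mid x, c) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$
    • $q(z_s \mid c) = \delta(z_s - f(c))$
  • Sampling:
    • $\epsilon \sim \mathcal{N}(0, I)$, $z_u = \mu + \sigma \odot \epsilon$ (reparameterization)
    • $z_s = f(c)$ (deterministic)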
  • ELBO Objective:

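$\mathcal{L} = \mathbb{E}_{q(z_u \mid x, c)}\left[\log p(x \mid z_u, z_s)\right] - D_{\mathrm{KL}}\left(q(z_u \mid x, c) \,\|\, p(z_u)\right) - D_{\mathrm{KL}}\left(q(z_s \mid c) \,\|\, p(z_s \mid c)\right)$

Given $q(z_s \mid c) = p(z_s \mid c)$ (both place all mass at $f(c)$), the last term vanishes. The likelihood is implemented as an $L_1$ or $L_2$ image reconstruction loss.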
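The stochastic head can be illustrated with a small NumPy sketch of the reparameterization step and the closed-form KL divergence between a diagonal Gaussian and the standard normal prior. The function names are hypothetical, the $L_1$ reconstruction term is one of the two options mentioned above, and the equal weighting of the two terms is an assumption rather than a setting taken from the paper.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z_u = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

def negative_elbo(x, x_hat, mu, sigma):
    """L1 reconstruction term plus the KL term on z_u (equal weighting assumed)."""
    recon = np.sum(np.abs(x - x_hat))
    return recon + kl_to_standard_normal(mu, sigma)

rng = np.random.default_rng(0)
mu = np.zeros(4)
sigma = np.ones(4)
z_u = reparameterize(mu, sigma, rng)
kl_zero = kl_to_standard_normal(mu, sigma)               # posterior equals the prior
kl_shifted = kl_to_standard_normal(np.ones(4), sigma)    # mean shifted by 1 in each dimension
loss = negative_elbo(np.ones((2, 2)), np.zeros((2, 2)), mu, sigma)
```

No KL term appears for $z_s$ in the sketch, since that code is deterministic and its term in the objective is zero.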

  • Adversarial Learning:
    • Uses a cGAN-style hinge loss to sharpen outputs. A discriminator $D$ distinguishes between real and generated data, including cases with permuted labels or random $z_u$.
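For reference, the widely used hinge formulation of the GAN objective can be sketched as below. This is the standard form, shown under the assumption that the paper's loss follows it; the exact variant, the conditioning of the discriminator on labels, and the function names are not taken from the source.

```python
import numpy as np

def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    real_term = np.mean(np.maximum(0.0, 1.0 - real_scores))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_scores))
    return real_term + fake_term

def g_hinge_loss(fake_scores):
    """Generator hinge loss: raise the discriminator's score on generated samples."""
    return -np.mean(fake_scores)

# Hypothetical discriminator outputs for two real and two generated samples.
real_scores = np.array([2.0, 0.5])
fake_scores = np.array([-2.0, 0.0])
d_loss = d_hinge_loss(real_scores, fake_scores)
g_loss = g_hinge_loss(np.array([1.0, 3.0]))
```

In a conditional setting, samples generated with permuted labels or randomly drawn $z_u$ would simply be scored as additional entries of `fake_scores`.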

3. Adaptive Normalization in Decoding

At each decoder layer, the spatial and style codes modulate the activations via two normalization modules:

  • SPADE (label-relevant $z_s$):
    • Produces a scale $\gamma_s$ and shift $\beta_s$ with spatial dimensions, via a small convolutional network that upsamples $z_s$ to the resolution of the layer's feature map.
  • AdaIN (label-irrelevant $z_u$):
    • Produces a channel-wise scale $\gamma_u$ and shift $\beta_u$ using an MLP applied to $z_u$; these are broadcast spatially.

Given pre-activation $h$ with normalized form $\hat{h}$, normalization proceeds:

$h_s = \gamma_s \odot \hat{h} + \beta_s, \qquad h_u = \gamma_u \odot \hat{h} + \beta_u$

The output features $h_s$ and $h_u$ are concatenated along the channel dimension and projected via a convolution to restore the channel size.
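The two modulation paths and their fusion can be sketched in NumPy as follows. Several simplifications are assumed: both paths share a per-channel instance normalization, the scale and shift tensors are passed in directly (in the model they would come from a small convolutional network for SPADE and an MLP for AdaIN), and the fusing projection is taken to be a 1x1 convolution. All function names are hypothetical.

```python
import numpy as np

def instance_norm(h, eps=1e-5):
    """Normalize each channel of h (C, H, W) over its spatial positions."""
    mean = h.mean(axis=(1, 2), keepdims=True)
    std = h.std(axis=(1, 2), keepdims=True)
    return (h - mean) / (std + eps)

def spade(h, gamma_map, beta_map):
    """Spatially varying modulation: gamma_map and beta_map have shape (C, H, W)."""
    return gamma_map * instance_norm(h) + beta_map

def adain(h, gamma_vec, beta_vec):
    """Channel-wise modulation: gamma_vec and beta_vec have shape (C,)."""
    return gamma_vec[:, None, None] * instance_norm(h) + beta_vec[:, None, None]

def fuse(h_s, h_u, w):
    """Concatenate along channels, then apply a 1x1 convolution with weights w (C, 2C)."""
    cat = np.concatenate([h_s, h_u], axis=0)     # (2C, H, W)
    return np.einsum("oc,chw->ohw", w, cat)      # (C, H, W)

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
h = rng.standard_normal((C, H, W))
# Random stand-ins for the modulation parameters predicted from z_s and z_u.
gamma_s, beta_s = rng.standard_normal((C, H, W)), rng.standard_normal((C, H, W))
gamma_u, beta_u = rng.standard_normal(C), rng.standard_normal(C)
w = 0.1 * rng.standard_normal((C, 2 * C))
out = fuse(spade(h, gamma_s, beta_s), adain(h, gamma_u, beta_u), w)
```

The difference between the paths is only in the shape of the modulation parameters: SPADE can change the activation differently at each spatial position, while AdaIN applies one scale and one shift per channel.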

4. Implementation Configurations and Ablations

Key practical choices and architectural variants:

Variant     $z_s$ injection    $z_u$ injection
S1          AdaIN              concat-input
S2          SPADE              concat-input
S3          AdaIN              AdaIN
S4          AdaIN              SPADE
Proposed    SPADE              AdaIN

  • Both $z_s$ and $z_u$ have dimension 256. In the structure-code case, $z_s$ is a spatial map and $z_u$ is a vector; in the style-code case the shapes are swapped.
  • Encoder and decoder convolutional blocks follow the channel progression […].
  • Default datasets: 3D-Chair and FaceScrub.
  • Optimizer: Adam. The learning rate and batch size are not fixed in the paper; typical settings are used.

5. Quantitative and Qualitative Performance

Performance of the proposed disentangling design is demonstrated via experiments on 3D-Chair (label captures azimuth/viewpoint) and FaceScrub datasets (label as identity):

  • 3D-Chair:
    • Mutual information (lower is better; indicates improved disentanglement).
    • ResNet-50 classification accuracy at the target azimuth.
  • FaceScrub:

Qualitative results show that reconstructed or generated samples can enforce a target identity or viewpoint while preserving complementary factors such as style, pose, or expression.

6. Significance and Context

This design cleanly separates label-associated (structured) factors from unsupervised (residual) variation. By employing SPADE and AdaIN at every decoder stage, fed by the label-relevant and label-irrelevant codes respectively, cVAE generation becomes modular, controllable, and suited to tasks that demand disentanglement. The approach enables, for example, faithful identity swapping in faces or view manipulation in 3D objects, where label information may or may not carry spatial meaning. This separation of signal pathways, together with the adversarial sharpness constraint, is reported to outperform simpler approaches on both disentanglement metrics and visual fidelity (Zhang et al., 2019).
