Optimal latent distribution choice for two-stage visual generative modeling

Determine which structures for the encoder's aggregate posterior are optimal in two-stage visual generative pipelines, in which images are first compressed into latent codes and a prior over those latents is then fit with diffusion or autoregressive models.

Background

Modern visual generative models typically follow a two-stage pipeline that first encodes images into a latent representation and then models the prior over these latents using diffusion or autoregressive techniques. The effectiveness of this pipeline depends critically on the properties of the latent distribution, which must balance modeling simplicity and reconstruction fidelity.
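The sketch below illustrates this two-stage structure. The `encoder`, `decoder`, and `prior_model` modules are hypothetical placeholders introduced here for illustration; this is a minimal sketch of the pipeline shape, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def stage_one_step(encoder, decoder, images):
    """Stage 1: train a compressive tokenizer (e.g., a VAE) on images."""
    z = encoder(images)               # images -> latent codes
    recon = decoder(z)                # latent codes -> reconstructed images
    return F.mse_loss(recon, images)  # reconstruction-fidelity term

def stage_two_step(encoder, prior_model, images):
    """Stage 2: freeze the tokenizer and fit a generative prior over its latents."""
    with torch.no_grad():             # the tokenizer is fixed in stage 2
        z = encoder(images)
    # `prior_model.loss` stands in for a diffusion denoising loss or an
    # autoregressive negative log-likelihood over the latent codes.
    return prior_model.loss(z)
```

The split makes the trade-off concrete: stage 1 controls reconstruction fidelity, while the shape of the latent distribution it produces determines how easy stage 2's prior is to fit.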

Existing tokenizers such as VAEs and encoders aligned to foundation models constrain the latent space without explicitly shaping its overall distribution, leaving it uncertain which latent distribution forms are best suited for modeling. The paper introduces the Distribution Matching VAE (DMVAE) to explicitly align the encoder's aggregate posterior with various reference distributions (e.g., Gaussian, SSL features, text embeddings), enabling a systematic investigation. Although the authors present empirical evidence favoring SSL-derived distributions, they explicitly state that the general question of the optimal latent distribution remains open.
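One standard way to align an aggregate posterior with a reference distribution is a sample-based divergence penalty such as maximum mean discrepancy (MMD), as used in WAE-MMD-style objectives. The sketch below uses an RBF-kernel MMD with a hypothetical `encoder` and `reference_samples`; it illustrates the general distribution-matching technique under that assumption, not DMVAE's exact objective.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased MMD^2 estimate with an RBF kernel between batches x and y."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def distribution_matching_penalty(encoder, images, reference_samples, lam=1.0):
    """Push the encoder's aggregate posterior toward a reference distribution.

    `reference_samples` may be drawn from Gaussian noise, SSL features, or
    text embeddings, mirroring the reference choices discussed above.
    """
    z = encoder(images)  # a batch of latents ~ aggregate posterior
    return lam * rbf_mmd2(z, reference_samples)
```

In practice such a penalty would be added to the stage-1 reconstruction loss, so the tokenizer trades off fidelity against how closely its latent distribution matches the chosen reference.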

References

Yet, existing approaches such as VAEs and foundation model aligned encoders implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling.

Ye et al., "Distribution Matching Variational AutoEncoder" (arXiv:2512.07778, 8 Dec 2025), Abstract (page 1)