
MM-VAMP VAE: Generative Models for Multimodal Data

Updated 24 November 2025
  • The paper introduces MM-VAMP VAE, extending the VAE framework by incorporating a soft, data-dependent mixture-of-experts prior to handle multimodal and conditional mappings.
  • The methodology uses modality-specific encoders/decoders and a unified or MDN-based mixture prior, with a specialized ELBO that balances shared and modality-specific latent features.
  • Empirical results show improved reconstruction accuracy and latent coherence across diverse datasets, notably achieving lower error rates in human-robot interactions compared to baseline models.

The Multimodal Variational Mixture-of-Experts Variational Autoencoder (MM-VAMP VAE) is a class of latent variable generative models that extends the standard VAE formalism to two settings: multimodal data with multiple distinct observation channels, such as image and text (Sutter et al., 8 Mar 2024), and conditionally multimodal mappings, such as human-to-robot interaction, where a single human action can induce several plausible robot reactions (Prasad et al., 10 Jul 2024). MM-VAMP VAEs use a learned, data-dependent mixture-of-experts prior over the latent variables. This prior is informed by observations and is built either as a mixture of (posterior) “experts” or as a mixture density network (MDN) fed by designated input streams. The framework thus enables soft sharing of information across modalities or agents and principled handling of multimodal latent spaces.

1. Model Formulation and Generative Structure

There are two dominant lines in MM-VAMP VAE research: (1) multimodal data modeling, where each modality has an encoder and decoder, and the latent prior softly aggregates unimodal posteriors via a mixture; (2) mixture-of-experts priors using MDNs conditioned on input observations for structured data (e.g., HRI).

General multimodal architecture (Sutter et al., 8 Mar 2024):

  • Let $X = \{x_1, \ldots, x_M\}$ be data from $M$ modalities.
  • Each modality has its own encoder $q^m_\phi(z_m|x_m)$ and decoder $p^m_\theta(x_m|z_m)$ (typically parameterized by small ResNets or MLPs).
  • Latent space is block-factored: $z = (z_1, \ldots, z_M)$, with each $z_m \in \mathbb{R}^{d_m}$.
  • Conditional independence: $p_\theta(X|z) = \prod_{m=1}^M p^m_\theta(x_m|z_m)$.

Mixture-of-experts prior:

  • The prior for each latent block is a mixture of all $M$ unimodal posteriors:

$$h(z_m|X) = \frac{1}{M} \sum_{k=1}^M q^k_\phi(z_m|x_k)$$

yielding a factored mixture prior $h(z|X) = \prod_{m=1}^M h(z_m|X)$.
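As a minimal sketch of the uniform mixture prior above, the following evaluates $\log h(z_m|X)$ for one latent block, assuming each unimodal posterior is a diagonal Gaussian. The function names and the posterior parameters are hypothetical and chosen only for illustration; they are not taken from the cited work.

```python
import math

def log_normal_diag(z, mu, sigma):
    """Log density of a diagonal Gaussian N(z | mu, diag(sigma^2))."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((zi - m) / s) ** 2
        for zi, m, s in zip(z, mu, sigma)
    )

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_mixture_prior(z_m, posteriors):
    """log h(z_m | X) = log[(1/M) * sum_k q^k(z_m | x_k)] with uniform weights."""
    logs = [log_normal_diag(z_m, mu, sigma) for mu, sigma in posteriors]
    return logsumexp(logs) - math.log(len(posteriors))

# Hypothetical (mu, sigma) pairs for M = 3 unimodal posteriors over a 2-D block.
posteriors = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([0.5, -0.5], [0.8, 1.2]),
    ([-0.3, 0.2], [1.1, 0.9]),
]
z_m = [0.1, 0.0]
log_h = log_mixture_prior(z_m, posteriors)
print(log_h)
```

The log-sum-exp form avoids underflow when one expert assigns very low density to `z_m`. The sketch assumes all experts share the same latent dimension.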

Conditional mixture MDN prior (Prasad et al., 10 Jul 2024):

  • For HRI, the latent prior is a mixture over $K$ components, with parameters produced via a Mixture Density Network (MDN) conditioned on the "human" input $\mathbf{x}^h_t$:

$$p(\mathbf{z}_t | \mathbf{x}^h_t) = \sum_{k=1}^K \pi_k(\mathbf{x}^h_t) \, \mathcal{N}\left(\mathbf{z}_t \mid \boldsymbol{\mu}_k(\mathbf{x}^h_t), \operatorname{diag}(\boldsymbol{\sigma}_k^2(\mathbf{x}^h_t)) \right)$$

  • The likelihood/decoder generates robot actions: $p(\mathbf{x}^r_t|\mathbf{z}_t) = \mathcal{N}(\mathbf{x}^r_t \mid \mathrm{Dec}(\mathbf{z}_t), \operatorname{diag}(\tau^2))$.
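The conditional mixture prior can be illustrated by ancestral sampling: pick a component from the softmax weights, then draw from that component's diagonal Gaussian. This is a sketch only. The `logits`, `means`, and `sigmas` below stand in for the outputs of an MDN evaluated at one human observation and are hypothetical values, not outputs of the cited model.

```python
import math
import random

def softmax(logits):
    """Convert unnormalized scores into mixture weights pi_k that sum to 1."""
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_mdn_prior(logits, means, sigmas, rng):
    """Draw z_t ~ sum_k pi_k N(mu_k, diag(sigma_k^2)); returns (component, sample)."""
    weights = softmax(logits)
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    z = [rng.gauss(m, s) for m, s in zip(means[k], sigmas[k])]
    return k, z

# Hypothetical MDN outputs for K = 3 experts and a 2-D latent.
logits = [1.2, 0.1, -0.7]
means = [[0.0, 1.0], [2.0, -1.0], [-1.5, 0.5]]
sigmas = [[0.3, 0.3], [0.5, 0.2], [0.4, 0.6]]

rng = random.Random(0)
component, z_t = sample_mdn_prior(logits, means, sigmas, rng)
print(component, z_t)
```

A sampled `z_t` would then be passed through the decoder to produce a robot action. Repeated sampling can yield different components, which is how the prior represents several plausible reactions to one input.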

2. Variational Inference and Training Objective

Across both frameworks, the inference model (recognition network) approximates the posterior over latents given observations.

Multimodal setting (Sutter et al., 8 Mar 2024):

  • Each encoder $q^m_\phi(z_m|x_m)$ is trained to approximate the latent for its modality.
  • The training objective is a multimodal ELBO:

$$\mathcal{E}(X) = \mathbb{E}_{q_\phi(z|X)} [\log p_\theta(X|z)] - KL(q_\phi(z|X) \Vert h(z|X))$$

with the KL regularization term factorizing across latent blocks and encoding a Jensen–Shannon (JS) divergence between unimodal posteriors.
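The KL term against a mixture prior has no closed form for Gaussian experts, so one common option is a Monte Carlo estimate. The sketch below estimates $KL(q^m \Vert h)$ for a single latent block by sampling from $q^m$, assuming diagonal Gaussian posteriors; the helper names and parameters are hypothetical, and the cited work may estimate this term differently.

```python
import math
import random

def log_normal_diag(z, mu, sigma):
    """Log density of a diagonal Gaussian N(z | mu, diag(sigma^2))."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((zi - m) / s) ** 2
        for zi, m, s in zip(z, mu, sigma)
    )

def mc_kl_to_mixture(m, posteriors, n_samples, rng):
    """Monte Carlo estimate of KL(q^m || (1/M) sum_k q^k) using samples from q^m."""
    mu_m, sigma_m = posteriors[m]
    total = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(a, s) for a, s in zip(mu_m, sigma_m)]
        log_q = log_normal_diag(z, mu_m, sigma_m)
        logs = [log_normal_diag(z, mu, sg) for mu, sg in posteriors]
        top = max(logs)
        log_h = top + math.log(sum(math.exp(v - top) for v in logs)) - math.log(len(logs))
        total += log_q - log_h
    return total / n_samples

# Hypothetical posteriors for M = 3 modalities over a 2-D latent block.
posteriors = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([1.0, -1.0], [0.7, 0.7]),
    ([-1.0, 0.5], [1.2, 0.8]),
]
rng = random.Random(0)
kl_terms = [mc_kl_to_mixture(m, posteriors, 2000, rng) for m in range(len(posteriors))]
print(kl_terms)
```

Because $q^m$ is itself one of the mixture components, each term is bounded above by $\log M$, which keeps the regularizer from growing without bound when modalities disagree.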

MDN/conditional MoE (Prasad et al., 10 Jul 2024):

  • The encoder is restricted to the robot side: $q(\mathbf{z}_t|\mathbf{x}^r_t) = \mathcal{N}(\mathbf{z}_t \mid \boldsymbol{\mu}_\mathrm{enc}(\mathbf{x}^r_t), \operatorname{diag}(\boldsymbol{\sigma}_\mathrm{enc}^2(\mathbf{x}^r_t)))$.
  • The ELBO per time step:

$$\mathrm{ELBO}^r_t = \mathbb{E}_{q(\mathbf{z}_t|\mathbf{x}^r_t)}[\log p(\mathbf{x}^r_t|\mathbf{z}_t)] - \beta \, KL \left[ q(\mathbf{z}_t|\mathbf{x}^r_t) \,\|\, p(\mathbf{z}_t|\mathbf{x}^h_t) \right]$$

  • Regularization term $\mathcal{L}^{\mathrm{sep}}_t$ includes: mean separation, temporal smoothness of expert means, and entropy on mixtures to prevent mode collapse.

3. Architectural and Algorithmic Details

| Component | MM-VAMP VAE (Sutter et al., 8 Mar 2024) | MoVEInt (Prasad et al., 10 Jul 2024) |
|---|---|---|
| Encoder | MLP/ResNet per modality | FC → LeakyReLU → linear heads |
| Decoder | MLP/ResNet per modality | Mirrored MLP, outputs mean |
| Prior | Uniform mixture of posteriors | MDN: FC → GRU → expert heads |
| Mixture weights | Uniform ($1/M$) | Data-driven (softmax over GRU output) |
| Latent dim | Typically 32–128 per block | 5 (interactions), 10 (handover) |
| Training | Adam, 200–1000 epochs | Adam, 200–500 epochs |

In both cases, standard VAE tricks, such as reparametrized Gaussian sampling, minibatch-based stochastic optimization, and $\beta$-weighted KL terms, are used.
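Two of these tricks are short enough to sketch directly: the reparameterized Gaussian sample and the $\beta$-weighted combination of reconstruction error and KL. The function names and the log-variance parameterization are assumptions made for illustration, not details confirmed by either cited paper.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(0.5 * log_var).

    Writing the sample this way keeps z a deterministic function of (mu, log_var)
    given eps, which is what allows gradients to flow in an autodiff framework.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def beta_weighted_loss(recon_error, kl, beta):
    """Negative ELBO with the KL term scaled by beta."""
    return recon_error + beta * kl

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)
loss = beta_weighted_loss(recon_error=3.5, kl=1.2, beta=0.5)
print(z, loss)
```

In practice these operations would run on tensors inside a deep learning framework; plain Python lists are used here only to keep the sketch dependency-free.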

4. Theoretical Properties and Regularization Strategies

A key distinction of MM-VAMP VAEs is the use of a soft, mixture-based prior, in contrast to hard sharing (e.g., Product-of-Experts, concatenation) or fixed priors. This mixture prior is shown in (Sutter et al., 8 Mar 2024) to maximize the ELBO among all data-dependent factorized priors, and the regularization term in the ELBO reduces to a scaled JS divergence between the unimodal posteriors:

$$\sum_{m=1}^M KL\left(q^m \,\Big\Vert\, \frac{1}{M} \sum_k q^k\right) = M \cdot JS(q^1, \ldots, q^M)$$

This penalizes excessive divergence between modalities' latents but does not enforce collapse, thereby balancing shared against modality-specific structure.
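The identity can be checked numerically. The sketch below uses small discrete distributions in place of the continuous posteriors (an assumption made so that all quantities are exact sums) and takes the uniformly weighted JS divergence to be the entropy of the mixture minus the average entropy of its components.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Three hypothetical "unimodal posteriors" over a 3-state latent.
qs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
M = len(qs)

# Uniform mixture (1/M) * sum_k q^k.
mixture = [sum(q[i] for q in qs) / M for i in range(len(qs[0]))]

sum_of_kls = sum(kl(q, mixture) for q in qs)
js = entropy(mixture) - sum(entropy(q) for q in qs) / M

print(sum_of_kls, M * js)
```

The two printed values agree up to floating-point error. The divergence is zero only when all posteriors coincide and is bounded by $\log M$, so the penalty is finite even for fully disjoint posteriors.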

For conditional multimodality (MoVEInt), mode collapse is further addressed by augmenting the loss with:

  • Mean separation penalty: encourages expert means to be apart.
  • Temporal smoothness: enforces expert means evolve smoothly.
  • Mixture entropy: discourages degenerate (low entropy) expert weights.
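The source does not give the functional form of these three terms, so the following is only one plausible instantiation, written as penalties to be minimized. Every formula and function name here is an assumption for illustration and should not be read as the loss used in MoVEInt.

```python
import math

def mean_separation_penalty(means):
    """Negative average pairwise squared distance; lower when expert means are far apart."""
    total, pairs = 0.0, 0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            total += sum((a - b) ** 2 for a, b in zip(means[i], means[j]))
            pairs += 1
    return -total / pairs

def temporal_smoothness_penalty(means_prev, means_curr):
    """Average squared change of each expert mean between consecutive time steps."""
    total = 0.0
    for prev, curr in zip(means_prev, means_curr):
        total += sum((a - b) ** 2 for a, b in zip(prev, curr))
    return total / len(means_curr)

def negative_entropy(weights, eps=1e-12):
    """Negative entropy of the mixture weights; lower when weights are spread out."""
    return sum(w * math.log(w + eps) for w in weights)

# Hypothetical expert means at two time steps and mixture weights for K = 3.
means_t0 = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
means_t1 = [[0.1, 0.0], [1.1, 0.9], [-0.9, 0.5]]
weights = [0.6, 0.3, 0.1]

penalty = (
    mean_separation_penalty(means_t1)
    + temporal_smoothness_penalty(means_t0, means_t1)
    + negative_entropy(weights)
)
print(penalty)
```

In a real objective each term would carry its own weighting coefficient. The separation term in particular is unbounded below in this form, so it would normally be clipped or given a margin.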

No empirical collapse of latents is observed even for small $\beta$ or aggressive training (Sutter et al., 8 Mar 2024), and regularization provides further stability in MDN-based instantiations (Prasad et al., 10 Jul 2024).

5. Empirical Results and Comparative Performance

Benchmark datasets (Sutter et al., 8 Mar 2024):

  • PolyMNIST: MM-VAMP achieves latent accuracy ≈ 0.92 at MSE = 8.1, compared to independent VAEs (accuracy 0.80, MSE 10.5) and aggregation-based joint VAEs (accuracy 0.78, MSE 8.2).
  • Bimodal CelebA and rodent CA1 neuroscience datasets: superior latent representation and imputation/coherence of missing modalities.

Human-robot interaction (Prasad et al., 10 Jul 2024):

  • Datasets: Multi-interaction (waving, handshake, fistbump) HRI/HHI tasks; “NuiSI” skeleton data; object handover trajectories.
  • MM-VAMP (MoVEInt) achieves lowest mean squared error (MSE) in 10 out of 12 human–robot pairs versus HMM-regularized VAEs (MILD) and LSTM baselines.
  • Example metric (waving gesture, HHI):
    • MILD: 0.788 ± 1.226 (cm)
    • LSTM: 4.121 ± 2.252 (cm)
    • MoVEInt: 0.448 ± 0.630 (cm)
  • Real-world robot handover: 85% success rate (51/60) across naïve users and objects; failure attributed to perception/timing mismatches.

6. Relationships to Classical Methods and Limitations

The mixture-of-experts prior in MM-VAMP VAE for HRI is directly related to Gaussian Mixture Regression (GMR): the neural mixture prior implements the same conditional structure as GMR but allows the mixture parameters to be learned end-to-end via backpropagation and without explicit fitting of a joint GMM+HMM (Prasad et al., 10 Jul 2024).
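For comparison, classical GMR conditions a fitted joint Gaussian mixture on the input. The sketch below does this for a scalar input $x$ and scalar output $y$ using the standard Gaussian conditioning formulas. The component parameters are hypothetical, and the scalar setting is a simplification chosen to keep the example short.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmr_condition(x, components):
    """Condition a joint 2-D GMM over (x, y) on an observed x.

    Each component is (weight, mean_x, mean_y, var_x, cov_xy, var_y).
    Returns a list of (responsibility, conditional mean, conditional variance).
    """
    unnorm = [w * normal_pdf(x, mx, vx) for w, mx, my, vx, cxy, vy in components]
    total = sum(unnorm)
    result = []
    for u, (w, mx, my, vx, cxy, vy) in zip(unnorm, components):
        cond_mean = my + cxy / vx * (x - mx)   # mean of y given x for this component
        cond_var = vy - cxy ** 2 / vx          # variance of y given x for this component
        result.append((u / total, cond_mean, cond_var))
    return result

# Hypothetical two-component joint model.
components = [
    (0.5, -1.0, 2.0, 1.0, 0.5, 1.0),
    (0.5, 1.0, -2.0, 1.0, -0.3, 0.8),
]
conditional = gmr_condition(0.0, components)
expected_y = sum(h * mean for h, mean, _ in conditional)
print(conditional, expected_y)
```

The responsibilities play the role of the input-dependent mixture weights $\pi_k$, and the per-component conditional means and variances play the role of $\boldsymbol{\mu}_k$ and $\boldsymbol{\sigma}_k^2$. An MDN produces these quantities directly from the input with a network rather than deriving them from a fitted joint density.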

In contrast to earlier approaches (e.g., HMM-based regularization or joint-posterior VAEs), MM-VAMP offers joint learning of all tasks, improved imputation/generation, and greater flexibility in information sharing. In the uniform-mixing multimodal case, no separate gating or attention is necessary; in MDN-based variants, a recurrent gating mechanism (e.g., GRU+softmax) generates data-dependent mixture weights.

A limitation is the computational cost of evaluating mixture KLs or training multiple modality-wise encoders when $M$ is large. For MoVEInt, the mixture model must be explicitly regularized against mode collapse, an issue less pressing in the soft-mixing multimodal aggregation setting.

7. Extensions, Applications, and Open Directions

Extensions of MM-VAMP are suggested but not explored in the cited works: nonuniform, learnable mixture weights or gating networks for asymmetric sharing; stacking of MM-VAMP blocks for hierarchical modeling; and hybridization with contrastive learning paradigms via the equivalence of the JS-divergence penalty to certain contrastive losses (Sutter et al., 8 Mar 2024).

The architecture is broadly applicable: from unsupervised and conditional multimodal representation learning (e.g., cross-modal image/text/brain data) to flexible, informative priors in time-series generative models and human–robot behavioral modeling. The demonstrated improvements in reconstruction accuracy, latent discriminability, and coherence in missing-modality imputation mark MM-VAMP as a foundation for future multimodal and conditional generative models.

