MM-VAMP VAE: Generative Models for Multimodal Data
- The paper introduces MM-VAMP VAE, extending the VAE framework by incorporating a soft, data-dependent mixture-of-experts prior to handle multimodal and conditional mappings.
- The methodology uses modality-specific encoders/decoders and either a uniform mixture-of-posteriors prior or an MDN-based conditional mixture prior, with a specialized ELBO that balances shared and modality-specific latent features.
- Empirical results show improved reconstruction accuracy and latent coherence across diverse datasets, notably achieving lower error rates in human-robot interactions compared to baseline models.
A Multimodal Variational Mixture-of-Experts Variational Autoencoder (MM-VAMP VAE) is a class of latent-variable generative models that extends the standard VAE formalism to settings involving either multimodal data (multiple distinct observation channels, such as image + text; (Sutter et al., 8 Mar 2024)) or conditionally multimodal mappings (e.g., human-robot interaction, where a human action can induce multiple plausible robot reactions; (Prasad et al., 10 Jul 2024)). MM-VAMP VAEs use a learned, data-dependent mixture-of-experts prior over the latent variables: crucially, this prior is informed by the observations and is itself composed either as a mixture of (posterior) "experts" or as a mixture density network (MDN) fed by designated input streams. The framework thus enables soft sharing of information across modalities or agents and principled handling of multimodal latent spaces.
1. Model Formulation and Generative Structure
There are two dominant lines in MM-VAMP VAE research: (1) multimodal data modeling, where each modality has an encoder and decoder, and the latent prior softly aggregates unimodal posteriors via a mixture; (2) mixture-of-experts priors using MDNs conditioned on input observations for structured data (e.g., HRI).
General multimodal architecture (Sutter et al., 8 Mar 2024):
- Let $X = (x_1, \dots, x_M)$ be data from $M$ modalities.
- Each modality $m$ has its own encoder $q_\phi(z_m \mid x_m)$ and decoder $p_\theta(x_m \mid z_m)$ (typically parameterized by small ResNets or MLPs).
- The latent space is block-factored: $z = (z_1, \dots, z_M)$, with each $z_m \in \mathbb{R}^{d}$.
- Conditional independence: $p_\theta(X \mid z) = \prod_{m=1}^{M} p_\theta(x_m \mid z_m)$.
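As a concrete, purely illustrative sketch of this block-factored setup, the snippet below uses plain NumPy with random linear maps standing in for the per-modality ResNet/MLP encoders and decoders; all layer sizes are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

M, x_dim, z_dim = 3, 16, 8  # assumed sizes, not taken from the paper

# One random linear "encoder" per modality: x_m -> (mu_m, logvar_m).
enc_W = [rng.normal(0.0, 0.1, (x_dim, 2 * z_dim)) for _ in range(M)]
# One random linear "decoder" per modality: z_m -> reconstruction mean of x_m.
dec_W = [rng.normal(0.0, 0.1, (z_dim, x_dim)) for _ in range(M)]

def encode(m, x_m):
    """Parameters of the unimodal Gaussian posterior q(z_m | x_m)."""
    h = x_m @ enc_W[m]
    return h[:z_dim], h[z_dim:]  # (mu, logvar)

def decode(m, z_m):
    """Mean of the modality-specific likelihood p(x_m | z_m)."""
    return z_m @ dec_W[m]

X = [rng.normal(size=x_dim) for _ in range(M)]
posteriors = [encode(m, X[m]) for m in range(M)]
# Conditional independence: each x_m is reconstructed from its own z_m only.
recons = [decode(m, posteriors[m][0]) for m in range(M)]
```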
Mixture-of-experts prior:
- The prior for each latent block $z_m$ is a uniform mixture of all $M$ unimodal posteriors:

$$h(z_m \mid X) = \frac{1}{M} \sum_{\tilde{m}=1}^{M} q_\phi(z_m \mid x_{\tilde{m}}),$$

yielding a factored mixture prior $h(z \mid X) = \prod_{m=1}^{M} h(z_m \mid X)$.
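The mixture prior can be evaluated directly from the unimodal posterior parameters. A minimal NumPy sketch (diagonal Gaussians, logsumexp for numerical stability; not the authors' code):

```python
import numpy as np

def gauss_logpdf(z, mu, logvar):
    """Diagonal-Gaussian log-density log N(z; mu, diag(exp(logvar)))."""
    return float(-0.5 * np.sum(logvar + np.log(2 * np.pi)
                               + (z - mu) ** 2 / np.exp(logvar)))

def mmvamp_log_prior(z, posteriors):
    """log h(z_m | X): uniform mixture of the M unimodal posteriors,
    each given as a (mu, logvar) pair. Computed via logsumexp."""
    logps = np.array([gauss_logpdf(z, mu, lv) for mu, lv in posteriors])
    a = logps.max()
    return float(a + np.log(np.mean(np.exp(logps - a))))
```

When all unimodal posteriors coincide, the mixture collapses to that single component, which gives a quick sanity check.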
Conditional mixture MDN prior (Prasad et al., 10 Jul 2024):
- For HRI, the latent prior is a mixture over $K$ components, with parameters produced by a Mixture Density Network (MDN) conditioned on the "human" input $x^h$:

$$p(z \mid x^h) = \sum_{k=1}^{K} \alpha_k(x^h)\, \mathcal{N}\big(z;\ \mu_k(x^h), \Sigma_k(x^h)\big).$$

- The likelihood/decoder generates robot actions: $p_\theta(x^r \mid z)$.
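A hedged NumPy sketch of such an MDN head: a random feature vector stands in for the GRU state used in MoVEInt, and all layer sizes here are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
h_dim, z_dim, K = 12, 5, 3  # assumed sizes (z_dim=5 matches the interaction tasks)

# Linear MDN heads over a feature vector phi(x^h) (a GRU state in MoVEInt;
# here a fixed random feature map stands in for it).
W_mu = rng.normal(0.0, 0.1, (h_dim, K * z_dim))
W_lv = rng.normal(0.0, 0.1, (h_dim, K * z_dim))
W_pi = rng.normal(0.0, 0.1, (h_dim, K))

def mdn_prior(phi):
    """Mixture parameters (alpha, mus, logvars) of p(z | x^h)."""
    mus = (phi @ W_mu).reshape(K, z_dim)
    logvars = (phi @ W_lv).reshape(K, z_dim)
    logits = phi @ W_pi
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()  # softmax gating weights over the K experts
    return alpha, mus, logvars

alpha, mus, logvars = mdn_prior(rng.normal(size=h_dim))
```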
2. Variational Inference and Training Objective
Across both frameworks, the inference model (recognition network) approximates the posterior over latents given observations.
Multimodal setting (Sutter et al., 8 Mar 2024):
- Each encoder $q_\phi(z_m \mid x_m)$ is trained to approximate the posterior over the latent block for its modality.
- The training objective is a multimodal ELBO:

$$\mathcal{L}(X) = \sum_{m=1}^{M} \mathbb{E}_{q_\phi(z_m \mid x_m)}\big[\log p_\theta(x_m \mid z_m)\big] - \sum_{m=1}^{M} \mathrm{KL}\big(q_\phi(z_m \mid x_m)\,\|\, h(z_m \mid X)\big),$$

with the KL regularization term factorizing across latent blocks and encoding a Jensen–Shannon (JS) divergence between the unimodal posteriors.
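A one-sample Monte-Carlo estimate of this objective can be written compactly. The sketch below (plain NumPy, illustrative only) evaluates the mixture prior $h$ at each reparameterized sample:

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss_logpdf(z, mu, logvar):
    """Diagonal-Gaussian log-density log N(z; mu, diag(exp(logvar)))."""
    return float(-0.5 * np.sum(logvar + np.log(2 * np.pi)
                               + (z - mu) ** 2 / np.exp(logvar)))

def multimodal_elbo(posteriors, recon_logliks):
    """One-sample Monte-Carlo multimodal ELBO:
    sum_m E_q[log p(x_m | z_m)] - sum_m KL(q(z_m | x_m) || h(z_m | X)),
    with the uniform mixture prior h evaluated at the sampled z_m."""
    kl = 0.0
    for mu, lv in posteriors:
        z = mu + np.exp(0.5 * lv) * rng.normal(size=mu.shape)  # reparameterized
        log_q = gauss_logpdf(z, mu, lv)
        comps = np.array([gauss_logpdf(z, mu2, lv2) for mu2, lv2 in posteriors])
        a = comps.max()
        log_h = float(a + np.log(np.mean(np.exp(comps - a))))
        kl += log_q - log_h
    return float(sum(recon_logliks) - kl)
```

If all unimodal posteriors agree, the mixture equals each component and the KL penalty vanishes, leaving only the reconstruction terms.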
MDN/conditional MoE (Prasad et al., 10 Jul 2024):
- The encoder is restricted to the robot side: $q_\phi(z \mid x^r)$.
- The ELBO per time step $t$:

$$\mathcal{L}_t = \mathbb{E}_{q_\phi(z_t \mid x^r_t)}\big[\log p_\theta(x^r_t \mid z_t)\big] - \beta\, \mathrm{KL}\big(q_\phi(z_t \mid x^r_t)\,\|\, p(z_t \mid x^h_t)\big).$$
- Regularization terms include: mean separation, temporal smoothness of expert means, and an entropy term on the mixture weights to prevent mode collapse.
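Plausible NumPy implementations of these three regularizers are sketched below; the exact functional forms and the margin hyperparameter are assumptions, since the summary above does not give the formulas.

```python
import numpy as np

def mean_separation(mus, margin=1.0):
    """Hinge penalty encouraging expert means to stay at least `margin` apart
    (assumed form; zero once all pairs are separated by the margin)."""
    pen = 0.0
    for i in range(len(mus)):
        for j in range(i + 1, len(mus)):
            pen += max(0.0, margin - np.linalg.norm(mus[i] - mus[j])) ** 2
    return pen

def temporal_smoothness(mus_t, mus_prev):
    """Squared change of each expert mean between consecutive time steps."""
    return float(sum(np.sum((a - b) ** 2) for a, b in zip(mus_t, mus_prev)))

def weight_entropy(alpha, eps=1e-12):
    """Entropy of the gating distribution; keeping it high discourages
    degenerate (one-hot) expert weights, i.e., mode collapse."""
    alpha = np.asarray(alpha)
    return float(-np.sum(alpha * np.log(alpha + eps)))
```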
3. Architectural and Algorithmic Details
| Component | MM-VAMP VAE (Sutter et al., 8 Mar 2024) | MoVEInt (Prasad et al., 10 Jul 2024) |
|---|---|---|
| Encoder | MLP/ResNet per modality | FC layers (LeakyReLU) with linear output heads |
| Decoder | MLP/ResNet per modality | Mirrored MLP, outputs mean |
| Prior | Uniform mixture of posteriors | MDN: FC + GRU with per-expert heads |
| Mixture Weights | Uniform ($1/M$) | Data-driven (softmax over GRU output) |
| Latent Dim | Typically 32–128 per block | 5 (interactions), 10 (handover) |
| Training | Adam, 200–1000 epochs | Adam, 200–500 epochs |
In both cases, standard VAE tricks are used: reparameterized Gaussian sampling, minibatch-based stochastic optimization, and $\beta$-weighted KL terms.
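For reference, two of these standard tricks look like this in a NumPy sketch ($\beta = 1$ recovers the usual KL term; the closed-form KL here is taken against a standard-normal prior, the simplest illustrative case rather than the mixture priors above):

```python
import numpy as np

rng = np.random.default_rng(3)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable
    with respect to the posterior parameters."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=np.shape(mu))

def beta_kl_std_normal(mu, logvar, beta=1.0):
    """beta-weighted closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return beta * 0.5 * float(np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar))
```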
4. Theoretical Properties and Regularization Strategies
A key distinction of MM-VAMP VAEs is the use of a soft, mixture-based prior, in contrast to hard sharing (e.g., Product-of-Experts, concatenation) or fixed priors. This mixture prior is shown in (Sutter et al., 8 Mar 2024) to maximize the ELBO among all data-dependent factorized priors, and the regularization term in the ELBO reduces to a scaled JS divergence between the unimodal posteriors:

$$\sum_{m=1}^{M} \mathrm{KL}\Big(q_\phi(z \mid x_m)\,\Big\|\, \tfrac{1}{M}\textstyle\sum_{\tilde{m}=1}^{M} q_\phi(z \mid x_{\tilde{m}})\Big) = M \cdot \mathrm{JS}\big(q_\phi(z \mid x_1), \dots, q_\phi(z \mid x_M)\big).$$

This penalizes excessive divergence between the modalities' latents but does not enforce collapse, thereby balancing shared against modality-specific structure.
For conditional multimodality (MoVEInt), mode collapse is further addressed by augmenting the loss with:
- Mean separation penalty: encourages expert means to be apart.
- Temporal smoothness: enforces expert means evolve smoothly.
- Mixture entropy: discourages degenerate (low entropy) expert weights.
No empirical collapse of the latents is observed even under aggressive training (Sutter et al., 8 Mar 2024), and regularization provides further stability in MDN-based instantiations (Prasad et al., 10 Jul 2024).
5. Empirical Results and Comparative Performance
Benchmark datasets (Sutter et al., 8 Mar 2024):
- PolyMNIST: MM-VAMP achieves latent accuracy ≈ 0.92 at MSE = 8.1, compared to independent VAEs (0.80/10.5) and aggregation-based joint VAEs (0.78/8.2).
- Bimodal CelebA and rodent CA1 neuroscience datasets: superior latent representation and imputation/coherence of missing modalities.
Human-robot interaction (Prasad et al., 10 Jul 2024):
- Datasets: Multi-interaction (waving, handshake, fistbump) HRI/HHI tasks; “NuiSI” skeleton data; object handover trajectories.
- MM-VAMP (MoVEInt) achieves lowest mean squared error (MSE) in 10 out of 12 human–robot pairs versus HMM-regularized VAEs (MILD) and LSTM baselines.
- Example metric (waving gesture, HHI):
- MILD: 0.788 ± 1.226 (cm)
- LSTM: 4.121 ± 2.252 (cm)
- MoVEInt: 0.448 ± 0.630 (cm)
- Real-world robot handover: 85% success rate (51/60) across naïve users and objects; failures attributed to perception/timing mismatches.
6. Relationships to Classical Methods and Limitations
The mixture-of-experts prior in MM-VAMP VAE for HRI is directly related to Gaussian Mixture Regression (GMR): the neural mixture prior implements the same conditional structure as GMR but allows the mixture parameters to be learned end-to-end via backpropagation and without explicit fitting of a joint GMM+HMM (Prasad et al., 10 Jul 2024).
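The GMR correspondence is easy to make concrete: conditioning the learned mixture on $x^h$ and taking its expectation yields the GMR-style point estimate $\mathbb{E}[z \mid x^h] = \sum_k \alpha_k(x^h)\,\mu_k(x^h)$. A minimal sketch (illustrative, not the MoVEInt implementation):

```python
import numpy as np

def gmr_mean(alpha, mus):
    """GMR-style conditional mean under the mixture prior:
    E[z | x^h] = sum_k alpha_k(x^h) * mu_k(x^h)."""
    return np.einsum('k,kd->d', np.asarray(alpha), np.asarray(mus))
```

With a one-hot gating vector this reduces to selecting a single expert's mean, which is the degenerate case the entropy regularizer above discourages.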
In contrast to earlier approaches (e.g., HMM-based regularization or joint-posterior VAEs), MM-VAMP offers joint learning of all tasks, improved imputation/generation, and greater flexibility in information sharing. In the uniform-mixing multimodal case, no separate gating or attention is necessary; in MDN-based variants, a recurrent gating mechanism (e.g., GRU+softmax) generates data-dependent mixture weights.
A limitation is the computational cost of evaluating mixture KLs and training $M$ modality-wise encoders when $M$ is large. For MoVEInt, the mixture model must be explicitly regularized against mode collapse, an issue less pressing in the soft-mixing multimodal aggregation setting.
7. Extensions, Applications, and Open Directions
Extensions of MM-VAMP are suggested but not explored in the cited works: nonuniform, learnable mixture weights or gating networks for asymmetric sharing; stacking of MM-VAMP blocks for hierarchical modeling; and hybridization with contrastive learning paradigms via the equivalence of the JS-divergence penalty to certain contrastive losses (Sutter et al., 8 Mar 2024).
The architecture is broadly applicable: from unsupervised and conditional multimodal representation learning (e.g., cross-modal image/text/brain data) to flexible, informative priors in time-series generative models and human–robot behavioral modeling. The demonstrated improvements in reconstruction accuracy, latent discriminability, and coherence in missing-modality imputation mark MM-VAMP as a foundation for future multimodal and conditional generative models.
References:
- Prasad et al., "MoVEInt: Mixture of Variational Experts for Learning Human-Robot Interactions from Demonstrations" (Prasad et al., 10 Jul 2024)
- Sutter et al., "Unity by Diversity: Improved Representation Learning in Multimodal VAEs" (Sutter et al., 8 Mar 2024)