miVAE: Multi-Modal Identifiable VAE
- The paper introduces miVAE, which unifies neural activity and visual stimuli by disentangling shared and modality-specific latent spaces.
- It employs a two-level disentanglement strategy with identifiable exponential-family priors and cross-modal losses for zero-shot alignment across subjects.
- Experimental validation shows high cross-modal alignment (Pearson-R > 0.9) and effective score-based attribution for both neural and stimulus features.
The multi-modal identifiable variational autoencoder (miVAE) is a generative modeling framework developed to unify and disentangle the latent structure underlying complex visual stimuli and simultaneously recorded neural responses in the primary visual cortex (V1), with explicit accommodation for cross-individual variability and modality-specific structure. By employing a two-level disentanglement strategy coupled with identifiable exponential-family priors, miVAE enables robust cross-modal alignment and interpretable analysis of neural coding, supporting zero-shot generalization across subjects without the need for subject-specific fine-tuning.
1. Model Structure and Generative Assumptions
miVAE models two observed modalities: neural population activity $x_n \in \mathbb{R}^{N \times T}$, recorded from $N$ neurons over $T$ time points, and dynamic visual stimuli $x_v$, represented as movie frames. Each modality is processed by a distinct neural-network encoder that parameterizes (possibly time-varying) distributions over distinct subspaces of latent variables:
- The neural encoder maps neural activity $x_n$ and per-subject neuron coordinates $u$ to the neural-specific latent $z_n$.
- The visual encoder maps the visual stimulus $x_v$ into stimulus-specific latents $z_v$ and a shared latent $z_s$.
The generative model is defined as
$$p_\theta(x_n, x_v, z_s, z_n, z_v \mid u) = p_{\theta_n}(x_n \mid z_s, z_n)\, p_{\theta_v}(x_v \mid z_s, z_v)\, p(z_s)\, p(z_n \mid u)\, p(z_v),$$
with exponential-family priors: $p(z_n \mid u) \propto \exp\!\big(\textstyle\sum_i \lambda_i(u)\, T_i(z_n)\big)$, where $u$ denotes the per-subject neuron coordinates; $p(z_s)$ and $p(z_v)$ are similarly parameterized with their own sufficient statistics and natural parameters. Conditional decoders $p_{\theta_n}(x_n \mid z_s, z_n)$ and $p_{\theta_v}(x_v \mid z_s, z_v)$ reconstruct each observed modality from their respective latents.
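The following is a minimal PyTorch sketch of this two-encoder layout, assuming diagonal-Gaussian posteriors, an MLP backbone for the neural encoder, a small convolutional backbone for the visual encoder, and a shared-latent head on both encoders (used later for cross-modal alignment). Module names, layer sizes, and the coordinate dimensionality are illustrative rather than the authors' architecture; the conditional decoders would mirror these encoders and are omitted for brevity.

```python
import torch
import torch.nn as nn


class GaussianHead(nn.Module):
    """Maps backbone features to the mean and log-variance of a diagonal Gaussian."""

    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)

    def forward(self, h):
        return self.mu(h), self.logvar(h)


class NeuralEncoder(nn.Module):
    """Encodes neural activity and per-subject neuron coordinates (auxiliary variable u)
    into posteriors over the shared latent z_s and the neural-specific latent z_n."""

    def __init__(self, n_neurons, coord_dim=2, hidden=256, dim_s=8, dim_n=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(n_neurons * (1 + coord_dim), hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        self.head_s = GaussianHead(hidden, dim_s)  # shared latent z_s
        self.head_n = GaussianHead(hidden, dim_n)  # neural-specific latent z_n

    def forward(self, x_n, coords):
        # x_n: (batch, n_neurons) activity; coords: (batch, n_neurons, coord_dim)
        h = self.backbone(torch.cat([x_n, coords.flatten(1)], dim=-1))
        return self.head_s(h), self.head_n(h)


class VisualEncoder(nn.Module):
    """Encodes a stimulus frame into posteriors over the shared latent z_s
    and the stimulus-specific latent z_v."""

    def __init__(self, hidden=256, dim_s=8, dim_v=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden), nn.GELU(),
        )
        self.head_s = GaussianHead(hidden, dim_s)  # shared latent z_s
        self.head_v = GaussianHead(hidden, dim_v)  # stimulus-specific latent z_v

    def forward(self, x_v):
        # x_v: (batch, 1, height, width) grayscale movie frame
        h = self.backbone(x_v)
        return self.head_s(h), self.head_v(h)
```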
2. Two-Level Disentanglement and Identifiability
The latent space is partitioned into a shared component ($z_s$), invariant across modalities and individuals, and modality-specific subspaces: neural-specific ($z_n$) and stimulus-specific ($z_v$). The neural-specific latent $z_n$ captures idiosyncratic neuroanatomical and functional properties of individual subjects (via coordinate-dependent priors), while $z_v$ captures features of the stimulus uncorrelated with neural responses. The shared latent $z_s$ encodes the stimulus-driven features common across individuals and modalities.
Identifiability is achieved under the identifiable VAE (iVAE) framework by enforcing:
- Conditional independence of the latent blocks given the auxiliary variable $u$: $p(z_s, z_n, z_v \mid u) = p(z_s)\, p(z_n \mid u)\, p(z_v)$
- Sufficiently rich exponential-family priors for the shared and modality-specific latents
- KL regularization to enforce proximity of inferred posteriors to their respective conditional priors
The multi-modal loss combines negative expected log-likelihoods (reconstruction terms) with a KL-divergence penalty for each latent block, for example:
$$\mathcal{L}_{\text{multi}} = -\,\mathbb{E}_{q}\!\left[\log p_{\theta_n}(x_n \mid z_s, z_n) + \log p_{\theta_v}(x_v \mid z_s, z_v)\right] + \mathrm{KL}\!\big(q_{\phi_v}(z_s \mid x_v)\,\|\,p(z_s)\big) + \mathrm{KL}\!\big(q_{\phi_n}(z_n \mid x_n, u)\,\|\,p(z_n \mid u)\big) + \mathrm{KL}\!\big(q_{\phi_v}(z_v \mid x_v)\,\|\,p(z_v)\big).$$
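A hedged sketch of such a loss follows, assuming diagonal-Gaussian posteriors and priors and Gaussian likelihoods (so the reconstruction terms reduce to mean-squared errors); the equal weighting of terms and the exact likelihood choices are assumptions rather than the paper's specification.

```python
import torch.nn.functional as F


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over latent dimensions and averaged over the batch."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()


def multimodal_loss(x_n, x_v, recon_n, recon_v, post, priors):
    """Reconstruction terms plus one KL penalty per latent block (z_s, z_n, z_v).

    `post` and `priors` are dicts mapping "z_s", "z_n", "z_v" to (mu, logvar) pairs;
    the conditional prior for z_n depends on the per-subject neuron coordinates u.
    """
    rec = F.mse_loss(recon_n, x_n) + F.mse_loss(recon_v, x_v)
    kl = sum(kl_diag_gaussians(*post[k], *priors[k]) for k in ("z_s", "z_n", "z_v"))
    return rec + kl
```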
3. Variational Objective and Inference Procedures
Variational inference in miVAE is implemented using factorized (mean-field) variational distributions over the latents, conditioned on the observed modalities. The evidence lower bound (ELBO) combines the reconstruction and KL terms,
$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z_s, z_n, z_v \mid x_n, x_v, u)}\!\left[\log p_{\theta_n}(x_n \mid z_s, z_n) + \log p_{\theta_v}(x_v \mid z_s, z_v)\right] - \mathrm{KL}\!\big(q_\phi(z_s, z_n, z_v \mid x_n, x_v, u)\,\|\,p(z_s, z_n, z_v \mid u)\big),$$
where the joint prior factorizes as $p(z_s, z_n, z_v \mid u) = p(z_s)\, p(z_n \mid u)\, p(z_v)$.
A novel aspect is the cross-modal loss, which swaps in the shared-latent posterior inferred from one modality as the conditional prior for the other, reinforcing alignment in the shared latent space:
$$\mathcal{L}_{\text{cross}} = \mathrm{KL}\!\big(q_{\phi_n}(z_s \mid x_n, u)\,\|\, q_{\phi_v}(z_s \mid x_v)\big) + \mathrm{KL}\!\big(q_{\phi_v}(z_s \mid x_v)\,\|\, q_{\phi_n}(z_s \mid x_n, u)\big).$$
This structure ensures that $z_s$ captures the common signal linking neural and stimulus domains, while $z_n$ and $z_v$ absorb domain-specific variations.
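One plausible implementation of the cross-modal term is sketched below, reusing `kl_diag_gaussians` from the earlier sketch and treating each modality's shared-latent posterior as the (detached) target for the other; the symmetric form and the stop-gradient are assumptions, not necessarily the authors' exact formulation.

```python
def cross_modal_loss(post_s_neural, post_s_visual):
    """Symmetric KL that pulls the two shared-latent posteriors toward each other.

    Each argument is a (mu, logvar) pair: q(z_s | x_n, u) from the neural encoder
    and q(z_s | x_v) from the visual encoder. Gradients are stopped on the "prior"
    side so each modality is regularized toward the other's current estimate.
    """
    mu_n, lv_n = post_s_neural
    mu_v, lv_v = post_s_visual
    kl_n_to_v = kl_diag_gaussians(mu_n, lv_n, mu_v.detach(), lv_v.detach())
    kl_v_to_n = kl_diag_gaussians(mu_v, lv_v, mu_n.detach(), lv_n.detach())
    return kl_n_to_v + kl_v_to_n
```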
4. Cross-Individual and Cross-Modal Alignment
A primary aim is zero-shot alignment of the shared neural representation ($z_s$) across individuals. The model uses subject-specific priors for $z_n$ but conditions $z_s$ solely on the observed stimuli (or population activity), ensuring transferability of the learned manifold.
After training, the shared-latent distributions from the neural and stimulus encoders are aligned by minimal affine transformations, $\hat{z}_s = A z_s + b$, with matrices $A$ and offsets $b$ fitted to match first and second moments (or optimized with a KL loss). This procedure yields cross-correlation scores exceeding 0.90 for held-out individuals and stimuli.
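A minimal moment-matching sketch for fitting such an affine map (whiten the source covariance, then recolor to the target); the function name and the eigendecomposition route are illustrative choices, not the paper's procedure.

```python
import torch


def fit_affine_by_moments(z_src, z_tgt, eps=1e-6):
    """Fit z_tgt ≈ z_src @ A.T + b by matching first and second moments."""
    mu_s, mu_t = z_src.mean(0), z_tgt.mean(0)
    cov_s = torch.cov(z_src.T) + eps * torch.eye(z_src.shape[1])
    cov_t = torch.cov(z_tgt.T) + eps * torch.eye(z_tgt.shape[1])

    def mat_pow(c, p):
        # Symmetric matrix power via eigendecomposition.
        w, v = torch.linalg.eigh(c)
        return (v * w.clamp_min(eps).pow(p)) @ v.T

    A = mat_pow(cov_t, 0.5) @ mat_pow(cov_s, -0.5)  # whiten source, recolor to target
    b = mu_t - A @ mu_s
    return A, b


# Usage: map neural-side shared latents onto the visual-side shared manifold.
# A, b = fit_affine_by_moments(z_s_neural, z_s_visual)
# z_aligned = z_s_neural @ A.T + b
```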
Experimental results demonstrate that miVAE achieves Pearson-$R$ scores of 0.8694 (Stage 1 encoding), 0.8809 (linear mapping), 0.9149 (nonlinear coding), and 0.8984–0.9635 (cross-individual alignment). Combining the multi-modal and cross-modal losses is necessary, and removal of the neural-specific latent sharply impairs performance. A 16-dimensional latent space (8 shared, 4 + 4 modality-specific) was found optimal.
5. Score-Based Attribution Mechanism
miVAE introduces a score-based attribution method to assign importance weights to neural units or stimulus features associated with each shared latent dimension. For identifying neural contributions to a shared latent dimension $z_{s,j}$, the Fisher score
$$s_j(x_n) = \nabla_{x_n} \log p(z_{s,j} \mid x_n, u) = \nabla_{x_n} \log p(x_n, z_{s,j} \mid u) - \nabla_{x_n} \log p(x_n \mid u)$$
is computed, with the marginal term $\nabla_{x_n} \log p(x_n \mid u)$ typically approximated or omitted. The elementwise gradient magnitudes act as importance scores, partitioning the neural population into functionally distinct subpopulations.
On the stimulus side, analogous gradients ($\nabla_{x_v} \log p(z_{s,j} \mid x_v)$) highlight spatial and temporal regions of the visual input that drive the shared code. Empirical analysis revealed that the most "important" neuron subset (approximately 700 units) achieves higher stimulus-classification accuracy (91.29%) than the non-selected group (82.92%) or the full population (87.24%), indicating that attribution isolates a more discriminative subpopulation. Attribution maps in the stimulus domain preferentially identify edge- and luminance-sensitive movie regions.
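A minimal sketch of gradient-based attribution consistent with this description, using the gradient of the shared-latent posterior mean with respect to the neural input as a tractable surrogate for the score (the marginal term is omitted, as noted above); the encoder interface matches the earlier encoder sketch and is an assumption.

```python
import torch


def attribution_scores(neural_encoder, x_n, coords, latent_index):
    """Per-neuron importance for one shared-latent dimension via input gradients.

    Differentiates the posterior mean of z_s[latent_index] with respect to the
    neural input and uses the averaged gradient magnitude per neuron as a score.
    """
    x = x_n.clone().requires_grad_(True)
    (mu_s, _), _ = neural_encoder(x, coords)   # ((mu_s, logvar_s), (mu_n, logvar_n))
    mu_s[:, latent_index].sum().backward()     # scalar objective: one shared dimension
    return x.grad.abs().mean(dim=0)            # (n_neurons,) importance scores


# Usage sketch: rank neurons by importance for shared dimension 0.
# scores = attribution_scores(neural_encoder, x_n_batch, coords_batch, latent_index=0)
# top_neurons = torch.argsort(scores, descending=True)[:700]
```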
6. Experimental Validation and Quantitative Findings
The evaluation protocol used the Sensorium 2023 V1 dataset (10 mice, ~78,000 neurons, 30 Hz, 36×64 movies), with training on data from 7 mice and testing on 3 held-out animals for zero-shot transfer assessment. Calcium traces were preprocessed to deconvolved spike rates and temporally aligned to video frames.
Training employed AdamW (batch size 32, 400 epochs, cosine-annealed learning rate) on eight A100 GPUs; a configuration sketch appears below the results table. Key outcomes include:
| Task/Stage | Score | Notes |
|---|---|---|
| Stage 1 encoding | 0.8694 | miVAE neural decoding |
| Stage 2 latent coding | 0.8809 (linear), 0.9149 (nonlinear) | alignment transforms |
| Cross-individual | 0.8984 (Stage 1), 0.9635 (Stage 2, nonlinear) | transfer |
| Neuron selection accuracy | 91.29% (attribution-selected), 82.92% (complement), 87.24% (all) | stimulus classification |
Ablations confirmed the necessity of both the cross-modal and multi-modal losses, and of the neural-specific latent path. Performance improved monotonically with dataset size (more mice, more trials, and larger neuron subsets).
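For reference, a minimal sketch of the stated optimizer and scheduler configuration; the model, data loader, and learning-rate value are placeholders, since only AdamW, batch size 32, 400 epochs, and cosine annealing are given in the text.

```python
import torch

# Stand-ins: only AdamW, batch size 32, 400 epochs, and cosine annealing
# come from the text; everything else here is an illustrative placeholder.
model = torch.nn.Linear(16, 16)                    # placeholder for the miVAE modules
loader = [torch.randn(32, 16) for _ in range(10)]  # placeholder batches of size 32
base_lr = 1e-4                                     # assumed value, not taken from the paper

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400)

for epoch in range(400):
    for batch in loader:
        optimizer.zero_grad()
        loss = model(batch).pow(2).mean()          # placeholder for the miVAE losses
        loss.backward()
        optimizer.step()
    scheduler.step()                               # one cosine-annealing step per epoch
```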
7. Interpretability and Broader Applicability
The disentangled shared latent ($z_s$) in miVAE encodes reproducible, stimulus-specific features robustly invariant to individual heterogeneity. Score-based attribution yields interpretable, distinct neural subpopulations with differing temporal dynamics and high stimulus discrimination capacity, while stimulus-side attribution emphasizes regions relevant for primary visual processing (e.g., those sensitive to spatial edges or luminance).
miVAE’s structure—leveraging modality-agnostic, identifiable priors and bidirectional multi-modal/cross-modal losses—is generalizable beyond V1. It can be adapted for analysis of other sensory cortices (auditory, somatosensory) and for integrative models combining behavioral and neural data in decision-making contexts. Applicability is ensured by defining appropriate domain-specific priors for each measurement setup, while the shared latent is always extracted via cross-modal variational objectives.
In summary, miVAE synthesizes two-level disentanglement, identifiable exponential-family probabilistic structure, advanced cross-modal variational learning, and attribution-based interpretability, providing a scalable and generalizable framework for the neurocomputational modeling of sensory representations and their individual-specific manifestations across large subject cohorts.