
miVAE: Multi-Modal Identifiable VAE

Updated 12 November 2025
  • The paper introduces miVAE, which unifies neural activity and visual stimuli by disentangling shared and modality-specific latent spaces.
  • It employs a two-level disentanglement strategy with identifiable exponential-family priors and cross-modal losses for zero-shot alignment across subjects.
  • Experimental validation shows high cross-modal alignment (Pearson-R > 0.9) and effective score-based attribution for both neural and stimulus features.

The multi-modal identifiable variational autoencoder (miVAE) is a generative modeling framework developed to unify and disentangle the latent structure underlying complex visual stimuli and simultaneously recorded neural responses in the primary visual cortex (V1), with explicit accommodation for cross-individual variability and modality-specific structure. By employing a two-level disentanglement strategy coupled with identifiable exponential-family priors, miVAE enables robust cross-modal alignment and interpretable analysis of neural coding, supporting zero-shot generalization across subjects without the need for subject-specific fine-tuning.

1. Model Structure and Generative Assumptions

miVAE models two observed modalities: neural population activity $n \in \mathbb{R}^{N \times T}$, recorded from $N$ neurons over $T$ time points, and dynamic visual stimuli $x \in \mathbb{R}^{H \times W \times T}$, represented as movie frames. Each modality is processed by a distinct neural-network encoder that parameterizes (possibly time-varying) distributions over distinct subspaces of latent variables:

  • The neural encoder $q_{\phi}(z_n \mid n, u)$ maps neural activity and per-subject neuron coordinates $u \in \mathbb{R}^{3 \times N}$ to the neural-specific latent $z_n \in \mathbb{R}^{d/2 \times T}$.
  • The visual encoder $q_{\psi}(z_x, z_s \mid x)$ maps the visual stimulus to the stimulus-specific latent $z_x \in \mathbb{R}^{d/2 \times T}$ and the shared latent $z_s \in \mathbb{R}^{d/2 \times T}$.

The generative model is defined as
$$p(x, n, z_s, z_x, z_n) = p(z_s)\, p(z_x)\, p(z_n)\, p(n \mid z_s, z_n)\, p(x \mid z_s, z_x)$$
with exponential-family priors
$$p(z_n \mid u) = \prod_{i=1}^{d/2} \frac{Q_i(z_{n,i})}{Z_i(u)} \exp\!\left[ T_i(z_{n,i})\, \lambda_i(u) \right].$$
The priors $p(z_s \mid x)$ and $p(z_s \mid n)$ are parameterized analogously, and $p(z_x) = \mathcal{N}(0, I)$. Conditional decoders $p_{\theta}(n \mid z_s, z_n)$ and $p_{\vartheta}(x \mid z_s, z_x)$ reconstruct each observed modality from its respective latents.
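
To make the structure concrete, here is a minimal sketch of the two encoders in PyTorch. Layer sizes, architectures, and names such as `NeuralEncoder` are illustrative assumptions rather than the authors' implementation; the decoders $p_{\theta}(n \mid z_s, z_n)$ and $p_{\vartheta}(x \mid z_s, z_x)$ would mirror these modules, and the conditional prior $p(z_n \mid u)$ can be parameterized by a small network mapping $u$ to its natural parameters $\lambda_i(u)$.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps features to the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Linear(d_in, d_out)
        self.logvar = nn.Linear(d_in, d_out)

    def forward(self, h):
        return self.mu(h), self.logvar(h)

class NeuralEncoder(nn.Module):
    """q_phi(z_n | n, u): activity plus 3-D neuron coordinates -> neural-specific latent."""
    def __init__(self, n_neurons, d_latent, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_neurons + 3 * n_neurons, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
        )
        self.head = GaussianHead(d_hidden, d_latent)

    def forward(self, n_t, u):
        # n_t: (batch, N) activity at one time step; u: (batch, 3*N) flattened coordinates
        return self.head(self.net(torch.cat([n_t, u], dim=-1)))

class VisualEncoder(nn.Module):
    """q_psi(z_x, z_s | x): movie frame -> stimulus-specific and shared latents."""
    def __init__(self, d_latent, d_hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_hidden), nn.ReLU(),
        )
        self.head_x = GaussianHead(d_hidden, d_latent)  # stimulus-specific z_x
        self.head_s = GaussianHead(d_hidden, d_latent)  # shared z_s

    def forward(self, x_t):
        h = self.backbone(x_t)  # x_t: (batch, 1, H, W) single frame
        return self.head_x(h), self.head_s(h)
```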

2. Two-Level Disentanglement and Identifiability

The latent space is partitioned into a shared component ($z_s$), invariant across modalities and individuals, and modality-specific subspaces: neural-specific ($z_n$) and stimulus-specific ($z_x$). The neural-specific latent $z_n$ captures idiosyncratic neuroanatomical and functional properties of individual subjects (via coordinate-dependent priors), while $z_x$ captures features of the stimulus uncorrelated with neural responses. The shared latent $z_s$ encodes the stimulus-driven features common across individuals and modalities.

Identifiability is achieved under the identifiable VAE (iVAE) framework by enforcing:

  • Conditional independence: $z_n \perp x \mid (n, u)$, $z_s \perp u \mid (n, x)$, $z_x \perp n \mid x$
  • Sufficiently rich exponential-family priors for $z_n$ and $z_s$
  • KL regularization to enforce proximity of inferred posteriors to their respective conditional priors

The multi-modal loss includes KL-divergence penalties for each latent and negative expected log-likelihoods (reconstruction terms). For example:
$$\mathcal{L}_{\mathrm{MM}} = -\mathbb{E}_{q(z_n, z_s \mid n, x)}\big[\log p(n \mid z_s, z_n)\big] + \mathrm{KL}\big[q(z_n \mid n, u) \,\|\, p(z_n \mid u)\big] + \mathrm{KL}\big[q(z_s \mid n, x) \,\|\, p(z_s \mid x)\big] + \cdots$$
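
A minimal sketch of the neural branch of this loss under common simplifying assumptions (diagonal-Gaussian posteriors and conditional priors with closed-form KL, and a Gaussian decoder so the reconstruction term reduces to squared error); the helper names and signatures are hypothetical:

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over latent dims."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1)

def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def multimodal_loss(post_zn, prior_zn, post_zs, prior_zs, n_obs, decode_neural):
    """L_MM, neural branch: reconstruct n from (z_s, z_n) and regularize both posteriors.
    post_*/prior_* are (mu, logvar) pairs; decode_neural maps (z_s, z_n) -> predicted n."""
    z_n = reparameterize(*post_zn)              # sample from q(z_n | n, u)
    z_s = reparameterize(*post_zs)              # sample from q(z_s | n, x)
    n_hat = decode_neural(z_s, z_n)
    recon = ((n_hat - n_obs) ** 2).sum(dim=-1)  # Gaussian log-likelihood up to constants and scale
    kl_n = gaussian_kl(*post_zn, *prior_zn)     # KL[q(z_n|n,u) || p(z_n|u)]
    kl_s = gaussian_kl(*post_zs, *prior_zs)     # KL[q(z_s|n,x) || p(z_s|x)]
    return (recon + kl_n + kl_s).mean()
```

The stimulus branch (reconstructing $x$ from $z_s$ and $z_x$, with a KL term on $z_x$ against $\mathcal{N}(0, I)$) would be added analogously.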

3. Variational Objective and Inference Procedures

Variational inference in miVAE is implemented using factorized (mean-field) variational distributions over the latents conditioned on the observed modalities. The evidence lower bound (ELBO) combines the reconstruction and KL terms:
$$\mathcal{L} = \mathbb{E}_q\big[\log p(n \mid z_s, z_n) + \log p(x \mid z_s, z_x)\big] - \mathrm{KL}\big(q(z_n, z_s, z_x \mid n, x) \,\|\, p(z_n, z_s, z_x)\big),$$
where the joint prior factorizes as $p(z_n \mid u)\, p(z_s \mid x)\, p(z_x)$.

A novel aspect is the cross-modal loss, which replaces the within-modality posteriors with the conditional prior $q(z_n \mid u)$ and the other modality's shared posterior $q(z_s \mid x)$, reinforcing alignment in the shared latent space:
$$\mathcal{L}_{\mathrm{CM}} = -\mathbb{E}_{q(z_n \mid u)\, q(z_s \mid x)}\big[\log p(n \mid z_s, z_n)\big] + \mathrm{KL}\big[q(z_n \mid u) \,\|\, q(z_n \mid n, u)\big] + \mathrm{KL}\big[q(z_s \mid x) \,\|\, q(z_s \mid n, x)\big] + \cdots$$
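
A sketch of the corresponding cross-modal term under the same Gaussian assumptions, reusing `gaussian_kl` and `reparameterize` from the sketch above; argument names are hypothetical:

```python
def crossmodal_loss(prior_zn, post_zn, cross_zs, post_zs, n_obs, decode_neural):
    """L_CM, neural branch: decode n from z_n ~ q(z_n | u) and z_s ~ q(z_s | x),
    and keep these distributions close to the within-modality posteriors."""
    z_n = reparameterize(*prior_zn)             # sample from the conditional prior q(z_n | u)
    z_s = reparameterize(*cross_zs)             # sample from the stimulus-side shared posterior q(z_s | x)
    n_hat = decode_neural(z_s, z_n)
    recon = ((n_hat - n_obs) ** 2).sum(dim=-1)
    kl_n = gaussian_kl(*prior_zn, *post_zn)     # KL[q(z_n|u) || q(z_n|n,u)]
    kl_s = gaussian_kl(*cross_zs, *post_zs)     # KL[q(z_s|x) || q(z_s|n,x)]
    return (recon + kl_n + kl_s).mean()
```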

This structure ensures that $z_s$ captures the common signal linking the neural and stimulus domains, while $z_n$ and $z_x$ absorb domain-specific variation.

4. Cross-Individual and Cross-Modal Alignment

A primary aim is zero-shot alignment of shared neural representations ($z_s$) across individuals. The model uses subject-specific priors for $z_n$ but conditions $z_s$ solely on observed stimuli (or population activity), ensuring transferability of the learned manifold.

After training, the shared latent distributions from the neural and stimulus encoders are aligned by minimal affine transformations:
$$z_s^n = A_1 z_s^x + b_1, \qquad z_s^x = A_2 z_s^n + b_2,$$
with matrices and offsets fitted to match first and second moments (or optimized with a KL loss). This procedure yields cross-correlation scores exceeding 0.90 for held-out individuals and stimuli.
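
A minimal version of this post-hoc alignment fits $A_1, b_1$ by ridge-regularized least squares on paired shared latents; the closed-form fit and the function names are illustrative choices (the text also mentions moment matching or a KL objective):

```python
import numpy as np

def fit_affine(zs_x, zs_n, ridge=1e-3):
    """Fit z_s^n ≈ A @ z_s^x + b from paired samples of shape (num_samples, d_shared)."""
    X = np.hstack([zs_x, np.ones((zs_x.shape[0], 1))])   # append a bias column
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ zs_n)
    return W[:-1].T, W[-1]                               # A (d x d), b (d,)

def apply_affine(A, b, zs_x):
    return zs_x @ A.T + b

# Usage: fit on training pairs, then map stimulus-side latents for a held-out subject
# and score the alignment with per-dimension Pearson correlation.
# A, b = fit_affine(zs_x_train, zs_n_train)
# zs_n_pred = apply_affine(A, b, zs_x_test)
# r = [np.corrcoef(zs_n_pred[:, i], zs_n_test[:, i])[0, 1] for i in range(zs_n_pred.shape[1])]
```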

Experimental results demonstrate that miVAE achieves Pearson-$R$ scores of 0.8694 (Stage 1 encoding), 0.8809 (linear mapping), 0.9149 (nonlinear coding), and 0.8984–0.9635 (cross-individual $z_s$ alignment). Combining both multi-modal and cross-modal losses is necessary; removal of the neural-specific latent sharply impairs performance. A latent dimension of $d = 16$ (8 shared, 4+4 idiosyncratic) was found optimal.

5. Score-Based Attribution Mechanism

miVAE introduces a score-based attribution method to assign importance weights to neural units or stimulus features associated with each shared latent dimension. For identifying neural contributions to a shared latent, the Fisher score

$$\nabla_n \log p(z_s^x \mid n) \approx \nabla_n \log p(n \mid z_s^x) - \nabla_n \log p(n)$$

is computed, with the marginal term typically approximated or omitted. The elementwise gradient magnitudes act as importance scores, partitioning the neural population into functionally distinct subpopulations.
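
A hedged sketch of this neural-side attribution: under a Gaussian decoder, $\nabla_n \log p(n \mid z_s^x)$ reduces to the gradient of the reconstruction term, and the marginal term is dropped as noted above. The module names (`encode_zn`, `decode_neural`) stand in for trained miVAE components and are assumptions:

```python
import torch

def neuron_attribution(n_obs, u, zs_x, encode_zn, decode_neural, noise_var=1.0):
    """Per-neuron importance scores for a stimulus-derived shared latent z_s^x.
    Approximates |grad_n log p(n | z_s^x)| with a Gaussian decoder; the marginal
    term grad_n log p(n) is omitted, following the approximation in the text."""
    n = n_obs.clone().requires_grad_(True)
    mu_zn, _ = encode_zn(n, u)              # posterior mean of the neural-specific latent
    n_hat = decode_neural(zs_x, mu_zn)      # reconstruct n from (z_s^x, z_n)
    log_lik = -0.5 * ((n - n_hat) ** 2).sum() / noise_var
    (grad,) = torch.autograd.grad(log_lik, n)
    return grad.abs().mean(dim=0)           # average gradient magnitude per neuron

# Usage: rank neurons and split into the attribution-selected subset and its complement.
# scores = neuron_attribution(n_batch, u_batch, zs_x_batch, encode_zn, decode_neural)
# selected = torch.topk(scores, k=700).indices
```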

On the stimulus side, analogous gradients ($\nabla_x \log p(x \mid z_s^n)$) highlight spatial and temporal regions in the visual input that drive the shared code. Empirical analysis revealed that the most "important" neuron subset (approximately 700 units) achieves higher classification accuracy (91.29%) than the non-selected group (82.92%) or the full population (87.24%), evidencing refined discriminative specificity. Attribution maps in the stimulus domain preferentially identify edge- and luminance-sensitive movie regions.

6. Experimental Validation and Quantitative Findings

The evaluation protocol used the Sensorium 2023 V1 dataset (10 mice, ~78,000 neurons, 30 Hz, 36×64 movies), with training on data from 7 mice and testing on 3 held-out animals for zero-shot transfer assessment. Calcium traces were preprocessed to deconvolved spike rates and temporally aligned to video frames.

Training employed AdamW (batch 32, learning rate 10410^{-4}, 400 epochs, cosine annealing) on eight A100 GPUs. Key outcomes include:

| Task/Stage | Pearson-$R$ / accuracy | Notes |
| --- | --- | --- |
| Stage 1 encoding | 0.8694 | miVAE neural decoding |
| Stage 2 latent coding | 0.8809 (linear), 0.9149 (nonlinear) | alignment transforms |
| Cross-individual $z_s$ | 0.8984 (Stage 1), 0.9635 (Stage 2, nonlinear) | transfer |
| Neuron selection accuracy | 91.29% (attribution-selected), 82.92% (complement), 87.24% (all) | stimulus classification |
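
A sketch of the reported optimizer and schedule (AdamW, learning rate $10^{-4}$, cosine annealing over 400 epochs); the model, data loader, and combined loss function are placeholders for the components sketched earlier:

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train_mivae(model, train_loader, compute_loss, epochs=400, lr=1e-4):
    """Training loop matching the reported setup; `compute_loss` is assumed to sum
    the multi-modal and cross-modal terms for a batch of (stimulus, activity, coordinates)."""
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for batch in train_loader:          # batch size 32 in the reported setup
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```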

Ablation studies confirmed the necessity of both the cross-modal and multi-modal losses, and of the neural-specific latent path. Larger dataset size (more mice, more trials, smaller neuron subsets) improved performance monotonically.

7. Interpretability and Broader Applicability

The disentangled shared latent ($z_s$) in miVAE encodes reproducible, stimulus-specific features robustly invariant to individual heterogeneity. Score-based attribution yields interpretable, distinct neural subpopulations with differing temporal dynamics and high stimulus-discrimination capacity, while stimulus-side attribution emphasizes regions relevant for primary visual processing (e.g., those sensitive to spatial edges or luminance).

miVAE’s structure—leveraging modality-agnostic, identifiable priors and bidirectional multi-modal/cross-modal losses—is generalizable beyond V1. It can be adapted for analysis of other sensory cortices (auditory, somatosensory) and for integrative models combining behavioral and neural data in decision-making contexts. Applicability is ensured by defining appropriate domain-specific priors for each measurement setup, while the shared latent is always extracted via cross-modal variational objectives.

In summary, miVAE synthesizes two-level disentanglement, identifiable exponential-family probabilistic structure, advanced cross-modal variational learning, and attribution-based interpretability, providing a scalable and generalizable framework for the neurocomputational modeling of sensory representations and their individual-specific manifestations across large subject cohorts.
