Multi-Study Sparse VAE
- The paper demonstrates that MSSVAE effectively disentangles shared and study-specific latent factors through a nonlinear, sparsity-driven encoder-decoder architecture.
- Key methodologies include spike-and-slab lasso priors to induce selective feature-factor associations and variational inference via ELBO optimization.
- The model yields interpretable solutions in genomics, with successful applications to multi-disease platelet transcriptome analysis and robust identifiability guarantees.
A Multi-Study Sparse Variational Autoencoder (MSSVAE) is a nonlinear probabilistic generative model designed to analyze high-dimensional datasets collected from multiple distinct studies or environments. Its architecture separates latent factors into those shared across all studies and those specific to individual studies. The model is equipped with sparsity-inducing mechanisms for feature-factor associations and delivers identifiability guarantees under anchor-feature assumptions, supporting interpretable solutions in domains such as genomics. The approach was proposed and theoretically characterized in a multi-disease platelet transcriptome context by Moran et al. (26 Jan 2026).
1. Generative Model for Multi-Study Factorization
Let $s = 1, \dots, S$ index studies, $i = 1, \dots, n_s$ index samples in study $s$, and $j = 1, \dots, G$ index features (e.g., genes). The MSSVAE posits:
- A shared latent space $\mathbf{z}_{si} \in \mathbb{R}^{K_0}$, common across studies.
- A study-specific latent space $\mathbf{u}_{si} \in \mathbb{R}^{K_s}$ for each individual study $s$.
Latent vectors for sample $i$ in study $s$ are stacked and zero-padded to form $\tilde{\mathbf{z}}_{si} \in \mathbb{R}^{K}$ with $K = K_0 + \sum_s K_s$, while feature-specific "mask" vectors $\mathbf{w}_j \in \mathbb{R}^{K}$ similarly encode which latent factors affect each feature. The generative process for each feature is:
$$x_{sij} \mid \tilde{\mathbf{z}}_{si} \sim \mathcal{N}\!\big(f_\theta(\mathbf{w}_j \odot \tilde{\mathbf{z}}_{si})_j,\ \sigma_j^2\big),$$
where $f_\theta$ is a feed-forward neural network, and the masked product $\mathbf{w}_j \odot \tilde{\mathbf{z}}_{si}$ allows selective dependence of features on particular latent factors. The full data joint density is a product over studies, samples, and features, parameterized by learned weights, sparsity variables, and noise terms.
An alternative negative binomial (NB) version admits count-valued observations and models $x_{sij} \sim \mathrm{NB}\!\big(\ell_{si}\, f_\theta(\mathbf{w}_j \odot \tilde{\mathbf{z}}_{si})_j,\ \phi_j\big)$, where $\ell_{si}$ represents the sample-specific library size.
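As a concrete illustration of this generative story, the sketch below samples Gaussian observations from a toy masked decoder. All dimensions, the mask pattern, and the network weights are invented for illustration and stand in for the learned quantities.

```python
import math
import random

random.seed(0)

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, b, x):
    # One dense layer: W is a list of rows, b the bias vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

# Toy dimensions: K0 shared + Ks study-specific latents, G features.
K0, Ks, G = 2, 1, 4
K = K0 + Ks

# Random weights for a one-hidden-layer decoder f (hypothetical sizes).
H = 3
W1 = [[random.gauss(0, 0.5) for _ in range(K)] for _ in range(H)]
b1 = [0.0] * H
W2 = [[random.gauss(0, 0.5) for _ in range(H)] for _ in range(G)]
b2 = [0.0] * G

# Feature-specific masks: feature j depends only on latents with mask[j][k] != 0.
mask = [[1.0 if (j + k) % 2 == 0 else 0.0 for k in range(K)] for j in range(G)]

def generate_sample(z_shared, u_study, sigma=0.1):
    """Sample x_j ~ N(f(mask_j * z)_j, sigma^2) for each feature j."""
    z = z_shared + u_study  # stacked latent vector (shared, then study block)
    xs = []
    for j in range(G):
        zj = [mask[j][k] * z[k] for k in range(K)]   # masked latent input
        h = relu(affine(W1, b1, zj))
        mean_j = affine(W2, b2, h)[j]                # j-th output of shared net
        xs.append(random.gauss(mean_j, sigma))
    return xs

x = generate_sample([0.5, -1.0], [0.3])
print(len(x))  # one value per feature
```

The masking step is what distinguishes this from a plain VAE decoder: each feature sees only its own subset of latent coordinates.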
2. Sparsity-Inducing Priors and Mask Structure
MSSVAE employs spike-and-slab lasso priors to ensure that each feature depends only on a small subset of the possible latent variables. Binary indicator variables $\gamma_{jk}$ determine whether a given weight $w_{jk}$ is "active" (low decay, $\lambda_1$) or heavily shrunk (high decay, $\lambda_0$), with $\lambda_0 \gg \lambda_1$:
$$p(w_{jk} \mid \gamma_{jk}) = \gamma_{jk}\,\psi(w_{jk} \mid \lambda_1) + (1 - \gamma_{jk})\,\psi(w_{jk} \mid \lambda_0), \qquad \psi(w \mid \lambda) = \tfrac{\lambda}{2}\, e^{-\lambda |w|}.$$
Indicators follow Bernoulli distributions with beta priors on activity probabilities $\eta_k$:
$$\gamma_{jk} \mid \eta_k \sim \mathrm{Bernoulli}(\eta_k), \qquad \eta_k \sim \mathrm{Beta}(a, b).$$
Integrating out the binary and beta priors yields an adaptive $\ell_1$-type penalty on $w_{jk}$, automatically limiting the effective number of active columns. Finite Indian Buffet Process (IBP) behavior can be induced by the hyperparameter setting $a = \alpha/K$ and $b = 1$.
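A minimal sketch of the spike-and-slab lasso machinery, assuming the standard Laplace-mixture form with slab rate `lam1` and spike rate `lam0 >> lam1`; the numeric values are illustrative only.

```python
import math

def laplace_pdf(w, lam):
    """psi(w | lam) = (lam / 2) * exp(-lam * |w|)."""
    return 0.5 * lam * math.exp(-lam * abs(w))

def ssl_density(w, eta, lam1, lam0):
    """Spike-and-slab lasso marginal: eta * slab + (1 - eta) * spike."""
    return eta * laplace_pdf(w, lam1) + (1 - eta) * laplace_pdf(w, lam0)

def inclusion_prob(w, eta, lam1, lam0):
    """Posterior P(gamma = 1 | w): the closed-form indicator expectation."""
    slab = eta * laplace_pdf(w, lam1)
    spike = (1 - eta) * laplace_pdf(w, lam0)
    return slab / (slab + spike)

# Illustrative values (lam0 >> lam1): large weights favour the slab,
# small weights are attributed to the spike and shrunk hard.
lam1, lam0, eta = 1.0, 50.0, 0.5
print(inclusion_prob(2.0, eta, lam1, lam0) > 0.99)   # True
print(inclusion_prob(0.01, eta, lam1, lam0) < 0.2)   # True
```

The sharp transition of `inclusion_prob` between small and large `|w|` is what prunes whole factor columns in practice.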
3. Encoder–Decoder Architecture
The model architecture comprises:
- Encoders: Each study $s$ has parameter sets $\phi_s$ for shared and study-specific latents, producing normal distributions via two-layer ReLU MLPs with batch normalization:
$$q_{\phi_s}(\mathbf{z}_{si}, \mathbf{u}_{si} \mid \mathbf{x}_{si}) = \mathcal{N}\!\big(\mu_{\phi_s}(\mathbf{x}_{si}),\ \operatorname{diag}\sigma^2_{\phi_s}(\mathbf{x}_{si})\big).$$
- Decoder: A two-layer ReLU MLP $f_\theta$, shared across all studies, receiving masked latent inputs via skip connections. The output layer is linear (for Gaussian likelihoods) or softplus (for NB-MSSVAE). Zero-padding and masking in the input $\mathbf{w}_j \odot \tilde{\mathbf{z}}_{si}$ ensure flexible factor-to-feature mapping.
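The encoder path can be sketched as a two-layer ReLU MLP mapping a feature vector to the mean and log-variance of a diagonal Gaussian over the latents. Batch normalization and the per-study parameter sets are omitted, and all sizes and weights below are toy values.

```python
import random

random.seed(0)

def relu(v):
    return [max(0.0, x) for x in v]

def mlp_layer(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def make_layer(n_out, n_in):
    """Random dense layer (stand-in for learned parameters)."""
    return ([[random.gauss(0, 0.3) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

G, H, K = 6, 4, 3          # features, hidden units, latent dims (toy sizes)
W1, b1 = make_layer(H, G)
W2, b2 = make_layer(2 * K, H)   # outputs: K means followed by K log-variances

def encode(x):
    """Two-layer ReLU encoder producing a diagonal Gaussian over latents."""
    h = relu(mlp_layer(W1, b1, x))
    out = mlp_layer(W2, b2, h)
    mu, logvar = out[:K], out[K:]
    return mu, logvar

mu, logvar = encode([0.1] * G)
print(len(mu) == 3 and len(logvar) == 3)
```

Predicting log-variances rather than variances keeps the output unconstrained, a common design choice in VAE encoders.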
4. Variational Inference and Optimization
Inference is performed via a hybrid scheme:
- E-step: Latent variables are sampled using the reparameterization trick; the expected indicator values are computed in closed form.
- M-step: Network parameters $(\theta, \phi)$, the mask matrix $W$, sparsity variables $\eta_k$, and noise variances $\sigma_j^2$ are updated via stochastic gradient ascent on the evidence lower bound (ELBO) using the Adam optimizer. The spike hyperparameter $\lambda_0$ is annealed during early epochs to stabilize training and prevent information switching between shared and study-specific decoders.
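One way to realize the spike-hyperparameter annealing is a simple linear ramp over early epochs. The schedule below is a hypothetical choice for illustration; the paper's exact ramp and endpoint values are not reproduced here.

```python
def lambda0_schedule(epoch, warmup_epochs=50, lam0_init=5.0, lam0_final=50.0):
    """Linearly ramp the spike penalty lam0 upward over early epochs.
    A weak spike early on lets the model explore factor assignments;
    the full penalty then locks in a sparse support. Values are illustrative."""
    if epoch >= warmup_epochs:
        return lam0_final
    t = epoch / warmup_epochs
    return lam0_init + t * (lam0_final - lam0_init)

print(lambda0_schedule(0))    # 5.0
print(lambda0_schedule(25))   # 27.5
print(lambda0_schedule(100))  # 50.0
```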
The ELBO for the Gaussian setting is
$$\mathcal{L} = \sum_{s=1}^{S} \sum_{i=1}^{n_s} \Big\{ \mathbb{E}_{q_{\phi_s}}\big[\log p_\theta(\mathbf{x}_{si} \mid \tilde{\mathbf{z}}_{si}, W)\big] - \mathrm{KL}\big(q_{\phi_s}(\mathbf{z}_{si} \mid \mathbf{x}_{si}) \,\|\, p(\mathbf{z})\big) - \mathrm{KL}\big(q_{\phi_s}(\mathbf{u}_{si} \mid \mathbf{x}_{si}) \,\|\, p(\mathbf{u})\big) \Big\} + \log p(W \mid \lambda_0, \lambda_1, \eta),$$
with standard-normal priors on the latent vectors and the spike-and-slab lasso prior contributing the $\log p(W \mid \cdot)$ penalty term.
The NB-MSSVAE replaces the likelihood and adjusts KL-divergences accordingly.
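The per-sample Gaussian ELBO (a Monte Carlo reconstruction term via the reparameterization trick, minus closed-form KL terms against a standard-normal prior) can be computed as in this sketch; the two-feature decoder is a stand-in for the masked MLP, and the sparsity penalty on the mask is omitted.

```python
import math
import random

def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, logvar))

def gaussian_loglik(x, mean, sigma2):
    """log N(x | mean, sigma2), summed over features."""
    return sum(-0.5 * (math.log(2 * math.pi * s2) + (xi - mi) ** 2 / s2)
               for xi, mi, s2 in zip(x, mean, sigma2))

def elbo_one_sample(x, enc_mu, enc_logvar, decoder, sigma2):
    """Single-draw Monte Carlo ELBO using the reparameterization trick."""
    z = [m + math.exp(0.5 * lv) * random.gauss(0, 1)
         for m, lv in zip(enc_mu, enc_logvar)]
    return gaussian_loglik(x, decoder(z), sigma2) - gaussian_kl(enc_mu, enc_logvar)

random.seed(1)
decoder = lambda z: [z[0], z[0] + z[1]]          # stand-in for the masked MLP
val = elbo_one_sample([0.2, -0.1], [0.0, 0.0], [-2.0, -2.0], decoder, [0.5, 0.5])
print(isinstance(val, float))
```

Averaging this quantity over samples and studies, and adding the mask prior term, gives the objective maximized in the M-step.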
5. Identifiability Theorems
Under anchor-feature assumptions and mild non-degeneracy conditions:
- True latent dimensionality and the set of anchor features (features depending on exactly one latent variable) are identifiable by analysis of the marginal correlation matrix.
- Noise variances are estimable via deconvolution on anchor coordinates.
- Correct support of $W$ (the masking matrix) is inferred by testing conditional variances: erroneous assignments alter conditional variances of data features.
- For multi-study data, identifiability proceeds by applying single-study results to each study, then assembling shared support by tracking coinciding columns; remaining columns encode study-specific factors.
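The anchor-feature logic can be illustrated on simulated linear data: features that depend on the same single latent are strongly marginally correlated, while anchors of different factors are nearly uncorrelated. This is a toy simulation of the signal the marginal correlation matrix carries, not the paper's estimation procedure.

```python
import math
import random

random.seed(0)

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / math.sqrt(va * vb)

N = 5000
z0 = [random.gauss(0, 1) for _ in range(N)]   # latent factor 0
z1 = [random.gauss(0, 1) for _ in range(N)]   # latent factor 1
noise = lambda: random.gauss(0, 0.3)

# Four anchor features: each depends on exactly one latent variable.
f = [[zi + noise() for zi in z0],
     [2 * zi + noise() for zi in z0],
     [zi + noise() for zi in z1],
     [-zi + noise() for zi in z1]]

same = abs(corr(f[0], f[1]))    # anchors of the same factor
cross = abs(corr(f[0], f[2]))   # anchors of different factors
print(same > 0.8 and cross < 0.1)
```

Blocks of highly correlated features in the marginal correlation matrix thus flag candidate anchors and, with them, the latent dimensionality.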
6. Application to Multi-Disease Platelet Transcriptomes
The NB-MSSVAE was applied to platelet gene expression profiles sampled from patients across disease groups (healthy, cardiovascular, MS, NSCLC, glioblastoma, other cancer), focusing on the most-variable genes. Model settings included initial shared and per-study latent dimensionalities and optimization for 800 epochs.
For each latent factor with its active gene set (the features whose mask entries on that factor are nonzero), gene-ontology (GO) enrichment was assessed via Fisher's exact test (Benjamini–Yekutieli-adjusted FDR < 0.05):
- Shared factors: 96% of clusters significantly enriched for at least one GO term, encompassing hemostasis, thrombosis, innate immunity, metabolism, protein synthesis.
- Disease-specific factors: 56% enriched (a lower rate, consistent with these clusters being small), with enrichments matching disease context (oxidative stress in cardiovascular, interferon/viral response in MS, apoptotic & NFκB signaling in cancers, housekeeping in healthy).
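The enrichment test for one factor's gene set against one GO term can be sketched as a one-sided Fisher's exact test computed from the hypergeometric upper tail; all counts below are hypothetical, and the multiple-testing (BY) adjustment is omitted.

```python
from math import comb

def hypergeom_pmf(k, M, n, N):
    """P(X = k): k term genes among N drawn from M total, n in the GO term."""
    return comb(n, k) * comb(M - n, N - k) / comb(M, N)

def fisher_enrichment_p(overlap, term_size, set_size, universe):
    """One-sided Fisher's exact p-value: P(overlap >= observed)."""
    upper = min(term_size, set_size)
    return sum(hypergeom_pmf(k, universe, term_size, set_size)
               for k in range(overlap, upper + 1))

# Hypothetical counts: 1000-gene universe, 50-gene GO term,
# 40-gene latent-factor cluster, 10 genes in common
# (expected overlap under independence is only 50 * 40 / 1000 = 2).
p = fisher_enrichment_p(10, 50, 40, 1000)
print(p < 0.05)
```

In practice one such p-value is computed per (factor, GO term) pair and the full collection is then FDR-adjusted.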
A plausible implication is that MSSVAE delivers meaningful biological insights with interpretable, data-driven factor structure and guarantees on recovery of genuine factor and mask identities under modeled assumptions.
7. Summary and Significance
The Multi-Study Sparse Variational Autoencoder constitutes a flexible framework supporting nonlinear factor analysis in multi-study high-dimensional datasets. Core virtues include:
- Modelling nonlinear shared and study-specific factor structures via neural networks.
- Ensuring interpretability and parsimony through adaptive spike-and-slab sparsity priors.
- Rigorous identifiability results under anchor-feature conditions.
- Demonstrated recovery of biologically significant variants and pathways in genomics.
The approach offers an effective paradigm for dissecting heterogeneous, high-dimensional observational studies where separation of shared and study-specific modes is critical (Moran et al., 26 Jan 2026).