
Multi-Study Sparse VAE

Updated 2 February 2026
  • The paper demonstrates that MSSVAE effectively disentangles shared and study-specific latent factors through a nonlinear, sparsity-driven encoder-decoder architecture.
  • Key methodologies include spike-and-slab lasso priors to induce selective feature-factor associations and variational inference via ELBO optimization.
  • The model yields interpretable solutions in genomics, with successful applications to multi-disease platelet transcriptome analysis and robust identifiability guarantees.

A Multi-Study Sparse Variational Autoencoder (MSSVAE) is a nonlinear probabilistic generative model designed to analyze high-dimensional datasets collected from multiple distinct studies or environments. Its architecture separates latent factors into those shared across all studies and those specific to individual studies. The model is equipped with sparsity-inducing mechanisms for feature-factor associations and delivers identifiability guarantees under anchor-feature assumptions, supporting interpretable solutions in domains such as genomics. The approach was proposed and theoretically characterized in a multi-disease platelet transcriptome context in (Moran et al., 26 Jan 2026).

1. Generative Model for Multi-Study Factorization

Let $m=1,\dots,M$ index studies, $i=1,\dots,n_m$ index samples in study $m$, and $j=1,\dots,G$ index features (e.g., genes). The MSSVAE posits:

  • A shared latent space $\bm z_i^{(m)}\in\mathbb{R}^{K_S}$, common across studies.
  • A study-specific latent space $\bm\zeta_i^{(m)}\in\mathbb{R}^{K_m}$ for each individual study.

Latent vectors for sample $i$ in study $m$ are stacked and zero-padded to form $\widetilde{\bm z}_i^{(m)}$, while feature-specific "mask" vectors $\widetilde{\bm w}_j^{(m)}$ similarly encode which latent factors affect each feature. The generative process for each feature is:

$$x_{ij}^{(m)} = f_{\theta,j}\left(\widetilde{\bm w}_j^{(m)} \odot \widetilde{\bm z}_i^{(m)}\right) + \varepsilon_{ij}^{(m)}, \quad \varepsilon_{ij}^{(m)} \sim N(0, \sigma_j^2)$$

where $f_\theta$ is a feed-forward neural network, and the masked product allows selective dependence of features on particular latent factors. The full data joint density is a product over studies, samples, and features, parameterized by learned weights, sparsity variables, and noise terms.
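As a toy illustration (not the authors' implementation), the masked generative step can be sketched in pure Python: stack the shared and study-specific latents, apply the feature-specific mask elementwise, decode through a small network standing in for $f_{\theta,j}$, and add Gaussian noise. All weights and dimensions here are hypothetical.

```python
import random

random.seed(0)

def relu(v):
    return [max(0.0, x) for x in v]

def decoder_j(v, W1, W2):
    """Tiny two-layer ReLU network standing in for f_{theta,j}."""
    h = relu([sum(w * x for w, x in zip(row, v)) for row in W1])
    return sum(w * x for w, x in zip(W2, h))

# Hypothetical decoder weights for one feature j (K_S = 2 shared + K_m = 1 specific).
W1 = [[0.5, -0.2, 0.1],
      [0.3, 0.8, -0.4]]
W2 = [1.0, -0.5]

def generate_x(z_shared, zeta, w_mask, sigma):
    """One draw of x_ij^(m): mask the stacked latent, decode, add noise."""
    z_tilde = z_shared + zeta                      # stacked (zero-padded) latent
    masked = [w * z for w, z in zip(w_mask, z_tilde)]
    return decoder_j(masked, W1, W2) + random.gauss(0.0, sigma)

# The mask zeroes out the second shared factor for this feature.
x = generate_x([0.4, -1.2], [0.7], w_mask=[1.0, 0.0, 1.0], sigma=0.1)
```

Setting `sigma=0` recovers the noiseless decoder output, which makes the effect of the mask easy to inspect.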

An alternative negative binomial (NB) version admits count-valued observations and models $x_{ij}^{(m)} \sim \mathrm{NB}(\mu = \ell_i^{(m)}\, f_{\theta,j}(\cdot), \phi_j)$, where $\ell_i^{(m)}$ represents the sample-specific library size.

2. Sparsity-Inducing Priors and Mask Structure

MSSVAE employs spike-and-slab lasso priors to ensure that each feature $j$ depends only on a small subset of the possible latent variables. Binary indicator variables $\gamma_{jk}$ determine whether a given weight $w_{jk}$ is "active" (low decay, $\lambda_1$) or heavily shrunk (high decay, $\lambda_0$), with $\lambda_0 \gg \lambda_1$:

$$p(w_{jk} \mid \gamma_{jk}) = \gamma_{jk} \frac{\lambda_1}{2} e^{-\lambda_1|w_{jk}|} + (1-\gamma_{jk}) \frac{\lambda_0}{2} e^{-\lambda_0|w_{jk}|}$$

Indicators $\gamma_{jk}$ follow Bernoulli distributions with beta priors on activity probabilities $\eta_k$:

$$\gamma_{jk} \sim \mathrm{Bernoulli}(\eta_k), \qquad \eta_k \sim \mathrm{Beta}(a_S, b_S)$$

Integrating out the binary indicators and beta priors yields an adaptive $\ell_1$-type penalty on $W$, automatically limiting the effective number of active columns. Finite Indian Buffet Process (IBP) behavior can be induced by the hyperparameter setting $a_S \propto 1/G$ and $b_S = 1$.
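For intuition, the closed-form responsibility $\mathbb{E}[\gamma_{jk} \mid w_{jk}, \eta_k]$ under this mixture-of-Laplace prior is simply the posterior probability that the weight came from the slab rather than the spike. A stdlib-only sketch with toy hyperparameters (not values from the paper):

```python
import math

def laplace_pdf(w, lam):
    """Double-exponential density: lam/2 * exp(-lam * |w|)."""
    return 0.5 * lam * math.exp(-lam * abs(w))

def gamma_posterior(w, eta, lam1, lam0):
    """E[gamma_jk | w_jk, eta_k]: posterior probability of the slab component."""
    slab = eta * laplace_pdf(w, lam1)
    spike = (1.0 - eta) * laplace_pdf(w, lam0)
    return slab / (slab + spike)

# Large weights are confidently assigned to the slab, tiny ones to the spike.
p_big = gamma_posterior(1.0, eta=0.1, lam1=1.0, lam0=50.0)
p_small = gamma_posterior(0.01, eta=0.1, lam1=1.0, lam0=50.0)
```

Because $\lambda_0 \gg \lambda_1$, even a modest $|w_{jk}|$ pushes the responsibility sharply toward the slab, which is what produces the adaptive shrinkage described above.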

3. Encoder–Decoder Architecture

The model architecture comprises:

  • Encoders: Each study has parameter sets $(\psi_S, \psi_m)$ for shared and study-specific latents, producing normal distributions via two-layer ReLU MLPs with batch normalization:

$$q_{\psi_S}(z_i^{(m)} \mid x_i^{(m)}) = N\big(\mu_{\psi_S}(x_i^{(m)}), \mathrm{diag}(\sigma_{\psi_S}^2(x_i^{(m)}))\big)$$

$$q_{\psi_m}(\zeta_i^{(m)} \mid x_i^{(m)}) = N\big(\mu_{\psi_m}(x_i^{(m)}), \mathrm{diag}(\sigma_{\psi_m}^2(x_i^{(m)}))\big)$$

  • Decoder: A two-layer ReLU MLP $f_\theta$, shared across all studies, receiving masked latent inputs via skip connections. The output layer is linear (for Gaussian likelihoods) or softplus (for NB-MSSVAE). Zero-padding and masking in $\widetilde{\bm w}_j^{(m)} \odot \widetilde{\bm z}_i^{(m)}$ ensure flexible factor-to-feature mapping.

4. Variational Inference and Optimization

Inference is performed via a hybrid scheme:

  • E-step: Latent variables are sampled using the reparameterization trick; the expected indicator values $\mathbb{E}[\gamma_{jk} \mid w_{jk}, \eta_k]$ are computed in closed form.
  • M-step: Network parameters $(\psi_S, \psi_m, \theta)$, the weight matrix $W$, sparsity variables $\eta$, and noise variances $\Sigma$ are updated via stochastic gradient ascent on the evidence lower bound (ELBO) using the Adam optimizer. The spike hyperparameter $\lambda_0$ is annealed during early epochs to stabilize training and prevent information switching between the shared and study-specific decoders.

The ELBO for the Gaussian setting is:

$$\mathcal{L} = \sum_{m=1}^M \sum_{i=1}^{n_m} \Big\{ \mathbb{E}_{q_{\psi_S} q_{\psi_m}} \left[ \log p_\theta(x_i^{(m)} \mid z_i^{(m)}, \zeta_i^{(m)}) \right] - D_{KL}\big(q_{\psi_S}(z_i^{(m)}) \,\Vert\, p(z_i^{(m)})\big) - D_{KL}\big(q_{\psi_m}(\zeta_i^{(m)}) \,\Vert\, p(\zeta_i^{(m)})\big) \Big\} + \text{mask prior terms}$$

The NB-MSSVAE replaces the likelihood and adjusts KL-divergences accordingly.
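A minimal sketch of one sample's contribution to the Gaussian ELBO, assuming toy encoder outputs and a stand-in decoder (a single reparameterized draw plus the closed-form KL terms against standard-normal priors; nothing here comes from the paper's implementation):

```python
import math
import random

random.seed(1)

def reparameterize(mu, sigma):
    """One Monte Carlo draw z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_diag_gaussian(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return sum(0.5 * (s * s + m * m - 1.0 - math.log(s * s))
               for m, s in zip(mu, sigma))

def gaussian_log_lik(x, mean, var):
    """log N(x | mean, var) for one feature."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

# Toy encoder outputs for one sample (shared and study-specific branches).
mu_s, sig_s = [0.2, -0.1], [0.9, 1.1]
mu_m, sig_m = [0.5], [0.8]

z = reparameterize(mu_s, sig_s) + reparameterize(mu_m, sig_m)
recon = gaussian_log_lik(x=1.3, mean=sum(z), var=0.25)   # stand-in for the decoder
elbo_i = recon - kl_diag_gaussian(mu_s, sig_s) - kl_diag_gaussian(mu_m, sig_m)
```

The mask prior terms in $\mathcal{L}$ would be added on top of this per-sample quantity; they involve the spike-and-slab responsibilities rather than the latents.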

5. Identifiability Theorems

Under anchor-feature assumptions and mild non-degeneracy conditions:

  • The true latent dimensionality $K$ and the set of anchor features (features depending on exactly one latent variable) are identifiable by analysis of the marginal correlation matrix.
  • Noise variances $\sigma_j^2$ are estimable via deconvolution on anchor coordinates.
  • The correct support of $W$ (the masking matrix) is inferred by testing conditional variances: erroneous assignments alter the conditional variances of data features.
  • For multi-study data, identifiability proceeds by applying single-study results to each study, then assembling shared support by tracking coinciding columns; remaining columns encode study-specific factors.
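To make the deconvolution idea concrete in the linear special case: for three anchor features $j, l, k$ of one factor with $x_j = w_j z + \varepsilon_j$ and $\mathrm{Var}(z) = 1$, the classical identity $w_j^2 = C_{jl} C_{jk} / C_{lk}$ (with $C$ the cross-covariance between distinct anchors) recovers the signal variance, and the noise variance falls out by subtraction. The loadings below are hypothetical.

```python
# Population covariances for three anchor features of one latent factor.
w = {"j": 2.0, "l": 1.5, "k": 0.5}       # hypothetical loadings, Var(z) = 1
sigma2_j = 0.3                           # true noise variance of feature j

var_xj = w["j"] ** 2 + sigma2_j          # Var(x_j) = signal + noise

def cov(a, b):
    """Cov(x_a, x_b) for distinct anchors: the noise terms cancel."""
    return w[a] * w[b]

# Three-anchor identity: w_j^2 = C_jl * C_jk / C_lk.
signal_j = cov("j", "l") * cov("j", "k") / cov("l", "k")
noise_j = var_xj - signal_j              # deconvolved noise variance
```

The paper's nonlinear argument is more involved, but it generalizes this same idea of isolating a feature's noise through its anchor structure.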

6. Application to Multi-Disease Platelet Transcriptomes

The NB-MSSVAE was applied to platelet gene expression profiles sampled from $N=1463$ patients across $M=6$ disease groups (healthy, cardiovascular, MS, NSCLC, glioblastoma, other cancer), focusing on the $G=5000$ most-variable genes. Model settings included initial $K_S=50$ (shared), $K_m=10$ (per-study), and optimization for 800 epochs.

For each latent $k$ with gene set $\mathcal{C}_k = \{j : |W_{jk}| > 0.5\}$, gene-ontology enrichment was assessed via Fisher’s exact test (BY-adjusted FDR < 0.05):

  • Shared factors: 96% of the $K_S$ clusters were significantly enriched for at least one GO term, encompassing hemostasis, thrombosis, innate immunity, metabolism, and protein synthesis.
  • Disease-specific factors: 56% were enriched (noting small cluster sizes), with enrichments matching disease context (oxidative stress in cardiovascular disease, interferon/viral response in MS, apoptotic and NFκB signaling in cancers, housekeeping in healthy).
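The enrichment computation can be sketched with stdlib-only code: the one-sided Fisher's exact p-value is a hypergeometric tail, and the BY adjustment is a step-up correction with a harmonic-sum factor. The gene counts below are toy numbers, not figures from the study.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): one-sided Fisher p-value
    for seeing k or more annotated genes in a cluster of size n."""
    denom = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(K, n) + 1)) / denom

def by_adjust(pvals):
    """Benjamini-Yekutieli FDR adjustment (step-up with harmonic factor)."""
    m = len(pvals)
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, prev = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m * c_m / rank)
        adj[i] = prev
    return adj

# Toy cluster: 8 of its 20 genes carry a GO term covering 50 of 1000 genes.
p = hypergeom_sf(k=8, N=1000, K=50, n=20)
adjusted = by_adjust([p, 0.04, 0.2])
```

In practice one would reach for `scipy.stats.fisher_exact` and `statsmodels`' `multipletests(method="fdr_by")`, but the underlying arithmetic is the same.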

A plausible implication is that MSSVAE delivers meaningful biological insights with interpretable, data-driven factor structure and guarantees on recovery of genuine factor and mask identities under modeled assumptions.

7. Summary and Significance

The Multi-Study Sparse Variational Autoencoder constitutes a flexible framework supporting nonlinear factor analysis in multi-study high-dimensional datasets. Core virtues include:

  • Modelling nonlinear, shared and study-specific factor structures via neural networks.
  • Ensuring interpretability and parsimony through adaptive spike-and-slab sparsity priors.
  • Rigorous identifiability results under anchor-feature conditions.
  • Demonstrated recovery of biologically significant factors and pathways in genomics.

The approach offers an effective paradigm for dissecting heterogeneous, high-dimensional observational studies where separation of shared and study-specific modes is critical (Moran et al., 26 Jan 2026).

