BASIL: Bayesian Gene-Set Latent Analysis

Updated 22 January 2026

BASIL is a scalable Bayesian factor analysis framework that integrates gene-set annotations to align latent variables with known biological pathways.
The method employs structured shrinkage priors and empirical Bayes estimation to extract interpretable transcriptional modules from RNA-seq data without costly MCMC sampling.
Empirical evaluations show competitive accuracy in covariance reconstruction and latent dimension selection, with speedups of roughly 30× over competitors such as PLIER.

Bayesian Analysis with gene-Sets Informed Latent space (BASIL) is a scalable Bayesian factor modeling framework that integrates pathway (gene set) annotations directly into latent variable analysis for RNA-sequencing (RNA-seq) data. BASIL addresses challenges of dimensionality reduction in transcriptomic studies by imposing structured priors guided by gene-set information, enhancing interpretability and improving robustness. The method operates on normalized RNA-seq expression matrices and leverages gene–set membership data to both align latent components with known biological pathways and enable automatic discovery of novel regulatory modules, while also providing principled uncertainty quantification and removing the need for costly Markov chain Monte Carlo (MCMC) methods (Mauri et al., 19 Jan 2026).

1. Modeling Objectives and Input Structure

BASIL is designed for the analysis of transcriptomic datasets. Its input is an $n\times p$ matrix $Y$ of normalized RNA-seq measurements ($n$ samples, $p$ genes), alongside a binary $p\times q$ matrix $C$ encoding gene-set membership ($c_{j\ell} = 1$ if gene $j$ is in set $\ell$, and $0$ otherwise). The major objectives are to:

Learn low-dimensional latent representations of gene expression profiles.
Align latent factors directly with known gene sets (pathways).
Discover de novo modules—unstructured components unexplained by current pathway knowledge.
Provide uncertainty quantification for loading coefficients and gene–gene covariances.
Automate all hyperparameter tuning without user intervention.
Achieve computational efficiency by circumventing MCMC sampling.

This direct use of gene-set membership ensures that inferred factors are biologically interpretable and facilitates the identification of both known and novel transcriptional programs.
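As a minimal sketch of the input structure, the membership matrix $C$ can be assembled from a mapping of gene-set names to gene identifiers. The helper `membership_matrix` and the identifiers below are hypothetical placeholders for illustration, not part of any BASIL software.

```python
import numpy as np

def membership_matrix(genes, gene_sets):
    """Return the binary p x q matrix C and the ordered gene-set names."""
    row_of = {gene: j for j, gene in enumerate(genes)}
    set_names = list(gene_sets)
    C = np.zeros((len(genes), len(set_names)), dtype=int)
    for col, name in enumerate(set_names):
        for gene in gene_sets[name]:
            if gene in row_of:  # skip annotated genes that were not measured
                C[row_of[gene], col] = 1
    return C, set_names

# Placeholder identifiers, not real genes or pathways.
genes = ["g1", "g2", "g3", "g4", "g5"]
gene_sets = {"set_A": ["g1", "g2"], "set_B": ["g2", "g3", "g9"]}
C, set_names = membership_matrix(genes, gene_sets)
```

Genes that belong to no set (here `g4` and `g5`) get all-zero rows in $C$, and a gene in several sets (here `g2`) gets several ones in its row.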

2. Generative Model Formulation

BASIL is built on a standard Bayesian factor analysis model for gene expression data. For each sample $i$ :

$y_i \mid \Lambda, \eta_i, \sigma^2 \sim N_p(\Lambda\,\eta_i,\,\sigma^2 I_p), \qquad \eta_i \sim N_k(0,I_k)$

Here, $\Lambda$ is the $p\times k$ factor loading matrix, $\eta_i$ is the sample-specific $k$-dimensional latent factor vector, and $\sigma^2 I_p$ models independent residual variance shared across genes. The marginal covariance of each sample is:

$\mathrm{cov}(y_i) = \Lambda \Lambda^\top + \sigma^2 I_p$

This framework enables the decomposition of gene expression variation into a low-dimensional latent space, with explicit modeling of both signal and residual noise.
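The generative model can be checked numerically with a short simulation. The sketch below draws samples from the factor model and compares the sample covariance with $\Lambda\Lambda^\top + \sigma^2 I_p$. The dimensions, seed, and the function name `simulate_factor_model` are arbitrary choices made for illustration.

```python
import numpy as np

def simulate_factor_model(n, Lambda, sigma2, rng):
    """Draw n samples y_i = Lambda @ eta_i + eps_i with eta_i ~ N(0, I_k)."""
    p, k = Lambda.shape
    eta = rng.standard_normal((n, k))                    # latent factors, one row per sample
    eps = np.sqrt(sigma2) * rng.standard_normal((n, p))  # independent residual noise
    return eta @ Lambda.T + eps, eta

rng = np.random.default_rng(0)
p, k, sigma2 = 6, 2, 0.5
Lambda = rng.standard_normal((p, k))
Y, eta = simulate_factor_model(50_000, Lambda, sigma2, rng)

implied_cov = Lambda @ Lambda.T + sigma2 * np.eye(p)
sample_cov = Y.T @ Y / Y.shape[0]                        # the mean is zero by construction
relative_gap = np.max(np.abs(sample_cov - implied_cov)) / np.max(implied_cov)
```

With many samples the relative gap should be small, reflecting that the low-rank term carries the shared signal and the diagonal term carries the noise.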

3. Incorporation of Gene Set Structure and Priors

Gene pathway information is incorporated through the prior structure on $\Lambda$ . BASIL decomposes the factor loading matrix as:

$\Lambda = C\,\Gamma + \Psi$

$C\Gamma$ ( $p \times k$ ): Represents the structured, gene-set-aligned loadings with $\Gamma$ a sparse $q \times k$ linkage matrix.
$\Psi$ ( $p \times k$ ): Captures the unstructured components lying in the orthogonal complement of the column space of $C$ (equivalently, the null space of $C^\top$), i.e., de novo factors not aligned with any known gene set.

Independent shrinkage priors are specified for the coordinates associated with gene sets and the null space:

$\lambda_{\mathcal C,\ell}\mid\sigma^2 \sim N_k\left(0,\,\tau_\Gamma^2\,\sigma^2\,I_k\right), \quad \ell=1,\dots,q$

$\lambda_{\mathcal N,m}\mid\sigma^2 \sim N_k\left(0,\,\tau_\Psi^2\,\sigma^2\,I_k\right), \quad m=1,\dots,p-q$

$\sigma^2 \sim \mathrm{IG}\!\left(v_0/2,\,v_0\sigma_0^2/2\right)$

Gene-set structure enters through $C\Gamma$: row $j$ of $C\Gamma$ is the sum of the rows of $\Gamma$ for the sets containing gene $j$, so genes in the same set share that set's loading contribution, which couples the latent loadings to pathway memberships. The null-space component $\Psi$ allows BASIL to capture biological signal not annotated in current databases.
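To make the decomposition concrete, the sketch below splits an arbitrary loading matrix into a part in the column space of $C$ and a part orthogonal to it. It assumes $C$ has full column rank and uses plain least squares for $\Gamma$, so the resulting $\Gamma$ is not sparse as in BASIL's prior; the membership matrix is a made-up toy example.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, k = 8, 3, 2

# Hypothetical binary membership matrix: gene j belongs to set l when C[j, l] == 1.
C = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
], dtype=float)

Lambda = rng.standard_normal((p, k))

# Orthogonal projector onto the column space of C (requires full column rank).
P_C = C @ np.linalg.solve(C.T @ C, C.T)

Gamma = np.linalg.solve(C.T @ C, C.T @ Lambda)  # q x k least-squares coefficients
structured = C @ Gamma                          # equals P_C @ Lambda
Psi = Lambda - structured                       # equals (I - P_C) @ Lambda
```

By construction `C.T @ Psi` is zero, and the last two genes, which belong to no set, contribute only through `Psi`.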

4. Empirical Bayes Hyperparameter Estimation and Inference

All shrinkage hyperparameters and the residual variance are estimated via an automatic empirical Bayes procedure, which operates as follows:

Compute the rank-$k$ truncated SVD: $Y \approx U D V^\top$, with $U$ of size $n\times k$, $D$ of size $k\times k$, and $V$ of size $p\times k$.
Define projections and loss metrics:

$L_{\mathcal C} = \frac{1}{n}\|P_C V D\|_F^2, \qquad L_{\mathcal N} = \frac{1}{n}\|(I_p-P_C) V D\|_F^2$

where $P_C = C(C^\top C)^{-1}C^\top$ .

Residual variance estimator:

$\hat\sigma^2 = \frac{\|(I_n - U U^\top)Y\|_F^2}{(n-k)p}$

Empirical Bayes scale parameters:

$\hat\tau_\Gamma^2 = \frac{L_{\mathcal C}}{k q \hat\sigma^2}, \qquad \hat\tau_\Psi^2 = \frac{L_{\mathcal N}}{k(p-q)\hat\sigma^2}$

A key property is that if, for example, the true unstructured component vanishes ( $\Psi = 0$ ), the corresponding regularization parameter ( $\hat\tau_\Psi^2$ ) converges to zero, conferring adaptive model selection.
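The estimators above translate directly into a few lines of NumPy. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the residual norm is taken as the sum of squared entries (squared Frobenius norm), $C$ is assumed to have full column rank, and the function name and simulated data are hypothetical.

```python
import numpy as np

def empirical_bayes_scales(Y, C, k):
    """Return (sigma2_hat, tau2_gamma_hat, tau2_psi_hat) from a rank-k SVD of Y."""
    n, p = Y.shape
    q = C.shape[1]
    U, d, Vt = np.linalg.svd(Y, full_matrices=False)
    U, d, V = U[:, :k], d[:k], Vt[:k].T           # keep the leading k components
    VD = V * d                                    # p x k, same as V @ diag(d)
    PVD = C @ np.linalg.solve(C.T @ C, C.T @ VD)  # P_C V D without forming P_C
    L_C = np.sum(PVD ** 2) / n
    L_N = np.sum((VD - PVD) ** 2) / n
    resid = Y - U @ (U.T @ Y)                     # (I_n - U U^T) Y
    sigma2_hat = np.sum(resid ** 2) / ((n - k) * p)
    tau2_gamma = L_C / (k * q * sigma2_hat)
    tau2_psi = L_N / (k * (p - q) * sigma2_hat)
    return sigma2_hat, tau2_gamma, tau2_psi

# Simulated data with no unstructured component (Psi = 0).
rng = np.random.default_rng(2)
n, p, q, k, sigma2 = 200, 40, 3, 2, 1.0
C = np.zeros((p, q))
C[0:10, 0] = 1
C[8:20, 1] = 1
C[18:30, 2] = 1
Gamma = 2.0 * rng.standard_normal((q, k))
Lambda = C @ Gamma
Y = rng.standard_normal((n, k)) @ Lambda.T + np.sqrt(sigma2) * rng.standard_normal((n, p))
sigma2_hat, tau2_gamma, tau2_psi = empirical_bayes_scales(Y, C, k)
```

In this simulated setting the estimated null-space scale should come out much smaller than the gene-set scale, which is the finite-sample counterpart of the adaptive behaviour described above.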

Posterior inference proceeds without MCMC:

Latent factors are initialized via PCA.
Posterior updates for loadings and noise variance are available in closed form, leveraging conjugate regression.
Coverage correction is applied by inflating posterior variances to achieve nominal frequentist coverage.
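The closed-form step can be illustrated with a generic conjugate Gaussian regression of each gene on fixed factor scores. This is a simplified sketch, not BASIL's exact update: it places the prior $\lambda_j \mid \sigma^2 \sim N_k(0, \tau^2\sigma^2 I_k)$ directly on each loading row instead of on the gene-set and null-space coordinates, omits the coverage correction, and uses hypothetical names (`conjugate_loading_update`, `sigma0_sq`).

```python
import numpy as np

def conjugate_loading_update(Y, eta, tau2, v0=1.0, sigma0_sq=1.0):
    """Normal-inverse-gamma posterior for loadings given factor scores eta.

    Returns the p x k posterior means, the shared k x k matrix A_inv
    (row covariance is sigma^2 * A_inv), and the inverse-gamma shape and
    scale of the posterior for sigma^2.
    """
    n, p = Y.shape
    k = eta.shape[1]
    A = eta.T @ eta + np.eye(k) / tau2         # row posterior precision is A / sigma^2
    A_inv = np.linalg.inv(A)
    M = np.linalg.solve(A, eta.T @ Y).T        # p x k posterior means, one row per gene
    quad = np.sum(Y ** 2) - np.sum((M @ A) * M)  # sum_j (y_j'y_j - m_j' A m_j)
    shape = 0.5 * (v0 + n * p)
    scale = 0.5 * (v0 * sigma0_sq + quad)
    return M, A_inv, shape, scale

rng = np.random.default_rng(3)
n, p, k = 500, 10, 2
Lambda_true = rng.standard_normal((p, k))
eta = rng.standard_normal((n, k))
Y = eta @ Lambda_true.T + 0.5 * rng.standard_normal((n, p))   # true sigma^2 = 0.25

M, A_inv, shape, scale = conjugate_loading_update(Y, eta, tau2=100.0)
sigma2_post_mean = scale / (shape - 1.0)       # mean of an inverse-gamma distribution
max_loading_error = np.max(np.abs(M - Lambda_true))
```

Because the matrix `A` is shared by all genes, every row of the loading matrix is updated with one linear solve, which is what makes this kind of update cheap and easy to parallelize.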

5. Computational Efficiency and Algorithmic Workflow

BASIL's core inference is composed of singular value decomposition and closed-form posterior updates:

Initial SVD of $Y$ : computational cost $O(\min\{np^2,n^2p\})$ .
Conjugate regression updates for all loading elements: $O(p\,k^2)$ .
All updates for loading matrix columns proceed in parallel, with no need for MCMC sampling.
Final posterior variance inflation factors ( $\rho_{\mathcal C}^2$ , $\rho_{\mathcal N}^2 > 1$ ) are estimated from data.

This design ensures scalability to settings where the number of genes $p$ is much larger than the number of samples $n$ (typical of bulk RNA-seq studies), with empirical run times on the order of seconds for $p = 3000$ and speedups of approximately $30\times$ versus established competitors such as PLIER.

6. Uncertainty Quantification and Inferential Guarantees

The Bayesian formulation yields full posterior distributions for all parameters, enabling computation of credible intervals for:

Individual loading coefficients $\lambda_{jh}$ .
Any desired entry of the gene–gene covariance matrix $\Lambda \Lambda^\top + \sigma^2 I_p$ .
Sample-specific latent factors $\eta_i$ via conditional posterior:

$\eta_i \mid y_i, \Lambda, \sigma^2 \sim N_k\left((\Lambda^\top\Lambda+\sigma^2 I_k)^{-1}\Lambda^\top y_i,\ (\sigma^{-2}\Lambda^\top\Lambda+I_k)^{-1}\right)$
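The conditional posterior of the latent factors is a standard Gaussian conditioning result and can be computed directly. The sketch below evaluates the mean and covariance for one sample and checks that the mean agrees with the equivalent form $\sigma^{-2}(\sigma^{-2}\Lambda^\top\Lambda + I_k)^{-1}\Lambda^\top y_i$. The inputs are random placeholders and `factor_posterior` is a hypothetical helper name.

```python
import numpy as np

def factor_posterior(y, Lambda, sigma2):
    """Mean and covariance of eta | y, Lambda, sigma^2 under eta ~ N(0, I_k)."""
    k = Lambda.shape[1]
    precision = Lambda.T @ Lambda / sigma2 + np.eye(k)
    cov = np.linalg.inv(precision)
    mean = np.linalg.solve(Lambda.T @ Lambda + sigma2 * np.eye(k), Lambda.T @ y)
    return mean, cov

rng = np.random.default_rng(4)
Lambda = rng.standard_normal((6, 2))
sigma2 = 0.5
y = rng.standard_normal(6)

mean, cov = factor_posterior(y, Lambda, sigma2)
alt_mean = cov @ Lambda.T @ y / sigma2   # same mean written via the posterior covariance
```

The posterior covariance does not depend on $y_i$, so it can be computed once and reused for all samples.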

Simulation studies demonstrate that coverage of Bayesian credible intervals for key estimates approaches the nominal 95% level.

7. Performance Metrics and Comparative Assessment

BASIL’s effectiveness is evaluated by the following metrics:

| Metric | BASIL Performance | Comparison |
| --- | --- | --- |
| Covariance Reconstruction | Lowest error in high- and low-signal scenarios | Outperforms PLIER, ROTATE |
| Latent Dimension Selection | Recovers true $k$ via Joint-Likelihood Information Criterion (JIC) | PLIER tends to overestimate $k$ |
| Run Time | Seconds even for $p = 3000$ | $\sim 30\times$ faster than PLIER |
| Out-of-Sample Log-Likelihood | Matches or improves upon PLIER | ROTATE often underperforms |

Covariance reconstruction error is measured as the relative Frobenius error $\|\widehat\Lambda\widehat\Lambda^\top - \Lambda_0\Lambda_0^\top\|_F/\|\Lambda_0\Lambda_0^\top\|_F$, where $\Lambda_0$ denotes the true loading matrix. BASIL’s model selection uses the Joint-Likelihood Information Criterion:

$\mathrm{JIC}(k) = -2\,\hat\ell_k + k\,\max(n, p)\,\log\{\min(n, p)\}$

In the reported simulations, JIC-based selection recovers the true latent dimension $k$, whereas PLIER tends to overestimate it, which suggests that BASIL is robust to overfitting $k$ (Mauri et al., 19 Jan 2026).
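Both evaluation quantities are straightforward to compute. The sketch below implements the relative covariance error and the JIC formula as written above; the log-likelihood is passed in as a number because its computation is not specified here, and the function names and test inputs are illustrative assumptions.

```python
import numpy as np

def relative_cov_error(Lambda_hat, Lambda_0):
    """Relative Frobenius error between estimated and true low-rank covariances."""
    S_hat = Lambda_hat @ Lambda_hat.T
    S_0 = Lambda_0 @ Lambda_0.T
    return np.linalg.norm(S_hat - S_0, "fro") / np.linalg.norm(S_0, "fro")

def jic(loglik, k, n, p):
    """JIC(k) = -2 * loglik + k * max(n, p) * log(min(n, p))."""
    return -2.0 * loglik + k * max(n, p) * np.log(min(n, p))

rng = np.random.default_rng(5)
Lambda_0 = rng.standard_normal((10, 3))
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal matrix

err_rotated = relative_cov_error(Lambda_0 @ R, Lambda_0)             # rotation leaves the covariance unchanged
err_zero = relative_cov_error(np.zeros_like(Lambda_0), Lambda_0)     # all-zero estimate
penalty_gap = jic(0.0, 3, n=100, p=3000) - jic(0.0, 2, n=100, p=3000)
```

The error is defined on $\Lambda\Lambda^\top$, so it is unaffected by the rotational ambiguity of factor loadings. Each additional factor adds $\max(n,p)\log\{\min(n,p)\}$ to the criterion, so an extra factor must raise $2\hat\ell_k$ by more than that amount to be selected.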

In summary, BASIL is a pathway-informed Bayesian factor model for bulk RNA-seq expression that aligns latent variables with gene-set annotations, captures de novo modules outside current annotations, provides uncertainty quantification, and scales to large gene panels without MCMC or manual tuning.

References (1)

Pathway-based Bayesian factor models for gene expression data (2026)
