Biology-Informed Bayesian Optimization

Updated 25 September 2025

BioBO is a framework that integrates Bayesian optimization with biological knowledge, using multimodal gene embeddings and pathway priors to efficiently design gene perturbation experiments.
It modifies traditional Bayesian acquisition functions by incorporating statistical gene set enrichment, leading to 25–40% improved labeling efficiency over conventional methods.
The approach enhances mechanistic interpretability by linking selected gene perturbations to enriched regulatory pathways, as evidenced in public CRISPR screening benchmarks.

Biology-Informed Bayesian Optimization (BioBO) is a methodological framework that integrates Bayesian optimization techniques with domain-specific biological knowledge to efficiently design, prioritize, and interpret perturbation experiments in genomics and biomedicine. The primary motivation is to improve sample efficiency and mechanistic interpretability in experimental settings where the search space is vast (e.g., genome-wide CRISPR perturbations), the cost of function evaluation is high, and prior biological information about gene–gene relationships, molecular pathways, or functional annotations can be leveraged to bias both modeling and search. BioBO achieves this by embedding multimodal biological data into its surrogate models and modifying acquisition strategies using statistical gene set enrichment analysis. Empirical results indicate superior labeling efficiency and mechanistic interpretability compared with conventional BO methods, as demonstrated in public genomic perturbation benchmarks (Li et al., 24 Sep 2025).

1. Problem Formulation and Motivation

Genome-wide perturbation experiments are foundational to drug discovery and target identification, but exhaustively exploring the entire gene space (∼20,000 genes) is impractical. The core objective in BioBO is to optimize an expensive black-box function $f(x)$, where $x$ represents a gene or gene embedding, by selecting a minimal set of perturbations that maximize experimental utility (e.g., inducing a desired phenotype). This is mathematically formalized as:

$x^* \in \arg\max_{x} f(x)$

Standard Bayesian optimization iteratively builds a surrogate model (such as a Gaussian process or Bayesian neural network) to predict $f(x)$ given labeled data, then selects new candidates via acquisition functions (e.g., expected improvement, upper confidence bound, or Thompson sampling). However, in biological domains, leveraging prior knowledge regarding pathway memberships, gene function, and regulatory networks can fundamentally bias both modeling and candidate selection towards biologically plausible hypotheses.
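
The acquisition functions named above can be sketched in a few lines. The following is a minimal illustration, assuming the surrogate returns a Gaussian predictive mean and standard deviation per candidate gene; the function names, the `kappa` parameter, and the numeric predictions are hypothetical and not taken from the BioBO paper.

```python
import math
import numpy as np

def upper_confidence_bound(mean, std, kappa=2.0):
    """UCB for maximization: optimistic estimate mean + kappa * std."""
    return np.asarray(mean, dtype=float) + kappa * np.asarray(std, dtype=float)

def expected_improvement(mean, std, best_so_far):
    """Closed-form EI for maximization under a Gaussian predictive distribution."""
    mean = np.asarray(mean, dtype=float)
    std = np.maximum(np.asarray(std, dtype=float), 1e-12)  # avoid division by zero
    z = (mean - best_so_far) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mean - best_so_far) * cdf + std * pdf

# Hypothetical surrogate predictions for four candidate genes.
mean = np.array([0.2, 0.5, 0.4, 0.1])
std = np.array([0.05, 0.10, 0.30, 0.50])

ucb = upper_confidence_bound(mean, std)
ei = expected_improvement(mean, std, best_so_far=0.45)

next_gene_ucb = int(np.argmax(ucb))  # index of the gene UCB would query next
next_gene_ei = int(np.argmax(ei))    # index of the gene EI would query next
```

In this toy setting the two criteria can disagree: UCB rewards the most uncertain candidate, while EI balances uncertainty against the gap to the current best observation.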

2. Integration of Multimodal Biological Data

Unlike conventional BO approaches that rely on singular gene representations, BioBO incorporates multiple modalities of biological information into gene embeddings. These modalities include:

Achilles descriptors: Encode each gene through its dependency profile measured across cancer cell lines.
Gene2Vec embeddings: Capture gene–gene relationships learned from co-expression data, thus representing functional similarity.
GenePT embeddings: Generated by applying a large language model to textual gene descriptions, capturing semantic and contextual information.

The multimodal fusion is reported to improve local surrogate accuracy near optimal regions relative to any single modality. For modeling, a Bayesian neural network (BNN) accepts the concatenated, fused embedding and is tasked with approximating $f(x)$, typically the phenotypic change induced by gene perturbation.
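
A minimal sketch of fusion by concatenation is shown below. The per-modality standardization step, the embedding dimensions, and the function names are assumptions made for illustration; the source only states that the modalities are concatenated into a fused embedding.

```python
import numpy as np

def standardize(emb):
    """Z-score each feature column so no modality dominates by scale."""
    emb = np.asarray(emb, dtype=float)
    std = emb.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (emb - emb.mean(axis=0)) / std

def fuse_embeddings(modalities):
    """Concatenate per-gene embeddings from several modalities along the feature axis."""
    n_genes = {m.shape[0] for m in modalities}
    if len(n_genes) != 1:
        raise ValueError("all modalities must describe the same genes in the same order")
    return np.concatenate([standardize(m) for m in modalities], axis=1)

# Random placeholders standing in for real embeddings (dimensions are arbitrary).
rng = np.random.default_rng(0)
n_genes = 5
achilles = rng.normal(size=(n_genes, 8))
gene2vec = rng.normal(size=(n_genes, 4))
genept = rng.normal(size=(n_genes, 6))

fused = fuse_embeddings([achilles, gene2vec, genept])
print(fused.shape)  # (5, 18)
```

The fused matrix has one row per gene and can be passed to any surrogate model that accepts fixed-length feature vectors.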

3. Enrichment Analysis and Biological Priors

BioBO exploits statistical gene set enrichment analysis (EA) to construct priors over the gene space. The EA procedure assesses whether top-performing perturbations identified by the surrogate model are over-represented in curated biological pathways from databases such as Gene Ontology or Hallmark.

For each pathway $P_i$, a combined score $c(P_i)$ is calculated:

$c(P_i) = -o(P_i) \log(p(P_i))$

where $o(P_i)$ is the odds ratio of enrichment, and $p(P_i)$ is the adjusted $p$-value from multiple hypothesis testing. Genes appearing in statistically significant pathways are then assigned higher prior probability in subsequent rounds, yielding a biological prior $\pi_n(x)$ for BO.
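
The combined score and one possible way of turning pathway scores into a gene-level prior can be sketched as follows. The score formula follows the definition above; the mapping from scores to $\pi_n(x)$ (base weight of 1 plus the scores of significant pathways, then normalization) is an assumption, since the source does not specify it. The `LIVER` pathway and all numeric statistics are made up for illustration.

```python
import math

def combined_score(odds_ratio, adj_p):
    """c(P) = -o(P) * log(p(P)); larger for strong, significant enrichment."""
    return -odds_ratio * math.log(adj_p)

def pathway_prior(genes, pathways, stats, alpha=0.05):
    """Assumed mapping from pathway scores to a gene-level prior.

    Each gene starts with weight 1 and gains the combined score of every
    significant pathway (adjusted p < alpha) it belongs to; weights are
    then normalized to sum to 1.
    """
    weights = {g: 1.0 for g in genes}
    for name, members in pathways.items():
        odds_ratio, adj_p = stats[name]
        if adj_p >= alpha:
            continue  # non-significant pathways do not change the prior
        score = combined_score(odds_ratio, adj_p)
        for g in members:
            if g in weights:
                weights[g] += score
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

genes = ["MYC", "E2F1", "CDK1", "ALB"]
pathways = {"MYC_TARGETS": {"MYC", "CDK1"}, "LIVER": {"ALB"}}
# (odds ratio, adjusted p-value) per pathway; illustrative numbers only.
stats = {"MYC_TARGETS": (4.0, 0.001), "LIVER": (1.1, 0.60)}

prior = pathway_prior(genes, pathways, stats)
```

Here `MYC` and `CDK1` receive most of the prior mass, while `E2F1` and `ALB` keep only the base weight because no significant pathway contains them.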

4. Augmented Acquisition Functions and Search Bias

The standard BO acquisition function $\alpha(x)$ is modified via the $\pi$BO framework:

$\pi \alpha(x) = \alpha(x) \cdot [\pi_n(x)]^{\beta / L_n}$

where

$\beta$ is a hyperparameter governing the influence of the biological prior,
$L_n$ is the number of labeled experiments,
$\pi_n(x)$ is the pathway-based prior probability for gene $x$.

Because the exponent $\beta / L_n$ shrinks toward zero as $L_n$ grows, the prior factor $[\pi_n(x)]^{\beta / L_n}$ approaches 1 and the augmented acquisition reverts to $\alpha(x)$. Early experimental rounds are therefore guided strongly by pathway knowledge, while later rounds rely increasingly on empirical measurements.
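
The decay of the prior's influence can be checked numerically. The sketch below applies the weighting formula to two hypothetical candidates; the acquisition values, prior values, and `beta` are illustrative. It assumes a non-negative acquisition function, since the multiplicative weighting is only meaningful in that case.

```python
import numpy as np

def prior_weighted_acquisition(alpha, prior, beta, n_labeled):
    """alpha(x) * prior(x) ** (beta / n_labeled), evaluated per candidate."""
    alpha = np.asarray(alpha, dtype=float)
    prior = np.asarray(prior, dtype=float)
    return alpha * prior ** (beta / n_labeled)

alpha = np.array([1.0, 1.2])  # gene 1 has the higher raw acquisition value
prior = np.array([0.8, 0.2])  # gene 0 is favoured by the pathway prior

early = prior_weighted_acquisition(alpha, prior, beta=10.0, n_labeled=5)
late = prior_weighted_acquisition(alpha, prior, beta=10.0, n_labeled=500)

print(int(np.argmax(early)))  # 0: the prior dominates with few labels
print(int(np.argmax(late)))   # 1: the raw acquisition dominates with many labels
```

With 5 labels the exponent is 2 and the prior overrides the raw ranking; with 500 labels the exponent is 0.02 and the ranking is essentially that of $\alpha(x)$.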

5. Mechanistic Interpretability and Pathway-Level Explanations

A significant strength of BioBO is its capacity for mechanistic interpretability. Integration of EA provides pathway-level explanations for selected genes, revealing that discovered perturbations are statistically enriched in regulatory circuits such as MYC_TARGETS, E2F_TARGETS, and G2M_CHECKPOINT. This mechanistic attribution is supported by enrichment evidence (overlap counts, adjusted $p$-values, odds ratios), enabling researchers to link experimental outcomes directly to established biological processes.
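
The enrichment evidence listed above can be computed from a 2x2 contingency table. The following sketch uses a one-sided hypergeometric test, which is a common choice for over-representation analysis but is an assumption here; the source does not name the test. The returned p-value is unadjusted, so a multiple-testing correction across pathways would still be needed. Gene identifiers and set sizes are hypothetical.

```python
from math import comb

def enrichment_evidence(selected, pathway, universe_size):
    """Overlap count, odds ratio and one-sided hypergeometric p-value."""
    selected, pathway = set(selected), set(pathway)
    k = len(selected & pathway)  # selected genes inside the pathway
    n, K, N = len(selected), len(pathway), universe_size
    # 2x2 contingency table: (selected / not selected) x (in pathway / not)
    a, b = k, n - k
    c, d = K - k, N - K - n + k
    odds_ratio = float("inf") if b * c == 0 else (a * d) / (b * c)
    # P(overlap >= k) when n genes are drawn at random from a universe of N
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return k, odds_ratio, tail / comb(N, n)

# Hypothetical universe of 100 genes, a 10-gene pathway, 5 selected genes.
pathway = [f"g{i}" for i in range(10)]
selected = ["g0", "g1", "g2", "g50", "g51"]

overlap, odds_ratio, p_value = enrichment_evidence(selected, pathway, universe_size=100)
```

Three of the five selected genes fall in the pathway, which is far more than the 0.5 expected by chance, so the odds ratio is large and the p-value small.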

6. Performance Evaluation and Experimental Results

Empirical validation on public CRISPR screening datasets (e.g., IFN-$\gamma$ and IL-2 assays, GeneDisco suite) demonstrates that BioBO achieves:

Labeling efficiency improvement: 25–40% reduction in experimental budget required to identify top-performing perturbations compared to conventional BO approaches.
Increased cumulative top-k recall: A higher proportion of true high-impact perturbations is recovered early in the search, for both BioUCB and BioEI variants (acquisition functions augmented with biological priors).
Fused embeddings outperform unimodal representations: Multi-modal surrogate models correlate more tightly with observed assay performance, especially near optimal solutions.
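
Cumulative top-k recall can be sketched as the fraction of the true top-k genes that have been queried by the end of each round. This definition and the function name are assumptions for illustration; the benchmark's exact definition may differ. The scores and batches below are made up.

```python
import numpy as np

def cumulative_topk_recall(true_scores, queried_per_round, k):
    """Fraction of the true top-k genes found after each acquisition round."""
    true_scores = np.asarray(true_scores, dtype=float)
    top_k = set(np.argsort(true_scores)[-k:].tolist())  # indices of the k best genes
    found, recall = set(), []
    for batch in queried_per_round:
        found |= top_k.intersection(batch)
        recall.append(len(found) / k)
    return recall

true_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]  # hypothetical phenotype per gene
rounds = [[0, 1], [2, 3], [4, 5]]             # gene indices queried in each round

recall_curve = cumulative_topk_recall(true_scores, rounds, k=3)
```

A method with better labeling efficiency reaches a given recall level in fewer rounds, which is how a budget reduction such as the one reported above would show up in this curve.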

7. Mathematical Foundation and Algorithmic Workflow

The algorithmic structure of BioBO can be summarized as follows:

Surrogate model construction: Train a BNN on fused gene embeddings to approximate $f(x)$ given the $L_n$ labeled outcomes.
Enrichment analysis: After each round, perform EA on newly labeled data to estimate biological priors $\pi_n(x)$ across pathways.
Acquisition modification: Update the acquisition function via $\pi \alpha(x)$ using the computed biological prior.
Candidate selection and evaluation: Query the next batch of genes maximizing the augmented acquisition function.
Pathway-level interpretation: Annotate optimized gene sets with enriched pathway signatures for mechanistic explanation.
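
The steps above can be tied together in a toy loop on synthetic data. This is a rough sketch under strong simplifying assumptions, not the authors' implementation: a bootstrap ridge-regression ensemble stands in for the BNN, each gene belongs to exactly one hypothetical pathway, and the prior is built from simple pathway counts among top labeled genes rather than from a statistical enrichment test. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, dim, n_pathways = 200, 8, 10
X = rng.normal(size=(n_genes, dim))                    # stand-in for fused embeddings
pathway_of = rng.integers(0, n_pathways, size=n_genes) # one hypothetical pathway per gene
f = X @ rng.normal(size=dim) + 2.0 * (pathway_of == 3) # hidden phenotype; pathway 3 is "active"

def fit_ensemble(Xl, yl, n_members=20, ridge=1.0):
    """Bootstrap ridge-regression ensemble: a cheap stand-in for a BNN."""
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(yl), size=len(yl))
        A = Xl[idx].T @ Xl[idx] + ridge * np.eye(Xl.shape[1])
        members.append(np.linalg.solve(A, Xl[idx].T @ yl[idx]))
    return np.array(members)

def predict(members, Xc):
    """Ensemble mean and spread as predictive mean and uncertainty."""
    preds = Xc @ members.T  # shape: (candidates, members)
    return preds.mean(axis=1), preds.std(axis=1)

def pathway_prior(labeled, y, top_frac=0.2):
    """Crude prior: up-weight pathways frequent among the top labeled genes."""
    n_top = max(1, int(top_frac * len(labeled)))
    top = np.asarray(labeled)[np.argsort(y)[-n_top:]]
    counts = np.bincount(pathway_of[top], minlength=n_pathways) + 1.0  # +1 keeps prior positive
    prior = counts[pathway_of]
    return prior / prior.sum()

labeled = rng.choice(n_genes, size=10, replace=False).tolist()
beta, batch_size = 10.0, 10
for _ in range(5):
    y = f[labeled]
    mean, std = predict(fit_ensemble(X[labeled], y), X)  # 1. surrogate
    prior = pathway_prior(labeled, y)                    # 2. prior from labeled data
    ucb = mean + 2.0 * std
    alpha = ucb - ucb.min() + 1e-9                       # shift so the acquisition is positive
    score = alpha * prior ** (beta / len(labeled))       # 3. prior-weighted acquisition
    score[labeled] = -np.inf                             # never re-query a labeled gene
    labeled.extend(np.argsort(score)[-batch_size:].tolist())  # 4. select next batch

best_found = float(f[labeled].max())
```

The final interpretation step would then run an enrichment analysis on the top labeled genes to report which pathways they concentrate in.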

8. Impact and Implications

BioBO establishes a framework for experimental optimization in genomics that combines data-driven surrogate modeling, principled Bayesian search, biologically grounded priors, and interpretable outputs. Applications include drug target prioritization, functional genomics, synthetic biology, and gene therapy. Its three methodological advances (multimodal embedding, enrichment-augmented acquisition, and mechanistic interpretability) address longstanding challenges in experimental budget constraints and hypothesis generation, as evidenced by higher discovery rates and pathway-level mechanistic insight in contemporary CRISPR screens (Li et al., 24 Sep 2025).

A plausible implication is that further integration of complex biological priors (e.g., cell-type specific regulatory networks or spatial transcriptomics data) into BioBO pipelines could deepen mechanistic interpretability and accelerate hypothesis-driven experimentation in diverse biomedical contexts.

References (1)

Li et al. (2025). BioBO: Biology-informed Bayesian Optimization for Perturbation Design.
