Pretrained Joint Predictions for Scalable Batch Bayesian Optimization of Molecular Designs
Abstract: Batched synthesis and testing of molecular designs is the key bottleneck of drug development. There has been great interest in leveraging biomolecular foundation models as surrogates to accelerate this process. In this work, we show how to obtain scalable probabilistic surrogates of binding affinity for use in Batch Bayesian Optimization (Batch BO). This demands parallel acquisition functions that hedge between designs and the ability to rapidly sample from a joint predictive density to approximate them. Through the framework of Epistemic Neural Networks (ENNs), we obtain scalable joint predictive distributions of binding affinity on top of representations taken from large structure-informed models. Key to this work is an investigation into the importance of prior networks in ENNs and how to pretrain them on synthetic data to improve downstream performance in Batch BO. Their utility is demonstrated by rediscovering known potent EGFR inhibitors on a semi-synthetic benchmark in up to 5x fewer iterations, as well as potent inhibitors from a real-world small-molecule library in up to 10x fewer iterations, offering a promising solution for large-scale drug discovery applications.
Explain it Like I'm 14
What is this paper about?
This paper is about speeding up drug discovery. When scientists design new small molecules (potential medicines), they usually make and test several at once, over many rounds. The authors show a smarter, faster way to pick which molecules to test in each round by using a special kind of machine-learning model. Their approach helps find strong “binders” (molecules that stick well to a target protein) in fewer rounds, saving time and cost.
What questions were the researchers trying to answer?
The team focused on two main questions:
- How can we choose a batch of molecules to test so that we both chase promising ideas and hedge our bets (don’t pick a bunch of near-duplicates)?
- How can we build a model that makes quick, reliable “joint predictions” about many molecules at once, including how their results are likely to be related?
In simpler terms: If you’re picking a team, you don’t want all players who do exactly the same thing. You want strong players and a mix of styles. The model should help pick a batch like that, and do it fast.
How did they do it?
Think of drug discovery as a game played over rounds:
- Each round, you choose a batch of molecules to make and test.
- You learn from the results, then choose the next batch.
- Repeat until you find top performers.
To make smart batch choices, you need a “surrogate model.” That’s like a video-game simulator of the real lab: it predicts how well each molecule might bind without physically testing it. The tricky part is predicting many molecules at once and understanding how similar molecules’ results move together.
Here’s their approach, explained with everyday ideas:
- Batch Bayesian Optimization (BO): A strategy for picking several items at once, balancing what you already know (exploitation) with exploring new options (exploration).
- Joint predictions: Instead of guessing each molecule’s score separately, the model predicts them together, so it can spot when choices are too similar and avoid “putting all eggs in one basket.”
- Epistemic Neural Networks (ENNs): A lightweight model family that creates many different “what-if” futures (called particles) very quickly. Each particle is like a different guess about the world.
- Prior network: This is like giving the model a good gut feeling before training. The authors “pretrain” this prior using fake-but-realistic practice problems (synthetic data) that mimic how binding scores behave. They use a reference process called a “warped Gaussian process” to generate practice curves with the right shape (for example, scores that are bounded and skewed like real measurements).
- Foundation model features: Instead of feeding raw molecule structures, they use a big pre-trained chemistry model called COATI to turn molecules into helpful numeric features (“embeddings”). Think of these as rich fingerprints for each molecule that help the model understand chemical similarity.
- Acquisition functions: Methods to choose a batch.
- EMAX: Picks a batch that is expected to include a very strong candidate.
- qPO: Picks a batch that is likely to contain the single best candidate overall.
- Both functions consider correlations, so they avoid choosing a batch of lookalike molecules that might all rise or fall together (a short code sketch after the list of terms below shows how they can be estimated from particles).
Technical terms in plain language:
- Gaussian process (GP): A smooth-curve generator used for modeling unknown functions.
- Particles: Many fast, slightly different versions of the model’s prediction, used to estimate uncertainty and make better batch decisions.
- pIC50: A score of how strong a molecule binds; higher pIC50 means stronger binding. Small increases in pIC50 can mean big improvements in actual potency.
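To make the batch scoring concrete, here is a minimal sketch (not the paper's released code) of how EMAX and qPO can be estimated from joint samples: rows are particles, columns are candidate molecules. The names `emax`, `qpo`, `joint_samples`, and `batch_idx` are illustrative.

```python
import numpy as np

def emax(joint_samples: np.ndarray, batch_idx: np.ndarray) -> float:
    """EMAX estimate: the expected best score within the batch.

    joint_samples: (K, N) array of K joint "particle" draws over N candidates.
    batch_idx:     indices of the B candidates in the proposed batch.
    """
    return joint_samples[:, batch_idx].max(axis=1).mean()

def qpo(joint_samples: np.ndarray, batch_idx: np.ndarray) -> float:
    """qPO estimate: probability that the batch contains the pool-wide best candidate."""
    batch_best = joint_samples[:, batch_idx].max(axis=1)  # best in batch, per particle
    pool_best = joint_samples.max(axis=1)                 # best in whole pool, per particle
    return float(np.mean(pool_best <= batch_best))

# Toy usage: 5,000 particles over 1,000 candidates, batch of 25
rng = np.random.default_rng(0)
samples = rng.normal(size=(5000, 1000))
batch = rng.choice(1000, size=25, replace=False)
print(emax(samples, batch), qpo(samples, batch))
```

More particles give steadier estimates, which is why the fast joint sampling highlighted below matters.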
What did they find, and why does it matter?
Key results:
- Pretraining the prior network (the model’s “gut feeling”) made its joint predictions better. In tests, this improved the model’s batch choices compared to simple, hand-crafted priors (like a plain linear layer).
- On a semi-synthetic EGFR dataset (a cancer-related target), their method rediscovered known potent inhibitors in up to 5 times fewer testing rounds than a strong greedy baseline.
- On a large, real-world small-molecule library, their approach found potent binders in up to 10 times fewer rounds.
- Rapid sampling matters: being able to generate thousands of joint predictions quickly (for example, 5,000 particles over 50,000 molecules in about 13 seconds on a single GPU) gave more reliable batch selection. When they reduced the number of particles, performance dropped.
Why it matters:
- Fewer rounds to reach strong molecules means less time and lower cost.
- Better hedging in each batch avoids testing many near-duplicates, which wastes lab time.
- The method scales to big libraries, which is essential in modern drug discovery.
What is the potential impact?
This work suggests a practical path to faster, large-scale drug discovery:
- Smarter batch choices: ENNs with pretrained priors help labs pick diverse, high-potential molecules each round.
- Scalable to big datasets: Quick joint predictions make it usable on tens of thousands of molecules, not just tiny sets.
- Flexible: The idea can plug into richer, structure-aware features from advanced biology models and can be extended beyond binding to other drug properties, like how the body absorbs and processes a drug (ADME).
- Real-world gains: In their tests, the approach repeatedly found potent inhibitors much faster than common baselines, which could shorten the time to discover new medicines.
In short, the paper shows that giving the model a strong, data-driven “gut feeling” and making fast, well-informed joint predictions lets scientists find better molecules in fewer rounds. That can make drug discovery more efficient and more likely to succeed.
Knowledge Gaps
Below is a concise, actionable list of knowledge gaps, limitations, and open questions identified in the paper. These focus on what is missing, uncertain, or left unexplored, and are intended to guide future research.
- Generality across targets and modalities: Results are shown for a single public target (EGFR) and one proprietary screen; it remains unknown how well pretrained ENN priors transfer across diverse targets, protein classes, and binding modes.
- Ligand-only representations: All experiments use ligand-only COATI embeddings. The benefit of structure-aware or pairwise representations (e.g., co-fold, AF3/Pairformer features) for joint predictions in Batch BO is untested.
- Prior mismatch sensitivity: The prior network is pretrained on warped GP sample paths with hand-chosen kernels/warps. There is no systematic study of sensitivity to mismatched reference processes, kernel hyperparameters, or warp choices when real assay functions deviate from GP-like assumptions.
- Reference-process selection: No principled method is provided to choose or learn the synthetic “reference process” for prior pretraining. How to select or adapt this process per-target or per-assay remains open.
- Alternative priors: The paper does not compare GP-based functional priors to physics-based or simulator-derived priors (e.g., docking, ML-based structure scoring, cheminformatics rules) and their impact on joint predictive quality.
- Calibration on real data: Joint predictive calibration is only evaluated on synthetic data (joint NLL via augmented dyadic sampling). There is no calibration assessment on real assays (e.g., joint reliability diagrams, correlation calibration, predictive rank correlation between designs).
- Noise modeling: Batch BO is run with noiseless evaluations; assay noise, replicate variability, and heteroscedasticity are not modeled. The impact of noise on joint acquisition (EMAX/qPO) and ENN training is unexplored.
- Acquisition-function scope: Only EMAX and qPO are used. There is no comparison to strong batched alternatives (e.g., qEI/qNEI, knowledge gradient, LITE estimators), nor hybrid/portfolio strategies that may be more sample-efficient.
- Optimization of acquisitions: EMAX is optimized by simple random swaps without convergence guarantees. More powerful combinatorial optimizers or differentiable relaxations, and their compute/quality trade-offs, are not evaluated.
- Particle budget guidelines: While empirical guidance suggests K ~ O(B²) for EMAX, there are no principled error bounds, adaptive particle-allocation schemes, or diagnostics to determine K online for given pool/batch sizes.
- Scaling to very large pools: qPO scales linearly with pool size and both EMAX/qPO require many particles. Strategies for pools in the millions (subsampling, two-stage screening, candidate pruning) and their impact on optimality are not addressed.
- Computational cost accounting: The wall-clock, GPU memory, and energy costs of pretraining the prior, retraining from scratch each iteration, and evaluating large particle sets are not quantified or compared to ensemble or sparse GP baselines.
- Retraining from scratch: Models are retrained each iteration; the benefits/risks vs. warm-starting or online/continual updates (catastrophic forgetting, bias-variance trade-offs, faster convergence) are not studied.
- Baseline breadth: Comparisons exclude strong scalable probabilistic baselines (e.g., sparse GPs/deep kernel learning, stochastic variational GPs, deep ensembles with function-space regularization). Head-to-head performance and efficiency remain unknown.
- Representation ablation: The effect of different molecular embeddings (e.g., ESM, graph transformers, learned task-specific adapters) on joint correlation structure and BO outcomes is not explored.
- Mean-function role: The Pretrained Epinet drops the base mean function to induce non-Gaussian marginals. The trade-off between richer marginals and potential bias/underfitting, especially as data accumulates, is not analyzed.
- Epistemic-index design: The dimensionality, distribution, and sampling of the epistemic index z (Sobol + Gaussian icdf) are fixed without ablation. How z-design affects joint correlations and acquisition performance is unclear.
- Architectural specification: Details of the prior/learnable network architectures (depth/width/activations) and their effect on joint prediction fidelity and compute are limited; systematic architecture search is missing.
- Mis-specification robustness: The robustness of Batch BO to severely mis-specified priors (wrong kernel class, multi-modality, discontinuities) and to adversarial prior choices is not quantified.
- Multi-objective and constrained BO: Extensions to multiple properties (potency, ADME, selectivity) and constraints (toxicity, synthesizability, novelty) are not implemented or evaluated.
- De novo design coupling: The method operates over fixed libraries. Its integration with generative design loops (propose-and-rank, on-policy generation guided by joint acquisitions) is untested.
- Realistic campaign considerations: Synthesis batching constraints, per-iteration turnaround time, cost-aware acquisitions, and diminishing returns across DMTA cycles are not modeled.
- Data realism in EGFR benchmark: Semi-synthetic labels for decoys are generated via a GP conditioned on known inhibitors, which may favor GP-like priors. Sensitivity to different decoy-labeling schemes (e.g., docking, QSAR, random negatives) is not assessed.
- External validity of tArray results: The proprietary dataset and metric (fold-over fluorescence) limit reproducibility. It is unclear how well outcomes transfer to standard potency endpoints or other experimental platforms.
- Metric diversity: Optimization is measured by Top-1/Top-10 normalized potency and AUC. Classic BO metrics (simple/cumulative regret, hit rate at potency thresholds, enrichment factors) and their alignment with discovery goals are not reported.
- Out-of-distribution behavior: The method’s behavior far from the training distribution (e.g., novel chemotypes, scaffold hops) and the reliability of joint uncertainty under distribution shift are not characterized.
- Pool-selection bias: Warm-start exclusions based on nearest neighbors may shape early data distribution; the impact of different warm-start strategies on BO dynamics is not explored.
- Theoretical properties: There is no theory on when pretrained ENN priors guarantee improved joint log-loss or acquisition efficiency, nor analysis of the induced function class and correlation structure relative to the reference process.
- Adaptive prior refinement: There is no mechanism to adapt or recalibrate the prior during a campaign (meta-learning, hyperparameter posterior updates, Bayesian model averaging over reference processes).
- Practical guidance: Clear recipes are missing for choosing reference-process hyperparameters, particle counts, batch sizes, acquisition optimizers, and representation choices as a function of data scale, target type, and compute budget.
- Robustness to assay artifacts: Handling outliers, batch effects, and plate/array confounders in high-throughput assays and their impact on joint predictions/acquisitions are not addressed.
- Uncertainty decomposition: The method does not disentangle aleatoric vs. epistemic uncertainty in joint predictions, which could improve acquisition design and noise-aware training.
Practical Applications
Immediate Applications
The following items summarize concrete, deployable use cases that can be implemented now, drawing on the paper’s demonstrated methods and results.
- Healthcare/Pharma — Batch selection in HTS and DMTA cycles
- Use case: Replace greedy batch selection with ENN-based Batch BO (EMAX or qPO) to prioritize compound batches for synthesis/assay while hedging correlations among candidates.
- Sector: Healthcare/Pharmaceutical R&D; CROs.
- Tools/products/workflows:
- “Batch Picker” microservice that exposes EMAX and qPO over a joint predictive distribution built with a Pretrained Epinet using COATI embeddings.
- Integration with ELNs and lab automation to retrain the surrogate after each round (trained from scratch per iteration, as in the paper), sample 5,000 particles, and select batches of 25–50 compounds (a minimal loop sketch follows this item's assumptions).
- Metrics dashboard for Top-1 and Top-10 potency (e.g., pIC50) and area-under-curve tracking across iterations.
- Evidence from paper: 5× fewer iterations to match greedy Top-1 pIC50; 7–9× improvement in final Top-10 pIC50/IC50-fold vs baselines; 10× fewer iterations in microarray library to surpass greedy Top-1 FO.
- Assumptions/dependencies:
- Access to high-quality latent embeddings (e.g., COATI) and GPU resources; paper reports 13 seconds to sample 5,000 particles for 50,000 compounds on an Nvidia A100.
- Warm-start data (e.g., 100–5,000 initial compounds) to anchor training; batch sizes tuned to lab throughput.
- The surrogate retrains each round; data are noiseless or calibrated (the paper uses precomputed labels), with warping consistent with assay distributions.
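As referenced above, here is a minimal sketch of the per-round retrain-sample-select loop. `train_epinet`, `sample_particles`, and `run_assay` are hypothetical placeholders, not the paper's API; the batch search follows the single-random-swap stochastic local search described for EMAX.

```python
import numpy as np

def select_batch_emax(joint_samples, batch_size, n_steps=2000, rng=None):
    """Maximize the EMAX estimate by stochastic local search with single random swaps."""
    rng = rng or np.random.default_rng()
    n_candidates = joint_samples.shape[1]
    batch = rng.choice(n_candidates, size=batch_size, replace=False)
    best = joint_samples[:, batch].max(axis=1).mean()
    for _ in range(n_steps):
        proposal = batch.copy()
        pos = rng.integers(batch_size)                          # position to swap out
        outside = np.setdiff1d(np.arange(n_candidates), batch)  # candidates not in batch
        proposal[pos] = rng.choice(outside)                     # single random swap
        score = joint_samples[:, proposal].max(axis=1).mean()
        if score > best:
            batch, best = proposal, score
    return batch

# Closed-loop skeleton (placeholders: train_epinet, sample_particles, run_assay)
# X, y = warm_start_data()
# for round_idx in range(n_rounds):
#     model = train_epinet(X, y)                      # retrained from scratch each round
#     joint = sample_particles(model, pool, k=5000)   # (5000, len(pool)) joint draws
#     batch = select_batch_emax(joint, batch_size=50)
#     y_new = run_assay(pool[batch])                  # synthesize and test
#     X, y = append(X, y, pool[batch], y_new)
```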
- Healthcare/Pharma — Microarray-based screening prioritization (tArray-style)
- Use case: Prioritize 50-compound batches across ~50,000 candidate molecules to maximize fold-over (FO) per iteration, using rapid ENN joint sampling to evaluate acquisition functions.
- Sector: Healthcare/Pharmaceutical HTS platforms.
- Tools/products/workflows:
- “Microarray Acquisition Optimizer” that runs EMAX/qPO powered by ENN joint samples, orchestrated with microarray scheduling.
- Real-time batch re-ranking using quasi-random Sobol epistemic indices and efficient GPU inference (a small index-generation sketch follows this item).
- Evidence from paper: EMAX/qPO outperform greedy in the first iteration; AUC degrades predictably with fewer particles, emphasizing the need for rapid joint sampling.
- Assumptions/dependencies:
- Reliable FO measurements and consistent negative controls.
- Sufficient compute and an optimized inference stack for batched forward passes.
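A small sketch of the quasi-random epistemic indices mentioned above: draw a low-discrepancy Sobol sequence (with a burn-in of 100, as described in the paper) and map it through the unit Gaussian inverse CDF. The index dimension and scrambling choice here are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm, qmc

def epistemic_indices(n_particles: int, dim: int, burn_in: int = 100, seed: int = 0):
    """Quasi-random Gaussian epistemic indices: Sobol points pushed through the Gaussian icdf."""
    sobol = qmc.Sobol(d=dim, scramble=True, seed=seed)
    sobol.fast_forward(burn_in)                  # skip the first `burn_in` points
    u = sobol.random(n_particles)                # (n_particles, dim) points in (0, 1)
    u = np.clip(u, 1e-6, 1 - 1e-6)               # keep the icdf finite at the endpoints
    return norm.ppf(u)                           # standard-normal index samples

z = epistemic_indices(n_particles=5000, dim=8)   # the dimension of z is a modeling choice, not from the paper
```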
- Healthcare/Pharma — Lead optimization with hedged exploration
- Use case: In hit-to-lead and lead optimization, use ENN joint predictions to balance exploitation of high predicted potency with hedging against correlated failures inside a batch.
- Sector: Medicinal chemistry programs.
- Tools/products/workflows:
- “Hedged Lead Selection” workflow combining Pretrained Epinet surrogates with EMAX inner-loop stochastic local search (single random swaps).
- Periodic re-warping/calibration of labels to align surrogate outputs with bounded, skewed assay distributions.
- Assumptions/dependencies:
- Warping functions (e.g., sigmoid-power) approximate assay marginals; sufficient representation coverage (ligand-only works now; structure-aware preferred when available).
- Academia — Reproducible active learning experiments for correlated candidates
- Use case: Adopt ENN-based joint predictive distributions to study batch selection under candidate correlations; benchmark with joint NLL and augmented dyadic sampling.
- Sector: Academic ML for scientific discovery; computational chemistry.
- Tools/products/workflows:
- Use the provided “terramax” code to replicate EGFR and microarray results; run ablations on prior network designs (linear vs RFF vs pretrained).
- Teaching labs on function-space priors; synthetic GP-warped datasets to evaluate joint vs marginal NLL (a minimal warped-GP sampling sketch follows this item).
- Assumptions/dependencies:
- Student-accessible GPU resources; availability of public datasets (e.g., BindingDB EGFR).
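The GP-warped synthetic data referenced above, as a minimal sketch: draw one GP sample path under a Matern-3/2 kernel and pass it through a sigmoid-power warp to get bounded, skewed labels. Only the kernel family and the 0.5768 lengthscale come from the text; the warp parameters and the stand-in embeddings are illustrative.

```python
import numpy as np

def matern32_kernel(X, lengthscale=0.5768):
    """Matern-3/2 covariance between rows of X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    r = np.sqrt(3.0) * d / lengthscale
    return (1.0 + r) * np.exp(-r)

def sigmoid_power_warp(h, a=0.0, b=1.0, c=2.0):
    """Bounded, skewed marginals: sigmoid((h - a) / b) ** c (a, b, c are illustrative)."""
    return (1.0 / (1.0 + np.exp(-(h - a) / b))) ** c

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))                        # stand-in for molecule embeddings
K = matern32_kernel(X) + 1e-6 * np.eye(len(X))       # jitter for numerical stability
path = rng.multivariate_normal(np.zeros(len(X)), K)  # one GP sample path
labels = sigmoid_power_warp(path)                    # synthetic, assay-like labels in (0, 1)
```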
- Software — Deployable Batch BO service for molecular design
- Use case: Offer a cloud/API service that ingests embeddings and assay data, builds ENN surrogates with pretrained priors, and returns hedged batches via EMAX/qPO.
- Sector: Software, AI platforms for drug discovery.
- Tools/products/workflows:
- Components: Synthetic-Prior Trainer (matches GP sample paths), Epinet Sampler (fast joint sampling), Batch Optimizer (EMAX/qPO), Calibration & Monitoring.
- Assumptions/dependencies:
- Foundation-model embeddings (e.g., COATI) under license; data governance and privacy compliance.
- Robotics/Lab Automation — Closed-loop batch selection
- Use case: Combine ENN-based Batch BO with automated synthesis and screening robots to iterate DMTA cycles with hedged batch choices.
- Sector: Robotics, automated experimentation.
- Tools/products/workflows:
- Sequence: Acquire batch → synthesize → assay → retrain surrogate → repeat; integrate with scheduling and inventory systems.
- Assumptions/dependencies:
- Reliable data ingestion latency; standardized retraining triggers; robust versioning of priors and epistemic index buffers.
Long-Term Applications
These items are feasible but require additional research, scaling, or development beyond the paper’s current scope.
- Healthcare/Pharma — Multi-objective Batch BO (potency + ADME/Tox + developability)
- Use case: Simultaneous optimization of potency and liabilities (e.g., hERG, solubility, permeability) using multi-task ENN surrogates and joint acquisition functions that hedge correlations across objectives.
- Sector: Healthcare/Pharma; biotech.
- Tools/products/workflows:
- “Multi-Property Batch Optimizer” with Pareto-aware or scalarized EMAX variants under joint predictive distributions.
- Dependencies/assumptions:
- High-quality multi-property datasets; robust multi-task priors; careful calibration across heterogeneous assays and noise levels.
- Healthcare/Pharma — Structure-informed batch optimization
- Use case: Integrate pair representations from Pairformer or AlphaFold 3 to capture target–ligand structural interactions in joint predictions for batch BO.
- Sector: Structural biology, computational chemistry.
- Tools/products/workflows:
- “Structure-Informed Epinet” adapters that take fixed structure-aware latents; pretraining priors on physics-based simulators or co-folding-derived function ensembles.
- Dependencies/assumptions:
- Access to reliable structures/co-folds; computational costs and licensing; validation against prospective assays.
- Healthcare/Pharma — Closed-loop generative design with hedged batch acquisition
- Use case: Couple generative models (e.g., COATI-LDM) for proposal with ENN Batch BO for selection to form a self-driving pipeline: propose → prioritize (hedged) → make/test → update.
- Sector: Drug design platforms.
- Tools/products/workflows:
- “Design-Make-Test-Optimize” orchestrator combining particle guidance diversity with joint acquisition hedging; chemistry-aware constraints.
- Dependencies/assumptions:
- On-demand or rapid synthesis capabilities; generative model safety filters; robust priors that generalize out-of-distribution.
- Cross-domain science — Materials, catalysts, polymers discovery
- Use case: Apply ENN joint predictions and Batch BO to optimize material properties (e.g., catalytic activity, conductivity) with simulation-driven functional priors.
- Sector: Materials science; energy.
- Tools/products/workflows:
- “Simulation-Prior Epinet” trained against ensembles of physics simulators; batch selection for experimental campaigns.
- Dependencies/assumptions:
- Simulator fidelity and diversity; mapping between simulation and assay domains; domain shift handling.
- Personalized medicine — Variant-specific ligand optimization
- Use case: Optimize ligand batches against different patient-specific variants (e.g., EGFR mutations), hedging within variant-defined pools.
- Sector: Precision therapeutics.
- Tools/products/workflows:
- Variant-aware embeddings and priors; per-variant joint acquisition functions and batch planning.
- Dependencies/assumptions:
- Access to patient variant data; ethical/regulatory frameworks; validation of surrogate transferability per variant.
- Policy/Regulation — Standards for AI surrogates in drug discovery
- Use case: Establish guidance on documenting functional priors, joint prediction calibration, acquisition function choices, and reproducibility for AI-driven batch selection.
- Sector: Policy, regulatory science.
- Tools/products/workflows:
- “AI Surrogate Audit Pack”: reports on prior specification, particle counts, joint NLL, calibration curves, data lineage.
- Dependencies/assumptions:
- Cross-industry consensus; transparent model reporting; prospective, peer-reviewed validations.
- Education/Training — Curriculum and benchmarks for joint predictive modeling
- Use case: Create teaching modules and standardized benchmarks focusing on joint vs marginal predictive performance, function-space priors, and batch acquisition efficiency.
- Sector: Academia; professional training.
- Tools/products/workflows:
- “Joint NLL Challenge” datasets with warped GP paths and real assay tasks; public leaderboards.
- Dependencies/assumptions:
- Community adoption; accessible tooling; sustained dataset curation.
- Software/Platforms — Enterprise-scale Batch BO as a managed service
- Use case: Offer managed ENN-based Batch BO that scales particle sampling, maintains prior libraries (synthetic and simulator-based), and supports multi-target campaigns.
- Sector: Software, cloud AI services for discovery.
- Tools/products/workflows:
- GPU autoscaling, prior registry, per-campaign calibration services, MLOps for retraining each iteration, and ELN/LIMS connectors.
- Dependencies/assumptions:
- Cost-effective GPU availability; data governance; IP/licensing for foundation models and structure predictors.
Cross-cutting assumptions and dependencies
- Quality of latent representations: Ligand-only COATI embeddings work now; structure-aware latents are preferred for broader generalization.
- Prior network specification: Pretraining on synthetic reference processes (e.g., warped GPs) improves joint predictions; mis-specified priors can hurt hedging and batch efficiency.
- Particle counts and compute: Convergent estimates typically require thousands of particles (∝ batch size squared); fast joint sampling (GPU) is crucial.
- Data calibration and warping: Assay marginals are often bounded/skewed; label warping improves surrogate fit and calibration.
- Distribution shift: Generalization across targets, libraries, and assays requires careful validation and potentially simulator-informed priors.
- Workflow integration: Success depends on seamless retrain-select-test loops, versioned priors, and robust data pipelines.
Glossary
- ADME: Acronym for absorption, distribution, metabolism, and excretion; key pharmacokinetic properties relevant to drug discovery. "extending it to other properties such as ADME and beyond."
- additive prior functions: Frozen neural network components added to ENNs that encode prior beliefs and are conditioned on the same latent variable. "a key design component of ENNs is the use of additive prior functions, which are frozen networks f_φ(x, z) conditioned on the same latent variable z (Dwaracherla et al., 2022)."
- augmented dyadic sampling: A technique for evaluating joint predictive performance (e.g., joint NLL) of models via paired sampling. "joint negative log-loss evaluated using augmented dyadic sampling (Osband et al., 2021)."
- Batch Bayesian Optimization (Batch BO): Bayesian optimization where multiple candidates are selected and evaluated in parallel per iteration. "This process is naturally framed as Batch Bayesian Optimization (Batch BO) (Garnett, 2023)."
- Bayesian Neural Networks (BNNs): Neural networks with probabilistic weight priors enabling Bayesian inference over parameters. "ensemble-based Bayesian Neural Networks (BNNs) with weight-space priors are particularly sensitive to hyperparameters (Cinquin et al., 2025; Arbel et al., 2023), and there is ambiguity as to what functions they induce."
- COATI: A chemistry foundation model providing ligand representations used as inputs to surrogates. "In our experiments, we use COATI, a ligand-only representation (Kaufman et al., 2024b), to address a single target."
- deep ensembles: Collections of independently trained neural networks whose predictive distribution approximates Bayesian uncertainty. "A simple method to construct probabilistic surrogates for large datasets is via deep ensembles (Lakshminarayanan et al., 2017; Wild et al., 2023),"
- Design-Make-Test-Analyze (DMTA) cycles: Iterative process in drug discovery involving design, synthesis, testing, and analysis of compounds. "repeated rounds of Design-Make-Test-Analyze (DMTA) cycles,"
- Dirac delta: A distribution concentrated at a single point, used to express discrete mass in probabilistic formulations. "where δ(x = ·) is the Dirac delta centered at x."
- EGFR inhibitors: Compounds that inhibit the Epidermal Growth Factor Receptor, a common drug target. "We use 13,201 unique publicly reported EGFR inhibitors extracted from BindingDB (Liu et al., 2025)."
- EMAX: A parallel acquisition function equal to the expected maximum value over a batch. "EMAX: E_{p(y_{1:B})}[max(y_{1:B})]. Expected maximum value in batch (Azimi et al., 2010), sometimes called parallel simple-regret (qSR) (Wilson et al., 2017)."
- Epistemic entropy: Uncertainty measure reflecting lack of knowledge about model predictions. "This can then be converted to a posterior predictive that can also be used to approximate epistemic entropy:"
- Epistemic Neural Networks (ENNs): Ensemble-like neural networks that marginalize over a latent epistemic index to produce joint predictive distributions. "Epistemic Neural Networks (ENNs) (Osband et al., 2021) are a similar ensemble method that can be used to obtain joint predictive distributions"
- epistemic index: Latent variable over which ENNs marginalize to generate different predictive functions. "with epistemic index z ~ pz."
- fold-over (FO): A fluorescence-based measurement quantifying target–ligand interaction relative to a control. "Target-ligand interaction is quantified via fluorescence intensity for a given molecule relative to the negative control (fold-over, FO)."
- Gaussian processes (GPs): Nonparametric probabilistic models defining distributions over functions, enabling exact joint posterior inference. "Gaussian processes (GPs) are the canonical probabilistic surrogate used for Batch BO for this reason, as they can provide exact joint posterior inference, straightforwardly accessible via sample paths."
- hedging: Selecting diversified candidates within a batch to mitigate correlated risks and improve exploration–exploitation balance. "parallel acquisition functions that hedge between selected designs,"
- inverse CDF (icdf): Transformation mapping uniform samples to a target distribution, used to generate quasi-random Gaussian samples. "apply a unit Gaussian icdf to get quasi-random Gaussian samples."
- joint negative log-loss (NLL): Loss measuring the quality of joint predictive distributions across multiple points. "pretraining the prior network using a reference process yields improved joint log-loss which translates to better Batch BO performance."
- joint predictive distribution: A multivariate predictive distribution over multiple inputs capturing correlations needed for parallel acquisition. "joint predictive distributions that capture the correlations necessary for such acquisition functions (Wen et al., 2021)."
- latent diffusion model (LDM): Generative model operating in latent space to sample molecules or structures. "a latent diffusion model of small molecules, COATI-LDM (Kaufman et al., 2024a),"
- latent representations: Feature vectors produced by foundation models capturing structural or chemical information. "use latent-representations from a large foundation model of chemistry (Kaufman et al., 2024b)."
- lengthscale: Kernel hyperparameter controlling function smoothness and correlation decay in GPs or RFFs. "using a length-scale of l = 0.5768."
- marginal negative log-loss (NLL): Loss measuring predictive quality at individual points, ignoring joint correlations. "Epinet variants are not strongly distinguished on marginal negative log-loss (NLL)."
- Matern32 kernel: A GP covariance function with specific smoothness controlling correlations between inputs. "Unseen test labels are obtained by sampling single warped GP paths using a Matern32 kernel with lengthscale l ∝ √d"
- Monte Carlo sampling: Randomized sampling method to approximate expectations over complex distributions. "be used to perform inference via Monte Carlo sampling."
- particle guidance: Sampling technique for diffusion models that promotes diverse, non-i.i.d. samples. "We sample these "decoy" compounds using particle guidance (Corso et al., 2023)"
- pIC50: Negative logarithm of IC50 concentration; a binding affinity measure where higher values imply stronger inhibition. "We aim to maximize experimental binding affinity measured in pIC50 units."
- prior network: The fixed component of an ENN that encodes a functional prior and shapes joint predictions. "the prior network remains fixed and divergence from it is tuned by regularizing the weights of the learnable network."
- probabilistic surrogate: Predictive model providing uncertainty-aware estimates to guide optimization. "probabilistic surrogate p_θ(y_{1:N} | x_{1:N}) in order to capture the correlations between candidate designs."
- qPO: A parallel acquisition function estimating the probability that a batch contains the global maximum. "qPO: E_{p(y_{1:N})}[max(y_{1:N}) ≤ max(y_{1:B})]. The probability of a batch containing the global maximum"
- quasi-random Gaussian samples: Low-discrepancy samples transformed to Gaussian space to improve coverage. "and apply a unit Gaussian icdf to get quasi-random Gaussian samples."
- Random Fourier Features (RFF): Technique to approximate kernels by mapping inputs into randomized trigonometric features. "Random Fourier Features: RFF(x) = √(2/d) cos(Wx + b) where W ~ N(0, l⁻²I), b ~ Unif(0, 2π), and l is a lengthscale hyperparameter (Rahimi & Recht, 2007)."
- sample path: A single function realization drawn from a stochastic process like a GP, used for joint sampling. "sample paths from a Gaussian Process (GP)"
- sigmoid-power warp: Nonlinear transformation applied to GP outputs to induce bounded, skewed marginals. "We use a simple sigmoid-power warp(h; a, b, c) = S((h − a)/b)^c, where S(·) is the sigmoid function"
- Sobol sequence: Low-discrepancy sequence used to generate well-spread samples in high-dimensional spaces. "using a low-discrepancy Sobol sequence with a burn-in of 100,"
- stochastic local-search: Randomized optimization routine using small perturbations (e.g., swaps) to improve a batch. "We use a simple stochastic local-search procedure that uses single random swaps of the batch"
- stop-grad: Operation that detaches activations from the computational graph to prevent gradient updates. "use detached (stop-grad) sg[·] hidden activations ã from a pretrained base network"
- submodular: Property of set functions enabling greedy optimization with guarantees; EMAX is noted not to have it. "EMAX is not submodular (Azimi et al., 2010),"
- ultradense microarray: High-density experimental platform enabling rapid measurement of many interactions. "a proprietary ultradense microarray (tArray)"
- warped GP: A Gaussian process whose outputs are transformed (warped) to yield non-Gaussian marginals. "we use a warped GP as a reference process to generate synthetic datasets of samples with non-Gaussian marginals,"