Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over Functions (2307.15073v1)

Published 14 Jul 2023 in q-bio.BM, cs.LG, and stat.ML

Abstract: Accelerating the discovery of novel and more effective therapeutics is an important pharmaceutical problem in which deep learning is playing an increasingly significant role. However, real-world drug discovery tasks are often characterized by a scarcity of labeled data and significant covariate shift, a setting that poses a challenge to standard deep learning methods. In this paper, we present Q-SAVI, a probabilistic model able to address these challenges by encoding explicit prior knowledge of the data-generating process into a prior distribution over functions, presenting researchers with a transparent and probabilistically principled way to encode data-driven modeling preferences. Building on a novel, gold-standard bioactivity dataset that facilitates a meaningful comparison of models in an extrapolative regime, we explore different approaches to induce data shift and construct a challenging evaluation setup. We then demonstrate that using Q-SAVI to integrate contextualized prior knowledge of drug-like chemical space into the modeling process affords substantial gains in predictive accuracy and calibration, outperforming a broad range of state-of-the-art self-supervised pre-training and domain adaptation techniques.

Summary

The paper introduces Q-SAVI, a probabilistic model that embeds domain-informed priors to overcome covariate shifts in drug discovery data.
It employs function-space variational inference to explicitly integrate prior knowledge, boosting predictive accuracy and uncertainty calibration.
Experiments show Q-SAVI outperforming baselines that range from logistic regression to pre-trained graph neural networks, across dataset splits that induce varying degrees of covariate and label shift.

This paper addresses the application of deep learning models in early-stage drug discovery, focusing on scenarios where labeled data is scarce and subject to covariate shift. Standard deep learning approaches often fall short in such conditions: they generalize poorly and give unreliable predictions for out-of-distribution data points. The authors present Q-SAVI, a probabilistic model designed to incorporate domain-informed prior knowledge into the training process, thus enhancing predictive accuracy and calibration under covariate shift.

Introduction

The paper begins by outlining the significant role that deep learning can play in accelerating drug discovery, particularly in predicting molecular properties that are clinically relevant. However, it also points out the limitations faced due to the scarcity of labeled data and significant covariate shift, which are common in real-world drug discovery tasks. The authors introduce Q-SAVI, which encodes explicit prior knowledge of the data-generating process into a prior distribution over functions, aiming to improve model performance in these challenging settings.

Key Contributions

Q-SAVI Model:
- The proposed model leverages a prior distribution over functions. This approach deviates from the common practice of indirectly encoding inductive biases through pre-training, data augmentation, or architectural modifications.
- Q-SAVI performs variational inference over function evaluations rather than parameters, allowing for explicit incorporation of domain knowledge about the data-generating process.
Evaluation Setup:
- A robust and comprehensive evaluation setup is established using a carefully curated bioactivity dataset for training and testing the model. The authors emphasize that common datasets may not sufficiently test a model’s extrapolative capabilities due to inherent data biases and low-quality labels.
- The dataset was meticulously processed to exclude false positives and experimental artifacts, ensuring high-quality data for meaningful model comparisons.
Covariate and Label Shift Quantification:
- The authors introduce several splitting techniques, including random, scaffold, molecular weight, and spectral clustering-based splits to induce varying degrees of covariate and label shift.
- Detailed statistical metrics were used to quantify these shifts and validate the robustness of the evaluation setup.
Experimental Results:
- Extensive experiments were conducted comparing Q-SAVI with several baselines, including logistic regression, random forest, multi-layer perceptrons, deep ensembles, and state-of-the-art pre-trained graph neural networks.
- Q-SAVI achieved significant improvements in predictive accuracy and calibration on both highly shifted and less shifted datasets, showcasing its robustness and practical utility.
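
To make the splitting and shift-quantification idea above concrete, here is a minimal sketch of a molecular-weight split. It is an illustration only, not the authors' pipeline: the record layout, the 80/20 ratio, the toy values, and the two gap statistics (differences in mean weight and in positive rate) are all assumptions made for this example, and the paper's own shift statistics may differ.

```python
from statistics import mean


def molecular_weight_split(records, train_fraction=0.8):
    """Split records so the test set holds the heaviest molecules.

    `records` is a list of (molecular_weight, label) pairs. Both this
    layout and the default 80/20 ratio are assumptions for the sketch.
    """
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def positive_rate(records):
    """Fraction of records whose binary label is 1."""
    return mean(label for _, label in records)


# Toy data: (molecular weight in Da, binary activity label). Values are made up.
toy = [(180.2, 0), (250.3, 0), (310.4, 1), (275.1, 0), (420.5, 1),
       (390.0, 1), (150.1, 0), (505.6, 1), (330.2, 0), (460.9, 1)]

train, test = molecular_weight_split(toy)

# Crude shift indicators: how far the test set moves away from the training set.
covariate_gap = mean(w for w, _ in test) - mean(w for w, _ in train)
label_gap = positive_rate(test) - positive_rate(train)

print(f"train={len(train)} test={len(test)}")
print(f"mean weight gap: {covariate_gap:.1f} Da, positive-rate gap: {label_gap:.3f}")
```

A random split of the same data would typically show much smaller gaps, which is the contrast the different splitting strategies are meant to expose.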

Methodology

Q-SAVI estimates a posterior distribution over the functions that define the model rather than just the parameters. By leveraging function-space variational inference, the model allows for the specification of priors that can encode a nuanced understanding of chemical space and the data generation process. Specifically, Q-SAVI facilitates the integration of prior knowledge by:

- Defining a prior distribution over parametric function mappings.
- Extending the probabilistic model to include a label-space prior over an extended set of context points.
- Employing function-space variational inference to optimize the model’s posterior distribution over these function evaluations.
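
The steps above can be sketched in a few lines. The following is a generic illustration of a function-space variational objective, not the exact objective used in the paper: it assumes Gaussian distributions over function values that factorize across context points, evaluates the likelihood at point-estimate logits rather than in expectation, and all function names are hypothetical.

```python
import math


def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL(q || p) between two univariate Gaussians."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2 * std_p ** 2)
            - 0.5)


def function_space_kl(q_means, q_stds, p_means, p_stds):
    """Sum of per-point KL terms over function values at the context points.

    Assumes both distributions factorize across context points, which is
    a simplification made for this sketch.
    """
    return sum(gaussian_kl(mq, sq, mp, sp)
               for mq, sq, mp, sp in zip(q_means, q_stds, p_means, p_stds))


def bernoulli_log_likelihood(logits, labels):
    """Log-likelihood of binary labels under sigmoid(logit) probabilities."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def objective(train_logits, train_labels, q_means, q_stds, p_means, p_stds):
    """Data fit minus the function-space KL penalty (higher is better).

    The likelihood uses point-estimate logits, a crude stand-in for the
    expectation under the variational distribution.
    """
    return (bernoulli_log_likelihood(train_logits, train_labels)
            - function_space_kl(q_means, q_stds, p_means, p_stds))


# Variational beliefs at three context points versus a standard-normal prior.
kl = function_space_kl([0.0, 1.0, -0.5], [1.0, 1.0, 0.5],
                       [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
print(f"function-space KL at context points: {kl:.4f}")
```

The design point is that the penalty is computed on predicted function values at chosen context points, so a prior stated in terms of outputs on drug-like molecules can act on the model directly instead of through a prior on weights.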

The approach is evaluated empirically using AUC-ROC for predictive accuracy and the Brier score for calibration, alongside statistics designed to measure covariate and label shift.
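
For readers unfamiliar with the two metrics, the sketch below computes them from scratch on a toy example. The data is made up and the implementation is a plain textbook version (AUC-ROC as the fraction of positive-negative pairs ranked correctly, the Brier score as the mean squared error of predicted probabilities); it is not taken from the paper's code.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def auc_roc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions for four molecules (illustrative values only).
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]

print(f"AUC-ROC: {auc_roc(probs, labels):.2f}")
print(f"Brier score: {brier_score(probs, labels):.6f}")
```

Higher AUC-ROC is better, while a lower Brier score indicates probabilities that sit closer to the observed outcomes.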

Implications and Future Work

The findings suggest that explicitly encoding domain-informed prior knowledge into the training of deep learning models can substantially improve both predictive performance and uncertainty quantification in drug discovery applications. This has practical implications for tasks that require robust out-of-distribution generalization, such as identifying lead compounds in new chemical domains or optimizing pharmacokinetic properties.

The primary limitation noted is the added computational cost of performing function-space variational inference over a large set of context points. However, the authors argue that this cost is manageable and comparable to that of other sophisticated deep learning techniques.

Future work could explore the utility of Q-SAVI in active learning frameworks, potentially reducing experimental costs by directing efforts to the most informative experiments. Additionally, extending this approach to cover more complex molecular property prediction tasks and integrating it with generative models for molecule design presents promising avenues for further research.

Conclusion

This paper presents a significant step forward in addressing the challenges posed by data scarcity and covariate shift in early-stage drug discovery. By leveraging domain-informed prior distributions over functions via Q-SAVI, the authors demonstrate marked improvements in model performance, suggesting a viable path forward for utilizing deep learning in drug discovery. This work underscores the importance of tailoring machine learning approaches to the unique challenges of biomedical research and opens up new possibilities for advancing computational methodologies in this domain.
