QComp: A QSAR-Based Data Completion Framework for Drug Discovery (2405.11703v1)

Published 20 May 2024 in cs.LG

Abstract: In drug discovery, in vitro and in vivo experiments reveal biochemical activities related to the efficacy and toxicity of compounds. The experimental data accumulate into massive, ever-evolving, and sparse datasets. Quantitative Structure-Activity Relationship (QSAR) models, which predict biochemical activities using only the structural information of compounds, face challenges in integrating the evolving experimental data as studies progress. We develop QSAR-Complete (QComp), a data completion framework to address this issue. Based on pre-existing QSAR models, QComp utilizes the correlation inherent in experimental data to enhance prediction accuracy across various tasks. Moreover, QComp emerges as a promising tool for guiding the optimal sequence of experiments by quantifying the reduction in statistical uncertainty for specific endpoints, thereby aiding in rational decision-making throughout the drug discovery process.

Summary

  • The paper introduces QComp, a data completion framework that improves QSAR predictions by leveraging correlations in sparse experimental data.
  • It employs a probabilistic model with one-shot data completion to update biochemical activity predictions based on new data.
  • On the ADMET-750k dataset, QComp raises the mean r² from 0.487 (base QSAR) to 0.620, with further gains reported on the fup and peptide datasets.

Streamlined Drug Discovery with QComp: Enhancing QSAR Models

Background on QSAR Models

In drug discovery, predicting a molecule's biochemical activities, such as those related to efficacy and toxicity, is crucial. Quantitative Structure-Activity Relationship (QSAR) models address this task: they relate the chemical structure of compounds to their biological activities, which makes them essential for high-throughput screening in materials and drug discovery.

However, these models face significant challenges. Experimental data keep accumulating and evolving, so historical datasets become massive yet sparse. Simply retraining QSAR models on new data isn't always feasible or cost-effective, especially when the new experimental data are few compared to the pre-existing data.

Introducing QSAR-Complete (QComp)

Enter QSAR-Complete, or QComp, a novel data completion framework designed to meet these challenges head-on. QComp enhances traditional QSAR models by leveraging correlations inherent in the available experimental data, improving prediction accuracy across various tasks. Its benefits extend to guiding the sequence of experiments in drug discovery, reducing uncertainties and supporting more rational decision-making.

Methodology

Probabilistic Framework: At the core of QComp is a probabilistic model that treats the biochemical activities of a molecule as a probability distribution influenced by the molecule's chemical structure. This model accounts for both the known and unknown biochemical activities, updating predictions based on new experimental data. The underlying assumption is that the deviations of these activities from QSAR model predictions follow a normal distribution.

Training and Data Completion: The QComp model is trained using a log-likelihood loss function. Once trained, it can perform one-shot data completion for missing biochemical activity data by estimating the most probable values based on observed data and pre-existing QSAR models.
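The completion step described above can be sketched with standard multivariate-normal conditioning. This is a minimal illustration, not the paper's implementation: the function names `complete_row` and `observed_nll`, and the choice to pass the covariance matrix `sigma` in directly rather than learn it, are assumptions made here. It relies only on the stated modeling assumption that deviations from the QSAR predictions are normally distributed with a covariance shared by all compounds.

```python
import numpy as np

def complete_row(y, mu, sigma):
    """Fill unmeasured activities with their conditional Gaussian mean.

    y     : (d,) measured activities, NaN where unmeasured
    mu    : (d,) QSAR predictions for the same compound
    sigma : (d, d) covariance of the residuals y - mu
            (assumed here to be shared by all compounds)
    Returns (completed values, conditional variance of each entry).
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    obs = ~np.isnan(y)
    mis = ~obs
    # Start from the QSAR prediction wherever no measurement exists.
    filled = np.where(obs, y, mu)
    var = np.where(obs, 0.0, np.diag(sigma))
    if obs.any() and mis.any():
        s_oo = sigma[np.ix_(obs, obs)]
        s_mo = sigma[np.ix_(mis, obs)]
        # gain = Sigma_mo @ inv(Sigma_oo), computed without an explicit inverse
        gain = np.linalg.solve(s_oo, s_mo.T).T
        filled[mis] = mu[mis] + gain @ (y[obs] - mu[obs])
        var[mis] = np.diag(sigma)[mis] - np.einsum("ij,ij->i", gain, s_mo)
    return filled, var

def observed_nll(y, mu, sigma):
    """Negative Gaussian log-likelihood of the measured entries only."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    obs = ~np.isnan(y)
    r = (y - mu)[obs]
    s_oo = sigma[np.ix_(obs, obs)]
    _, logdet = np.linalg.slogdet(s_oo)
    return 0.5 * (r @ np.linalg.solve(s_oo, r) + logdet + obs.sum() * np.log(2 * np.pi))
```

For two assays with unit variance and covariance 0.8, QSAR predictions of 0 for both, and a measured value of 1.0 for the first assay, `complete_row` shifts the second assay's estimate to 0.8 and reduces its variance from 1.0 to about 0.36. The completion is one-shot in the sense that it is a closed-form update with no iterative imputation loop.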

Experimental Results

Data and Models: QComp was evaluated using several datasets, including three proprietary ADMET datasets and one public dataset. For instance, the ADMET-750k dataset comprises data from 32 assays related to small molecules, while the public dataset involves data from 25 assays for over 114,000 small molecules. QComp utilized multi-task Chemprop models and other baseline models for data completion.

Benchmark Performance:

  • Improvement Across the Board: In the ADMET-750k dataset, QComp systematically outperformed various baseline data completion methods like MICE, Missforest, and Macau. It improved the mean squared Pearson correlation coefficient (r²) from 0.487 (base QSAR) to 0.620. Compared to other methods, QComp was the most robust, maintaining or improving accuracy for nearly all assays tested.
  • Human vs. Animal Data: In the fup dataset, which contains fraction unbound in plasma data for humans, rats, and dogs, QComp significantly enhanced human assay predictions when animal data was available. The r² score improved from 0.494 (base QSAR) to 0.751 when using both rat and dog data.
  • Peptide Dataset Performance: QComp also showed its versatility by improving predictions in a peptide dataset, increasing the average r² score from 0.428 to 0.673 for assays with adequate data.
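
The metric quoted in these results, the squared Pearson correlation averaged over assays, can be computed on a sparse test matrix roughly as follows. This is a sketch under assumptions: the function name `mean_squared_pearson` and the `min_points` threshold are hypothetical, and the paper's exact evaluation protocol (for example, how it decides which assays have adequate data) is not given in this summary.

```python
import numpy as np

def mean_squared_pearson(y_true, y_pred, min_points=3):
    """Average squared Pearson r over assays (columns).

    y_true : (n, d) measured values, NaN where no measurement exists
    y_pred : (n, d) model predictions for every compound and assay
    Assays with fewer than `min_points` measurements are skipped
    (a hypothetical threshold chosen for this sketch).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scores = []
    for j in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, j])
        if mask.sum() < min_points:
            continue
        # Correlate predictions with measurements on measured compounds only.
        r = np.corrcoef(y_true[mask, j], y_pred[mask, j])[0, 1]
        scores.append(r ** 2)
    return float(np.mean(scores))
```

Because the score is computed per assay and then averaged, an assay with few measurements weighs as much as one with many, which is worth keeping in mind when comparing the averages above.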

Practical Implications

Data-Driven Decision-Making: Beyond prediction accuracy, QComp's ability to quantify the gain of certainty (GOC) in predictions makes it invaluable for guiding experimental design. For example, when focusing on "MRT, rat" assays in the ADMET-750k dataset, QComp could prioritize which in vitro assays to measure first, optimizing resource allocation in drug discovery.
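
One way to picture this prioritization is a greedy loop over the residual covariance matrix. The sketch below is an assumption-laden illustration, not the paper's definition of GOC: it uses the fractional reduction in the target endpoint's conditional variance as a stand-in for the gain of certainty, and the names `conditional_variance` and `greedy_order` are hypothetical. Under a Gaussian model the conditional variance does not depend on the measured values, so the ranking can be computed before any experiment is run.

```python
import numpy as np

def conditional_variance(sigma, target, measured):
    """Variance of the target endpoint given a set of measured assays."""
    if not measured:
        return float(sigma[target, target])
    m = list(measured)
    s_mm = sigma[np.ix_(m, m)]
    s_tm = sigma[target, m]
    return float(sigma[target, target] - s_tm @ np.linalg.solve(s_mm, s_tm))

def greedy_order(sigma, target, candidates, n_steps):
    """Pick assays one at a time, each time taking the one that most
    reduces the target's remaining variance.

    Returns a list of (assay index, fractional variance reduction) pairs,
    where the reduction is a proxy for gain of certainty in this sketch.
    """
    sigma = np.asarray(sigma, dtype=float)
    measured, order = [], []
    remaining = list(candidates)
    for _ in range(min(n_steps, len(remaining))):
        current = conditional_variance(sigma, target, measured)
        best = min(
            remaining,
            key=lambda a: conditional_variance(sigma, target, measured + [a]),
        )
        new = conditional_variance(sigma, target, measured + [best])
        order.append((best, 1.0 - new / current))
        measured.append(best)
        remaining.remove(best)
    return order
```

With a target endpoint at index 0 and two candidate assays whose residuals correlate 0.9 and 0.5 with it, the loop selects the 0.9-correlated assay first, since it removes 81% of the target's variance on its own. The loop ignores the cost of each assay, which matches the limitation noted in the next section.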

Limitations and Future Work

Homogeneity of Covariance Assumption: Currently, QComp assumes a uniform covariance matrix across all compounds, which might not capture the nuanced variations in some datasets.

Integration with QSAR Models: Future work could explore the benefits of concurrently training QSAR and QComp models, potentially yielding even greater improvements in prediction accuracy.

Cost-Effective Experimentation: The proposed greedy scheme for experiment prioritization doesn't yet consider the economic or ethical costs. Future developments could incorporate these factors to provide more nuanced guidance.

Conclusion

QComp represents a significant step forward in leveraging sparse and evolving experimental data within the framework of traditional QSAR models. By intelligently integrating new data and providing a structured approach to experimental design, QComp can effectively streamline the drug discovery process, saving both time and resources.
