QComp: A QSAR-Based Data Completion Framework for Drug Discovery (2405.11703v1)
Abstract: In drug discovery, in vitro and in vivo experiments reveal biochemical activities related to the efficacy and toxicity of compounds. The experimental data accumulate into massive, ever-evolving, and sparse datasets. Quantitative Structure-Activity Relationship (QSAR) models, which predict biochemical activities using only the structural information of compounds, face challenges in integrating the evolving experimental data as studies progress. We develop QSAR-Complete (QComp), a data completion framework to address this issue. Based on pre-existing QSAR models, QComp utilizes the correlation inherent in experimental data to enhance prediction accuracy across various tasks. Moreover, QComp emerges as a promising tool for guiding the optimal sequence of experiments by quantifying the reduction in statistical uncertainty for specific endpoints, thereby aiding in rational decision-making throughout the drug discovery process.
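The idea of using correlations among experimental endpoints to refine QSAR predictions can be illustrated with a small sketch. This is not the QComp algorithm itself, which the abstract does not specify. It assumes, purely for illustration, that the errors between experimental values and QSAR predictions are jointly Gaussian with a known covariance, and that measurements are noise-free. The function name `complete` and all of its parameters are hypothetical.

```python
import numpy as np


def complete(qsar_pred, observed, cov):
    """Condition a Gaussian over assay endpoints on the measured ones.

    Illustrative assumption, not the paper's stated method.

    qsar_pred: (d,) QSAR predictions, used as the prior mean
    observed:  (d,) measured values, np.nan where no experiment exists
    cov:       (d, d) assumed covariance of (experiment - QSAR) errors
    Returns the updated mean and per-endpoint variance.
    """
    qsar_pred = np.asarray(qsar_pred, dtype=float)
    observed = np.asarray(observed, dtype=float)
    cov = np.asarray(cov, dtype=float)

    o = ~np.isnan(observed)  # endpoints with experimental data
    m = ~o                   # endpoints still missing

    mean = qsar_pred.copy()
    var = np.diag(cov).copy()

    # Measured endpoints are taken at face value (noise-free assumption).
    mean[o] = observed[o]
    var[o] = 0.0

    if o.any() and m.any():
        S_oo = cov[np.ix_(o, o)]
        S_mo = cov[np.ix_(m, o)]
        # gain = S_mo @ inv(S_oo), computed without an explicit inverse
        gain = np.linalg.solve(S_oo, S_mo.T).T
        # Shift each missing endpoint by the correlated QSAR error.
        mean[m] = qsar_pred[m] + gain @ (observed[o] - qsar_pred[o])
        # Posterior variance shrinks by the information gained.
        var[m] = np.diag(cov[np.ix_(m, m)] - gain @ S_mo.T)

    return mean, var
```

Under these assumptions the posterior variance does not depend on the measured values, only on which endpoints are measured. With a single measured endpoint i, the variance of a missing endpoint j drops by cov[j, i]**2 / cov[i, i], so candidate experiments could be ranked by the uncertainty reduction they are expected to deliver before any of them is run.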