Predicting small molecules solubilities on endpoint devices using deep ensemble neural networks (2307.05318v4)
Abstract: Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, which leads to long computations and relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification. Additionally, ease of use remains a concern for any computational technique, which explains the sustained popularity of group-based contribution methods. In this work, we address these problems with a deep learning model with predictive uncertainty that runs on a static website (without a server). This approach moves the computation onto the website visitor's device without requiring installation, removing the need to pay for and maintain servers. Our model achieves satisfactory results in solubility prediction. Furthermore, we demonstrate how to create molecular property prediction models that balance uncertainty and ease of use. The code is available at https://github.com/ur-whitelab/mol.dev, and the model is usable at https://mol.dev.
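To make the idea of predictive uncertainty from a deep ensemble concrete, the following is a minimal sketch of how the outputs of several ensemble members could be combined. It assumes that each member predicts a Gaussian mean and variance for the solubility and that the members are weighted equally as a uniform mixture, which is the usual deep-ensemble formulation; the abstract does not state these details. The function name `combine_ensemble` and the numbers in the usage example are hypothetical and are not taken from the mol.dev code.

```python
from statistics import fmean


def combine_ensemble(means, variances):
    """Combine per-member Gaussian predictions into one prediction.

    Assumption: every ensemble member outputs a mean and a variance,
    and the members are treated as a uniform mixture of Gaussians.
    Returns (mean, aleatoric, epistemic, total_variance).
    """
    if not means or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and equal in length")

    # Mixture mean: average of the member means.
    mu = fmean(means)
    # Aleatoric part: average of the variances each member predicts.
    aleatoric = fmean(variances)
    # Epistemic part: spread of the member means around the mixture mean.
    epistemic = fmean([(m - mu) ** 2 for m in means])
    # Total variance of the uniform Gaussian mixture.
    return mu, aleatoric, epistemic, aleatoric + epistemic


if __name__ == "__main__":
    # Hypothetical outputs of four ensemble members for one molecule.
    member_means = [-2.0, -2.4, -1.8, -2.2]
    member_variances = [0.30, 0.25, 0.35, 0.30]
    mu, alea, epi, total = combine_ensemble(member_means, member_variances)
    print(f"prediction = {mu:.2f} +/- {total ** 0.5:.2f}")
```

Splitting the total variance this way separates the noise each member attributes to the data from the disagreement between members, so a large second term flags molecules on which the ensemble is unsure.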