
Beyond Chemical Language: A Multimodal Approach to Enhance Molecular Property Prediction (2306.14919v1)

Published 22 Jun 2023 in physics.chem-ph, cs.LG, and q-bio.QM

Abstract: We present a novel multimodal LLM approach for predicting molecular properties by combining chemical language representation with physicochemical features. Our approach, MULTIMODAL-MOLFORMER, utilizes a causal multistage feature selection method that identifies physicochemical features based on their direct causal effect on a specific target property. These causal features are then integrated with the vector space generated by molecular embeddings from MOLFORMER. In particular, we employ Mordred descriptors as physicochemical features and identify the Markov blanket of the target property, which theoretically contains the most relevant features for accurate prediction. Our results demonstrate the superior performance of our proposed approach compared to existing state-of-the-art algorithms, including the chemical language-based MOLFORMER and graph neural networks, on complex tasks such as biodegradability and PFAS toxicity estimation. Moreover, we demonstrate the effectiveness of our feature selection method in reducing the dimensionality of the Mordred feature space while maintaining or improving the model's performance. Our approach opens up promising avenues for future research in molecular property prediction by harnessing the synergistic potential of both chemical language and physicochemical features, leading to enhanced performance and advancements in the field.
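The fusion strategy the abstract describes can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the arrays below are random stand-ins for MoLFormer embeddings and Mordred descriptors, and a simple mutual-information filter stands in for the paper's multistage Markov-blanket selection. All variable names and the top-k cutoff are assumptions for the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_mols = 200
emb = rng.normal(size=(n_mols, 32))    # stand-in for MoLFormer embeddings
desc = rng.normal(size=(n_mols, 100))  # stand-in for Mordred descriptors

# Synthetic binary target that depends on the first two descriptors.
y = (desc[:, 0] + 0.5 * desc[:, 1] + 0.1 * rng.normal(size=n_mols) > 0).astype(int)

# Stage 1: rank descriptors by dependence on the target and keep the top 10
# (a crude proxy for identifying the target's Markov blanket).
mi = mutual_info_classif(desc, y, random_state=0)
selected = desc[:, np.argsort(mi)[-10:]]

# Stage 2: fuse both modalities by concatenation and fit a downstream predictor.
X = np.hstack([emb, selected])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(X.shape)  # fused feature matrix: (200, 42)
```

The key design point is that the descriptor branch is pruned before fusion, so the final model sees a low-dimensional, target-relevant physicochemical view alongside the full learned embedding.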

Authors (7)
  1. Eduardo Soares (11 papers)
  2. Emilio Vital Brazil (16 papers)
  3. Karen Fiorela Aquino Gutierrez (1 paper)
  4. Renato Cerqueira (16 papers)
  5. Dan Sanders (2 papers)
  6. Kristin Schmidt (3 papers)
  7. Dmitry Zubarev (12 papers)
Citations (2)
