
Emerging Opportunities of Using Large Language Models for Translation Between Drug Molecules and Indications (2402.09588v2)

Published 14 Feb 2024 in cs.AI and cs.CL

Abstract: A drug molecule is a substance that changes the organism's mental or physical state. Every approved drug has an indication, which refers to the therapeutic use of that drug for treating a particular medical condition. While LLMs, a class of generative AI techniques, have recently demonstrated effectiveness in translating between molecules and their textual descriptions, there remains a gap in research regarding their application to translation between drug molecules and indications, or vice versa, which could greatly benefit the drug discovery process. The capability of generating a drug from a given indication would allow for the discovery of drugs targeting specific diseases or targets and ultimately provide patients with better treatments. In this paper, we first propose a new task, the translation between drug molecules and their corresponding indications, and then test existing LLMs on it. Specifically, we consider nine variations of the T5 LLM and evaluate them on two public datasets obtained from ChEMBL and DrugBank. Our experiments show early results of using LLMs for this task and provide a perspective on the state of the art. We also highlight current limitations and discuss future work with the potential to improve performance on this task. The generation of molecules from indications, or vice versa, would allow for more efficient targeting of diseases and significantly reduce the cost of drug discovery, with the potential to revolutionize the field in the era of generative AI.
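Models for the proposed task read and emit molecules as SMILES strings, so generated outputs are typically scored against reference molecules with string-level metrics such as exact match and Levenshtein (edit) distance. The sketch below is a minimal pure-Python illustration of that scoring step; the SMILES strings are illustrative examples, not drawn from the paper's ChEMBL or DrugBank datasets.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def exact_match(pred: str, ref: str) -> bool:
    """Strict string identity, a common headline metric for molecule generation."""
    return pred == ref

ref = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin (canonical SMILES)
pred = "CC(=O)Oc1ccccc1C(O)=O"  # near miss from a hypothetical model

print(levenshtein(pred, ref))   # → 2: close, though not an exact match
print(exact_match(pred, ref))   # → False
```

Note that string distance is only a proxy: two different SMILES strings can denote the same molecule, which is why chemistry-aware metrics such as fingerprint-based Tanimoto similarity are usually reported alongside it.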

Authors (7)
  1. David Oniani
  2. Jordan Hilsman
  3. Chengxi Zang
  4. Junmei Wang
  5. Lianjin Cai
  6. Jan Zawala
  7. Yanshan Wang
Citations (6)