
DrugAssist: A Large Language Model for Molecule Optimization (2401.10334v1)

Published 28 Dec 2023 in q-bio.QM, cs.AI, cs.CL, and cs.LG

Abstract: Recently, the impressive performance of LLMs on a wide range of tasks has attracted an increasing number of attempts to apply LLMs in drug discovery. However, molecule optimization, a critical task in the drug discovery pipeline, is currently an area that has seen little involvement from LLMs. Most of existing approaches focus solely on capturing the underlying patterns in chemical structures provided by the data, without taking advantage of expert feedback. These non-interactive approaches overlook the fact that the drug discovery process is actually one that requires the integration of expert experience and iterative refinement. To address this gap, we propose DrugAssist, an interactive molecule optimization model which performs optimization through human-machine dialogue by leveraging LLM's strong interactivity and generalizability. DrugAssist has achieved leading results in both single and multiple property optimization, simultaneously showcasing immense potential in transferability and iterative optimization. In addition, we publicly release a large instruction-based dataset called MolOpt-Instructions for fine-tuning LLMs on molecule optimization tasks. We have made our code and data publicly available at https://github.com/blazerye/DrugAssist, which we hope to pave the way for future research in LLMs' application for drug discovery.

DrugAssist: An LLM for Molecule Optimization

The paper "DrugAssist: A Large Language Model for Molecule Optimization" addresses a gap in the application of LLMs to drug discovery, specifically molecule optimization. While LLMs have influenced a wide array of fields, their applicability to molecule optimization has not been comprehensively explored. The authors present DrugAssist, a model that leverages the interactive capabilities of LLMs to enhance molecule optimization through human-machine dialogue.

Contributions

The paper's contributions are multifaceted:

  1. Interactive Molecule Optimization Model: DrugAssist emerges as an interactive model that incorporates human feedback in optimizing molecular structures. The emphasis on dialogue-based interaction is a distinct departure from existing non-interactive methodologies, which typically isolate the optimization problem from expert feedback loops.
  2. MolOpt-Instructions Dataset: The creation and release of the "MolOpt-Instructions" dataset constitute a significant step forward. This dataset provides a robust foundation for fine-tuning LLMs for molecule optimization tasks, offering a substantial collection of molecule pairs with diverse property differences and similarity constraints.
  3. Empirical Performance: The paper provides evidence of DrugAssist’s performance through rigorous evaluation. The model achieves leading results in tasks involving the optimization of multiple molecular properties, addressing the real-world requirement to maintain optimized property values within specified ranges.
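The human-in-the-loop, dialogue-based optimization described above can be sketched as a simple feedback loop. This is a minimal, hypothetical illustration: `propose_molecule` and `review` are invented stand-ins for an LLM call and an expert's judgment, not the actual DrugAssist API.

```python
# Sketch of an interactive optimization loop: the model proposes a molecule,
# a reviewer accepts it or the rejection is folded back into the next prompt.
# All function names and SMILES strings here are illustrative toys.

def propose_molecule(prompt: str, round_no: int) -> str:
    """Stand-in for an LLM call that returns a candidate SMILES string."""
    candidates = ["CCO", "CCN", "CCOC"]  # toy candidate pool
    return candidates[round_no % len(candidates)]

def review(smiles: str, target: str) -> bool:
    """Stand-in for expert feedback: accept when the target motif appears."""
    return target in smiles

def optimize(seed: str, target: str, max_rounds: int = 3):
    prompt = f"Optimize {seed} to improve solubility."
    for round_no in range(max_rounds):
        candidate = propose_molecule(prompt, round_no)
        if review(candidate, target):
            return candidate  # expert accepts, dialogue ends
        # rejection feedback becomes part of the next round's prompt
        prompt += f" Previous attempt {candidate} was rejected; try again."
    return None

print(optimize("CCO", "OC"))  # → CCOC after two rejected rounds
```

The key design point is that expert feedback is appended to the conversational context rather than discarded, which is what distinguishes this loop from one-shot, non-interactive generation.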

Methodology

The methodology underscores the creation of the MolOpt-Instructions dataset and the instruction tuning of the Llama2-7B-Chat model. The dataset boasts over a million molecule pairs, integrating various molecular properties pertinent to drug development. The instruction tuning is executed via multi-task learning to counteract phenomena such as catastrophic forgetting, ensuring that the model retains its general capabilities while honing its molecule-specific skills.
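To make the fine-tuning setup concrete, a single instruction-tuning record for a molecule-pair dataset might look like the following. The field names, SMILES, and threshold values here are invented for illustration; the real schema is in the released dataset at https://github.com/blazerye/DrugAssist.

```python
# A hypothetical MolOpt-Instructions-style record: an instruction describing
# the property change and similarity constraint, paired with source and
# target molecules.  One JSON object per pair, JSONL style.
import json

record = {
    "instruction": (
        "Modify the molecule to increase its water solubility while "
        "keeping similarity to the input above 0.65."
    ),
    "input": "CC1=CC=CC=C1",     # source molecule (toy example)
    "output": "CC1=CC=CC=C1O",   # target molecule with the improved property
    "metadata": {"property": "solubility", "similarity": 0.72},
}

line = json.dumps(record)        # serialize as one JSONL line
print(json.loads(line)["metadata"]["property"])
```

Records like this can be mixed with general-purpose instruction data during multi-task tuning, which is the standard way to mitigate the catastrophic forgetting the paper mentions.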

Results and Comparison

In comparative analyses, DrugAssist outperforms traditional sequence-based approaches such as Seq2Seq and Transformer models, achieving higher success rates on solubility and blood-brain barrier permeability (BBBP) optimization tasks. The model also shows strong capabilities in iterative optimization and property transferability. Compared with other LLMs, including GPT-3.5-turbo, DrugAssist adapts more effectively to task requirements through interactive dialogue, and its capacity for iterative, multi-property optimization aligns well with real-world pharmaceutical demands.
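Success rates for constrained optimization tasks like these are commonly scored by checking two conditions at once: the optimized property must land in the requested range, and the generated molecule must stay sufficiently similar to the input. The sketch below uses that generic recipe with invented thresholds and data, not the paper's exact evaluation protocol.

```python
# Generic success-rate scoring for constrained property optimization:
# a generated molecule counts as a success only if the property target
# AND the similarity constraint are both satisfied.

def success_rate(results, prop_min, sim_min=0.6):
    """results: list of (property_value, similarity_to_input) tuples."""
    if not results:
        return 0.0
    hits = sum(1 for prop, sim in results
               if prop >= prop_min and sim >= sim_min)
    return hits / len(results)

# toy evaluation batch: (predicted property gain, similarity to input)
batch = [(0.8, 0.71), (0.2, 0.90), (1.1, 0.55), (0.9, 0.80)]
print(success_rate(batch, prop_min=0.5))  # → 0.5 (2 of 4 satisfy both)
```

Note how the third pair fails despite a large property gain: the similarity constraint is what keeps "optimization" from drifting into generating an unrelated molecule.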

Implications and Future Directions

The implications of DrugAssist's development span both practical and theoretical realms. Practically, the introduction of an LLM capable of interactive optimization could significantly streamline the drug discovery pipeline, fostering more efficient integration of computational and expert-driven processes. Theoretically, this work poses interesting questions regarding the broader applicability of interactive LLM frameworks beyond molecule optimization.

Future research might explore multimodal data handling, further enhancing the model's interaction capabilities and broadening its application scope within biomedical domains. Additionally, addressing model hallucinations and response accuracy could further strengthen DrugAssist's optimization capabilities.

In conclusion, DrugAssist represents a notable progression in the application of LLMs to molecular science, harnessing the power of interactive AI to refine and optimize drug discovery processes through human-centric approaches. The publicly available dataset and model encourage further exploration and development within this exciting intersection of machine learning and chemistry.

Authors (8)
  1. Geyan Ye
  2. Xibao Cai
  3. Houtim Lai
  4. Xing Wang
  5. Junhong Huang
  6. Longyue Wang
  7. Wei Liu
  8. Xiangxiang Zeng
Citations (14)