
DrugAssist: A Large Language Model for Molecule Optimization (2401.10334v1)

Published 28 Dec 2023 in q-bio.QM, cs.AI, cs.CL, and cs.LG

Abstract: Recently, the impressive performance of LLMs on a wide range of tasks has attracted an increasing number of attempts to apply LLMs in drug discovery. However, molecule optimization, a critical task in the drug discovery pipeline, is currently an area that has seen little involvement from LLMs. Most of existing approaches focus solely on capturing the underlying patterns in chemical structures provided by the data, without taking advantage of expert feedback. These non-interactive approaches overlook the fact that the drug discovery process is actually one that requires the integration of expert experience and iterative refinement. To address this gap, we propose DrugAssist, an interactive molecule optimization model which performs optimization through human-machine dialogue by leveraging LLM's strong interactivity and generalizability. DrugAssist has achieved leading results in both single and multiple property optimization, simultaneously showcasing immense potential in transferability and iterative optimization. In addition, we publicly release a large instruction-based dataset called MolOpt-Instructions for fine-tuning LLMs on molecule optimization tasks. We have made our code and data publicly available at https://github.com/blazerye/DrugAssist, which we hope to pave the way for future research in LLMs' application for drug discovery.

DrugAssist: An LLM for Molecule Optimization

The paper "DrugAssist: A Large Language Model for Molecule Optimization" addresses a gap in the application of LLMs to drug discovery, specifically molecule optimization. While LLMs have influenced a wide array of fields, their applicability to molecule optimization has not been comprehensively explored. The authors present DrugAssist, a model that leverages the interactive capabilities of LLMs to enhance molecule optimization through human-machine dialogue.

Contributions

The paper's contributions are multifaceted:

  1. Interactive Molecule Optimization Model: DrugAssist emerges as an interactive model that incorporates human feedback in optimizing molecular structures. The emphasis on dialogue-based interaction is a distinct departure from existing non-interactive methodologies, which typically isolate the optimization problem from expert feedback loops.
  2. MolOpt-Instructions Dataset: The creation and release of the "MolOpt-Instructions" dataset constitute a significant step forward. This dataset provides a robust foundation for fine-tuning LLMs for molecule optimization tasks, offering a substantial collection of molecule pairs with diverse property differences and similarity constraints.
  3. Empirical Performance: The paper provides evidence of DrugAssist’s performance through rigorous evaluation. The model achieves leading results in tasks involving the optimization of multiple molecular properties, addressing the real-world requirement to maintain optimized property values within specified ranges.
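The human-in-the-loop, dialogue-based optimization described above can be sketched as a simple feedback loop. This is a minimal, hypothetical illustration: `propose_molecule` and `review` are invented stand-ins for an LLM call and an expert's judgment, not the actual DrugAssist API.

```python
# Sketch of an interactive optimization loop: the model proposes a molecule,
# a reviewer accepts it or the rejection is folded back into the next prompt.
# All function names and SMILES strings here are illustrative toys.

def propose_molecule(prompt: str, round_no: int) -> str:
    """Stand-in for an LLM call that returns a candidate SMILES string."""
    candidates = ["CCO", "CCN", "CCOC"]  # toy candidate pool
    return candidates[round_no % len(candidates)]

def review(smiles: str, target: str) -> bool:
    """Stand-in for expert feedback: accept when the target motif appears."""
    return target in smiles

def optimize(seed: str, target: str, max_rounds: int = 3):
    prompt = f"Optimize {seed} to improve solubility."
    for round_no in range(max_rounds):
        candidate = propose_molecule(prompt, round_no)
        if review(candidate, target):
            return candidate  # expert accepts, dialogue ends
        # rejection feedback becomes part of the next round's prompt
        prompt += f" Previous attempt {candidate} was rejected; try again."
    return None

print(optimize("CCO", "OC"))  # → CCOC after two rejected rounds
```

The key design point is that expert feedback is appended to the conversational context rather than discarded, which is what distinguishes this loop from one-shot, non-interactive generation.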

Methodology

The methodology underscores the creation of the MolOpt-Instructions dataset and the instruction tuning of the Llama2-7B-Chat model. The dataset boasts over a million molecule pairs, integrating various molecular properties pertinent to drug development. The instruction tuning is executed via multi-task learning to counteract phenomena such as catastrophic forgetting, ensuring that the model retains its general capabilities while honing its molecule-specific skills.
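To make the fine-tuning setup concrete, a single instruction-tuning record for a molecule-pair dataset might look like the following. The field names, SMILES, and threshold values here are invented for illustration; the real schema is in the released dataset at https://github.com/blazerye/DrugAssist.

```python
# A hypothetical MolOpt-Instructions-style record: an instruction describing
# the property change and similarity constraint, paired with source and
# target molecules.  One JSON object per pair, JSONL style.
import json

record = {
    "instruction": (
        "Modify the molecule to increase its water solubility while "
        "keeping similarity to the input above 0.65."
    ),
    "input": "CC1=CC=CC=C1",     # source molecule (toy example)
    "output": "CC1=CC=CC=C1O",   # target molecule with the improved property
    "metadata": {"property": "solubility", "similarity": 0.72},
}

line = json.dumps(record)        # serialize as one JSONL line
print(json.loads(line)["metadata"]["property"])
```

Records like this can be mixed with general-purpose instruction data during multi-task tuning, which is the standard way to mitigate the catastrophic forgetting the paper mentions.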

Results and Comparison

In comparative analyses, DrugAssist outperforms traditional sequence-based approaches such as Seq2Seq and Transformer models, achieving higher success rates on solubility and blood-brain barrier permeability (BBBP) optimization tasks. The model also shows strong capabilities in iterative optimization and property transferability. Compared with other LLMs, including GPT-3.5-turbo, DrugAssist adapts more effectively to task requirements through interactive dialogue, and its capacity for iterative, multi-property optimization aligns well with real-world pharmaceutical demands.
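Success rates for constrained optimization tasks like these are commonly scored by checking two conditions at once: the optimized property must land in the requested range, and the generated molecule must stay sufficiently similar to the input. The sketch below uses that generic recipe with invented thresholds and data, not the paper's exact evaluation protocol.

```python
# Generic success-rate scoring for constrained property optimization:
# a generated molecule counts as a success only if the property target
# AND the similarity constraint are both satisfied.

def success_rate(results, prop_min, sim_min=0.6):
    """results: list of (property_value, similarity_to_input) tuples."""
    if not results:
        return 0.0
    hits = sum(1 for prop, sim in results
               if prop >= prop_min and sim >= sim_min)
    return hits / len(results)

# toy evaluation batch: (predicted property gain, similarity to input)
batch = [(0.8, 0.71), (0.2, 0.90), (1.1, 0.55), (0.9, 0.80)]
print(success_rate(batch, prop_min=0.5))  # → 0.5 (2 of 4 satisfy both)
```

Note how the third pair fails despite a large property gain: the similarity constraint is what keeps "optimization" from drifting into generating an unrelated molecule.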

Implications and Future Directions

The implications of DrugAssist's development span both practical and theoretical realms. Practically, the introduction of an LLM capable of interactive optimization could significantly streamline the drug discovery pipeline, fostering more efficient integration of computational and expert-driven processes. Theoretically, this work poses interesting questions regarding the broader applicability of interactive LLM frameworks beyond molecule optimization.

Future research might explore multimodal data handling, further enhancing the model's interaction capabilities and broadening its application scope within biomedical domains. Additionally, addressing model hallucinations and response accuracy could further strengthen DrugAssist's optimization capabilities.

In conclusion, DrugAssist represents a notable progression in the application of LLMs to molecular science, harnessing the power of interactive AI to refine and optimize drug discovery processes through human-centric approaches. The publicly available dataset and model encourage further exploration and development within this exciting intersection of machine learning and chemistry.

Authors (8)
  1. Geyan Ye
  2. Xibao Cai
  3. Houtim Lai
  4. Xing Wang
  5. Junhong Huang
  6. Longyue Wang
  7. Wei Liu
  8. Xiangxiang Zeng
Citations (14)