DrugLLM: Open Large Language Model for Few-shot Molecule Generation (2405.06690v1)
Abstract: LLMs have made great strides in areas such as language processing and computer vision. Despite the emergence of diverse techniques to improve few-shot learning, current LLMs fall short in handling the languages of biology and chemistry. For example, they struggle to capture the relationship between molecular structure and pharmacochemical properties, and their few-shot learning capacity for small-molecule drug modification remains limited. In this work, we introduce DrugLLM, an LLM tailored for drug design. During training, we employ Group-based Molecular Representation (GMR) to represent molecules, arranging them in sequences that reflect successive modifications aimed at enhancing a specific molecular property. DrugLLM learns how to modify molecules in drug discovery by predicting the next molecule based on past modifications. Extensive computational experiments demonstrate that DrugLLM can generate new molecules with expected properties from only a few examples, exhibiting a powerful few-shot molecule generation capacity.
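The training setup described above can be sketched in miniature: molecules are ordered so that each step corresponds to a modification improving a target property, and the model's objective is to predict the next molecule in the sequence. The helper below is a hypothetical illustration only; SMILES strings stand in for the paper's Group-based Molecular Representation (GMR), whose exact format is not given here, and the property scores are made up.

```python
def build_modification_sequence(records, k=3):
    """Order (molecule, property) pairs by increasing property value, then
    split into a few-shot context of k molecules plus the next one as the
    prediction target, mirroring next-molecule prediction."""
    ordered = sorted(records, key=lambda r: r[1])
    context = [mol for mol, _ in ordered[:k]]
    target = ordered[k][0]
    return context, target

# Toy series with invented property scores (higher = better).
records = [
    ("CCO", 0.42),
    ("CC(=O)O", 0.61),
    ("CCN", 0.35),
    ("CC(N)C(=O)O", 0.78),
]
context, target = build_modification_sequence(records, k=3)
# An autoregressive prompt: given the improving series, generate the target.
prompt = " -> ".join(context)
```

In the paper's actual pipeline the sequence elements would be GMR strings and the "prediction" is performed by the trained LLM rather than a lookup; this sketch only shows how property-ordered sequences turn molecule modification into a next-token-style task.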
- Xianggen Liu
- Yan Guo
- Haoran Li
- Jin Liu
- Shudong Huang
- Bowen Ke
- Jiancheng Lv