When SMILES have Language: Drug Classification using Text Classification Methods on Drug SMILES Strings
Abstract: Complex chemical structures, such as drugs, are usually represented by SMILES strings, which encode a molecule as a sequence of atoms and bonds. These strings are used throughout machine learning-based drug research and molecular representation work. Setting aside complex representations, this work poses a single question: what if we treat drug SMILES as conventional sentences and apply text classification to drug classification? Our experiments affirm the possibility with very competitive scores. The study treats each atom and bond as a sentence component and uses basic NLP methods to categorize drug types, showing that complex problems can also be solved from simpler perspectives. The data and code are available here: https://github.com/azminewasi/Drug-Classification-NLP.
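To make the "SMILES as sentences" idea concrete, here is a minimal sketch and not the authors' implementation. It assumes a simple regex tokenizer, token n-grams as the "words", and a small multinomial Naive Bayes classifier written from scratch; the abstract only says "basic NLP methods", so each of these choices is an assumption. The four training molecules and the two label names are a hypothetical toy set for illustration, not the paper's data.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: bracket atoms, two-letter halogens, two-digit ring
# closures, then any single character (atoms, bonds, branches, ring digits).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into atom, bond, ring, and branch tokens."""
    return SMILES_TOKEN.findall(smiles)

def ngrams(tokens, n_max=2):
    """Return every token n-gram of length 1..n_max, space-joined."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.feature_counts[label].update(doc)
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for feat in doc:
                if feat in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy training set: (SMILES, illustrative class label).
TOY_DATA = [
    ("CC(=O)OC1=CC=CC=C1C(=O)O", "nsaid"),         # aspirin
    ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "nsaid"),    # ibuprofen
    ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "xanthine"),  # caffeine
    ("CN1C2=C(C(=O)N(C1=O)C)NC=N2", "xanthine"),   # theophylline
]

docs = [ngrams(tokenize(s)) for s, _ in TOY_DATA]
labels = [label for _, label in TOY_DATA]
model = NaiveBayes().fit(docs, labels)

# Classify an unseen molecule (naproxen) from its token n-grams.
query = "CC(C1=CC2=C(C=C1)C=C(C=C2)OC)C(=O)O"
print(model.predict(ngrams(tokenize(query))))
```

Four molecules are far too few to say anything about accuracy; the point is only that a SMILES string can pass through the same tokenize, featurize, and classify pipeline as an ordinary sentence.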