Mining Patents with Large Language Models Elucidates the Chemical Function Landscape (2309.08765v2)
Abstract: The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of orthogonal methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain functional labels for approximately 100K molecules from their corresponding 188K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an orthogonal approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
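The claim that text-derived functional labels are congruent with structural relationships can be illustrated with a toy check: molecules sharing a label should be more similar to one another, on average, than arbitrary pairs are. The sketch below is a minimal illustration, not the authors' analysis. The molecule names, fingerprint bits, and labels are hypothetical, and fingerprints are represented as plain sets of on-bits rather than being computed from real structures.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_pairwise_similarity(fingerprints) -> float:
    """Average Tanimoto similarity over all unordered pairs (0.0 if fewer than two)."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def label_coherence(data: dict, label: str) -> tuple:
    """Return (similarity among molecules carrying `label`, similarity among all molecules)."""
    with_label = [fp for fp, labels in data.values() if label in labels]
    everyone = [fp for fp, _ in data.values()]
    return mean_pairwise_similarity(with_label), mean_pairwise_similarity(everyone)


# Hypothetical toy data: molecule id -> (fingerprint on-bits, functional labels).
toy_data = {
    "mol_a": (frozenset({1, 2, 3, 4}), {"antiviral"}),
    "mol_b": (frozenset({1, 2, 3, 5}), {"antiviral"}),
    "mol_c": (frozenset({7, 8, 9}), {"herbicide"}),
    "mol_d": (frozenset({7, 8, 10}), {"herbicide"}),
}

within, background = label_coherence(toy_data, "antiviral")
print(f"within-label: {within:.3f}, background: {background:.3f}")
```

A within-label similarity well above the background value is the kind of signal that would suggest a label tracks structure. With real data, the fingerprints would come from a cheminformatics toolkit and the comparison would need a proper statistical test.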