Mining Patents with Large Language Models Elucidates the Chemical Function Landscape (2309.08765v2)

Published 15 Sep 2023 in q-bio.QM and cs.LG

Abstract: The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of orthogonal methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain functional labels for approximately 100K molecules from their corresponding 188K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an orthogonal approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
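One way to picture the claim that text-derived functional labels are congruent with structural relationships is to compare the structural similarity of molecule pairs that share a label with that of pairs that do not. The sketch below is illustrative only and is not the authors' pipeline: the molecule names, fingerprint bits, and labels are hypothetical toy values, and fingerprints are modeled as plain sets of "on" bits instead of being computed with a cheminformatics toolkit.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical toy data: molecule id -> (fingerprint bits, functional labels).
# None of these values come from the CheF dataset.
molecules = {
    "mol_a": (frozenset({1, 2, 3, 4}), {"antiviral"}),
    "mol_b": (frozenset({1, 2, 3, 5}), {"antiviral"}),
    "mol_c": (frozenset({7, 8, 9}), {"herbicide"}),
    "mol_d": (frozenset({7, 8, 10}), {"herbicide"}),
}


def mean_similarity_by_label_sharing(mols):
    """Return (mean similarity of label-sharing pairs, mean of all other pairs)."""
    shared, unshared = [], []
    for (_, (fp1, labels1)), (_, (fp2, labels2)) in combinations(mols.items(), 2):
        bucket = shared if labels1 & labels2 else unshared
        bucket.append(tanimoto(fp1, fp2))

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(shared), mean(unshared)


shared_mean, unshared_mean = mean_similarity_by_label_sharing(molecules)
print(f"pairs sharing a label: {shared_mean:.2f}")
print(f"pairs sharing no label: {unshared_mean:.2f}")
```

If labels track structure, the first mean should be clearly higher than the second. On real data the fingerprints would come from a toolkit such as RDKit, and the comparison would need a significance test over many labels.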
