
FARM: Functional Group-Aware Representations for Small Molecules (2410.02082v2)

Published 2 Oct 2024 in cs.LG and q-bio.QM

Abstract: We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key innovation of FARM lies in its functional group-aware tokenization, which directly incorporates functional group information into the representations. This strategic reduction in tokenization granularity is intentionally aligned with key drivers of functional properties (i.e., functional groups), enhancing the model's understanding of chemical language. By expanding the chemical lexicon, FARM more effectively bridges SMILES and natural language, ultimately advancing the model's capacity to predict molecular properties. FARM also represents molecules from two perspectives: by using masked language modeling to capture atom-level features and by employing graph neural networks to encode the whole molecule topology. By leveraging contrastive learning, FARM aligns these two representation views into a unified molecular embedding. We rigorously evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 10 out of 12 tasks. These results highlight FARM's potential to improve molecular representation learning, with promising applications in drug discovery and pharmaceutical research.
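To make the abstract's first key idea concrete, below is a minimal sketch of functional group-aware tokenization, assuming functional groups are detected via SMARTS substructure matching in RDKit. The `FG_SMARTS` mini-vocabulary, the `fg_aware_tokens` helper, and the `atom_FGname` token format are illustrative assumptions, not FARM's actual tokenizer or FG lexicon.

```python
# Sketch: tag each atom with the functional group it belongs to (if any),
# then emit atom-level tokens enriched with FG labels.
from rdkit import Chem

# Hypothetical mini-vocabulary of functional groups as SMARTS patterns;
# FARM's real FG vocabulary is much larger and not reproduced here.
FG_SMARTS = {
    "carboxylic_acid": "C(=O)[OH]",
    "amine": "[NX3;H2,H1;!$(NC=O)]",
    "hydroxyl": "[OX2H]",
}

def fg_aware_tokens(smiles: str) -> list[str]:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    # Map each matched atom index to the first FG pattern that covers it.
    fg_of_atom: dict[int, str] = {}
    for name, smarts in FG_SMARTS.items():
        patt = Chem.MolFromSmarts(smarts)
        for match in mol.GetSubstructMatches(patt):
            for idx in match:
                fg_of_atom.setdefault(idx, name)
    # Emit one token per atom, annotated with its FG label when present.
    return [
        f"{atom.GetSymbol()}_{fg_of_atom[atom.GetIdx()]}"
        if atom.GetIdx() in fg_of_atom else atom.GetSymbol()
        for atom in mol.GetAtoms()
    ]

# Acetic acid: the carboxyl carbon and both oxygens get FG-enriched tokens.
print(fg_aware_tokens("CC(=O)O"))
# ['C', 'C_carboxylic_acid', 'O_carboxylic_acid', 'O_carboxylic_acid']
```

The second key idea is the contrastive alignment of the two views: the sequence view (masked language modeling over FG-aware tokens) and the graph view (a GNN encoding of the molecular topology). The symmetric InfoNCE loss below is a common formulation for aligning two views of the same batch of molecules; FARM's exact objective and hyperparameters (e.g., the temperature) may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(seq_emb: torch.Tensor,
                               graph_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: the i-th sequence embedding and the i-th graph
    embedding describe the same molecule and form the positive pair."""
    z1 = F.normalize(seq_emb, dim=-1)
    z2 = F.normalize(graph_emb, dim=-1)
    logits = z1 @ z2.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```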

