
Contextual Molecule Representation Learning from Chemical Reaction Knowledge (2402.13779v1)

Published 21 Feb 2024 in cs.LG, cs.AI, and q-bio.BM

Abstract: In recent years, self-supervised learning has emerged as a powerful tool to harness abundant unlabelled data for representation learning and has been broadly adopted in diverse areas. However, when applied to molecular representation learning (MRL), prevailing techniques such as masked sub-unit reconstruction often fall short, due to the high degree of freedom in the possible combinations of atoms within molecules, which brings insurmountable complexity to the masking-reconstruction paradigm. To tackle this challenge, we introduce REMO, a self-supervised learning framework that takes advantage of well-defined atom-combination rules in common chemistry. Specifically, REMO pre-trains graph/Transformer encoders on 1.7 million known chemical reactions in the literature. We propose two pre-training objectives: Masked Reaction Centre Reconstruction (MRCR) and Reaction Centre Identification (RCI). REMO offers a novel solution to MRL by exploiting the underlying shared patterns in chemical reactions as context for pre-training, which effectively infers meaningful representations of common chemistry knowledge. Such contextual representations can then be utilized to support diverse downstream molecular tasks with minimum finetuning, such as affinity prediction and drug-drug interaction prediction. Extensive experimental results on MoleculeACE, ACNet, drug-drug interaction (DDI), and reaction type classification show that across all tested downstream tasks, REMO outperforms the standard baseline of single-molecule masked modeling used in current MRL. Remarkably, REMO is the pioneering deep learning model surpassing fingerprint-based methods in activity cliff benchmarks.
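
The abstract describes the two pre-training objectives only at a high level. As a rough illustration of what they amount to, the sketch below shows one way a shared molecule encoder could be trained with a masked-reconstruction head (MRCR) and a per-atom reaction-centre classifier (RCI). This is a minimal sketch under stated assumptions, not the authors' implementation: the encoder interface, dimensions, head names, and equal loss weighting are all illustrative choices.

import torch.nn as nn
import torch.nn.functional as F

ATOM_VOCAB_SIZE = 119   # assumed atom-type vocabulary size (elements plus a mask token)
HIDDEN_DIM = 256        # assumed per-atom embedding width

class ReactionPretrainHeads(nn.Module):
    """Illustrative wrapper adding MRCR and RCI heads to a shared encoder."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        # Shared graph/Transformer encoder mapping (num_atoms, feat_dim) -> (num_atoms, HIDDEN_DIM).
        self.encoder = encoder
        # MRCR head: reconstruct the identity of masked reaction-centre atoms.
        self.mrcr_head = nn.Linear(HIDDEN_DIM, ATOM_VOCAB_SIZE)
        # RCI head: per-atom logit for "belongs to the reaction centre".
        self.rci_head = nn.Linear(HIDDEN_DIM, 1)

    def forward(self, atom_feats, mask_idx, atom_targets, centre_labels):
        # atom_feats:    (num_atoms, feat_dim) input features for the reactant set
        # mask_idx:      indices of atoms whose identity was masked (the reaction centre)
        # atom_targets:  (len(mask_idx),) true atom types at the masked positions
        # centre_labels: (num_atoms,) 1 if the atom is part of the reaction centre, else 0
        h = self.encoder(atom_feats)

        # Masked Reaction Centre Reconstruction: predict the masked atoms' identities,
        # conditioned on the remaining reactant atoms (the reaction "context").
        mrcr_logits = self.mrcr_head(h[mask_idx])
        mrcr_loss = F.cross_entropy(mrcr_logits, atom_targets)

        # Reaction Centre Identification: classify every atom as centre vs. non-centre.
        rci_logits = self.rci_head(h).squeeze(-1)
        rci_loss = F.binary_cross_entropy_with_logits(rci_logits, centre_labels.float())

        # Simple sum of the two objectives; the actual weighting used by REMO is not
        # specified in the abstract and is assumed here.
        return mrcr_loss + rci_loss

The intended contrast with standard single-molecule masked modeling is that, in this setup, the masked atoms are reconstructed and identified using reaction-level context rather than the molecule alone, which is how the abstract frames the benefit of pre-training on chemical reactions.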

Authors (7)
  1. Han Tang (3 papers)
  2. Shikun Feng (37 papers)
  3. Bicheng Lin (1 paper)
  4. Yuyan Ni (14 papers)
  5. Wei-Ying Ma (39 papers)
  6. Yanyan Lan (87 papers)
  7. Jingjing Liu (139 papers)
Citations (2)

