Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets (2310.04292v3)

Published 6 Oct 2023 in cs.LG

Abstract: Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated and hence typically small, the lack of datasets with labeled features, and of codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library, which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.
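
As a rough illustration of the multi-task, sparse-label setup the abstract describes, the sketch below shows a shared encoder with one head per task group and a NaN-masked loss, so each molecule contributes gradients only for the tasks it is labeled for. This is not the Graphium API; the names used here (MultiTaskModel, masked_mse) and the toy data are hypothetical and serve only to make the idea concrete.

```python
import torch
import torch.nn as nn


class MultiTaskModel(nn.Module):
    """Shared trunk (standing in for a molecular GNN encoder) with one head per task group."""

    def __init__(self, in_dim, hidden_dim, task_dims):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One output head per task group (e.g. quantum vs. biological labels).
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_dim, dim) for name, dim in task_dims.items()}
        )

    def forward(self, x):
        h = self.encoder(x)
        return {name: head(h) for name, head in self.heads.items()}


def masked_mse(pred, target):
    """MSE computed only over labels that are actually present (non-NaN)."""
    mask = ~torch.isnan(target)
    if not mask.any():
        return pred.new_zeros(())
    return ((pred[mask] - target[mask]) ** 2).mean()


# Toy usage: 32 "molecules" with 16-dim features and two sparsely labeled task groups.
model = MultiTaskModel(in_dim=16, hidden_dim=64, task_dims={"quantum": 4, "bio": 2})
x = torch.randn(32, 16)
targets = {"quantum": torch.randn(32, 4), "bio": torch.randn(32, 2)}
targets["bio"][::3] = float("nan")  # simulate missing biological labels

preds = model(x)
loss = sum(masked_mse(preds[name], targets[name]) for name in preds)
loss.backward()
```

Masking out missing labels rather than dropping under-labeled molecules is what lets a single model train jointly on dense quantum labels and much sparser biological labels.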

Authors (35)
  1. Dominique Beaini (27 papers)
  2. Shenyang Huang (48 papers)
  3. Joao Alex Cunha (1 paper)
  4. Zhiyi Li (20 papers)
  5. Gabriela Moisescu-Pareja (2 papers)
  6. Oleksandr Dymov (1 paper)
  7. Samuel Maddrell-Mander (3 papers)
  8. Callum McLean (2 papers)
  9. Frederik Wenkel (14 papers)
  10. Luis Müller (7 papers)
  11. Jama Hussein Mohamud (4 papers)
  12. Ali Parviz (10 papers)
  13. Michael Craig (6 papers)
  14. Michał Koziarski (20 papers)
  15. Jiarui Lu (31 papers)
  16. Zhaocheng Zhu (22 papers)
  17. Cristian Gabellini (4 papers)
  18. Kerstin Klaser (11 papers)
  19. Josef Dean (5 papers)
  20. Cas Wognum (3 papers)