Transformers for molecular property prediction: Lessons learned from the past five years (2404.03969v1)

Published 5 Apr 2024 in cs.LG, cs.CL, and q-bio.QM

Abstract: Molecular Property Prediction (MPP) is vital for drug discovery, crop protection, and environmental science. Over the last decades, diverse computational techniques have been developed, from using simple physical and chemical properties and molecular fingerprints in statistical models and classical machine learning to advanced deep learning approaches. In this review, we aim to distill insights from current research on employing transformer models for MPP. We analyze the currently available models and explore key questions that arise when training and fine-tuning a transformer model for MPP. These questions encompass the choice and scale of the pre-training data, optimal architecture selections, and promising pre-training objectives. Our analysis highlights areas not yet covered in current research, inviting further exploration to enhance the field's understanding. Additionally, we address the challenges in comparing different models, emphasizing the need for standardized data splitting and robust statistical analysis.

Authors (4)
  1. Afnan Sultan (2 papers)
  2. Jochen Sieg (1 paper)
  3. Miriam Mathea (5 papers)
  4. Andrea Volkamer (3 papers)
Citations (4)

Summary

Unraveling the Potential of Transformers in Molecular Property Prediction

Overview of Transformers in MPP

Transformer models, originally developed for natural language processing, have increasingly been adapted for molecular property prediction (MPP). Their ability to process sequential data makes them well suited to molecular information encoded as strings, such as SMILES and SELFIES. This review distills insights from recent research on transformer models for MPP, covering key considerations in model architecture, the scale and composition of pre-training data, and adaptations specific to the cheminformatics domain. Despite promising applications in fields such as drug discovery and environmental science, the full potential of transformers in MPP has yet to be realized, and nuanced adaptations remain necessary.
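
To make the string representations mentioned above concrete, the minimal sketch below (a generic illustration, not code from the review) parses a molecule with RDKit, canonicalizes its SMILES, and re-encodes it as SELFIES; it assumes the rdkit and selfies packages are installed.

```python
# Minimal illustration of string-based molecular representations (not from the paper).
from rdkit import Chem
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"          # aspirin, written as SMILES
mol = Chem.MolFromSmiles(smiles)          # parse into an RDKit molecule
canonical = Chem.MolToSmiles(mol)         # canonical SMILES, a common normalization step
selfies_str = sf.encoder(canonical)       # equivalent SELFIES string

print(canonical)    # e.g. "CC(=O)Oc1ccccc1C(=O)O"
print(selfies_str)  # e.g. "[C][C][=Branch1][C][=O][O][C][=C]..." (token-delimited, robust by construction)
```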

Current State of Transformer Models in MPP

Architectural Adaptations

Transformer adaptations for MPP can be categorized by their architectural differences. These include the choice of transformer variant (e.g., BERT, RoBERTa) and modifications to better suit the chemical domain, such as adjustments to tokenization schemes, positional encoding methods, and model parameter sizes. Encoder-only variants are typically used for property prediction, while encoder-decoder architectures additionally support sequence-to-sequence tasks such as molecule generation, making transformers versatile tools across MPP workflows.
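
As one concrete example of a tokenizer-level adaptation, the sketch below splits a SMILES string into atom- and bond-level tokens using a regex pattern of the kind widely used in chemical language models; the exact pattern and vocabulary vary between models, so treat this as an illustrative assumption rather than any specific model's tokenizer.

```python
import re

# A regex of the kind commonly used to tokenize SMILES in chemical language models
# (variants of this pattern appear across the literature; treat it as illustrative).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'C', '(', '=', 'O', ')', 'O']
```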

Pre-training Data and Model Scaling

A critical aspect of leveraging transformers for MPP involves the choice and scale of the pre-training data. It is observed that beyond the size of the training dataset, its composition—reflecting the chemical space relevant to the application—plays a significant role in model performance. Models pre-trained on diverse datasets like ZINC, PubChem, and ChEMBL have shown varying success. Contrary to trends in natural language processing, merely increasing model size or pre-training data scale does not guarantee improved MPP performance. This underscores the need for a balanced approach to data selection and model scaling tailored to the chemical domain's intricacies.
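
The sketch below illustrates one routine curation step behind such data-composition choices: canonicalizing SMILES with RDKit and removing duplicates so that corpus size reflects distinct molecules rather than redundant strings. It is a generic example; real pre-training pipelines typically add further filters (element sets, molecular weight ranges, salt stripping, and so on).

```python
from rdkit import Chem

def canonicalize_and_deduplicate(smiles_list):
    """Map each parseable SMILES to its canonical form and drop duplicates.

    A generic curation step, not the pipeline of any specific model.
    """
    seen, unique = set(), []
    for s in smiles_list:
        mol = Chem.MolFromSmiles(s)
        if mol is None:                   # skip unparseable entries
            continue
        canon = Chem.MolToSmiles(mol)     # canonical SMILES as the deduplication key
        if canon not in seen:
            seen.add(canon)
            unique.append(canon)
    return unique

print(canonicalize_and_deduplicate(["C1=CC=CC=C1", "c1ccccc1", "not_a_smiles"]))
# ['c1ccccc1']  -- benzene written two ways collapses to one canonical entry
```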

Domain-specific Pre-training Objectives

Integrating domain-specific objectives during pre-training shows promise in enhancing model performance for MPP tasks. Objectives tailored to capture molecular characteristics—such as structural features and physico-chemical properties—contribute to the model's ability to understand and predict relevant molecular properties more accurately. Such objectives introduce a constructive bias, steering the model towards chemically meaningful representations.
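
The PyTorch module below is a hedged sketch of what such a domain-specific objective can look like: a standard masked-token loss combined with an auxiliary regression loss on precomputed physico-chemical descriptors (e.g., logP or molecular weight). The pooling choice, head design, and equal loss weighting are illustrative assumptions rather than the paper's prescription.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskPretrainingHead(nn.Module):
    """Illustrative pre-training head: masked-token prediction plus
    auxiliary regression on physico-chemical descriptors."""

    def __init__(self, hidden_dim: int, vocab_size: int, n_descriptors: int):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_dim, vocab_size)             # token reconstruction
        self.descriptor_head = nn.Linear(hidden_dim, n_descriptors)   # e.g. logP, molecular weight

    def forward(self, token_states, masked_labels, descriptors):
        # token_states:  (batch, seq_len, hidden) from a transformer encoder
        # masked_labels: (batch, seq_len), -100 at unmasked positions
        # descriptors:   (batch, n_descriptors), precomputed per molecule
        mlm_logits = self.mlm_head(token_states)                      # (batch, seq_len, vocab)
        mlm_loss = F.cross_entropy(
            mlm_logits.transpose(1, 2), masked_labels, ignore_index=-100
        )
        pooled = token_states[:, 0]                                   # [CLS]-style pooling
        descriptor_loss = F.mse_loss(self.descriptor_head(pooled), descriptors)
        return mlm_loss + descriptor_loss                             # equal weighting as a placeholder
```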

Fine-tuning Strategies

Fine-tuning strategies play a pivotal role in tailoring pre-trained models to specific MPP tasks. The decision to update model weights entirely or to freeze certain parameters during fine-tuning impacts the model's adaptability and performance on downstream tasks. The effectiveness of fine-tuning approaches varies, indicating the need for further exploration to identify optimal strategies within the domain of MPP.
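
A minimal PyTorch sketch of the two extremes discussed above, assuming a generic pre-trained encoder plus a task-specific head; the optimizer and learning rates are placeholders, not recommendations from the review.

```python
import torch
import torch.nn as nn

def configure_finetuning(encoder: nn.Module, head: nn.Module, freeze_encoder: bool):
    """Return an optimizer for either feature extraction (frozen encoder)
    or full fine-tuning (all weights updated)."""
    for p in encoder.parameters():
        p.requires_grad = not freeze_encoder
    if freeze_encoder:
        # Only the task head is trained; the pre-trained representation stays fixed.
        return torch.optim.AdamW(head.parameters(), lr=1e-3)
    # All weights are updated; a smaller learning rate is typically used.
    return torch.optim.AdamW(
        list(encoder.parameters()) + list(head.parameters()), lr=1e-5
    )
```

Intermediate strategies also exist, for example unfreezing only the top encoder layers or training low-rank adapters while keeping the backbone frozen.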

Challenges and Future Directions

A unified benchmark for evaluating transformer models in MPP is imperative for fair comparison and progress assessment. Standardization in data splitting methods, comprehensive statistical analysis, and robust performance reporting are critical to achieving this goal. Additionally, exploring efficient fine-tuning techniques, innovative pre-training objectives inspired by chemical knowledge, and systematic analysis of model scaling are promising avenues. Fostering advancements in these areas could catapult transformer models to the forefront of computational approaches in MPP, unlocking new possibilities in drug discovery and beyond.
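
One frequently advocated standardization is scaffold-based splitting, which keeps molecules sharing a Bemis-Murcko scaffold on the same side of the split so that test compounds are structurally distinct from training compounds. The sketch below, assuming RDKit is available, implements a simplified greedy version of such a splitter.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    """Greedy Bemis-Murcko scaffold split: molecules sharing a scaffold
    never end up on both sides of the split (simplified illustration)."""
    groups = defaultdict(list)
    for i, s in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=s, includeChirality=False)
        groups[scaffold].append(i)

    # Fill the training set with the largest scaffold groups first; the overflow
    # forms the test set, which therefore contains less common scaffolds.
    train_idx, test_idx = [], []
    train_cutoff = (1.0 - test_fraction) * len(smiles_list)
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train_idx) + len(group) <= train_cutoff:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx
```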

Conclusion

Transformer models harbor significant potential for revolutionizing molecular property prediction, provided that their deployment is finely tuned to the specific requirements of the chemical domain. Careful consideration of architectural adaptations, pre-training data composition, domain-specific objectives, and fine-tuning methodologies is paramount. By addressing current challenges and harnessing the full power of transformers, the research community can pave the way for substantial advancements in predictive cheminformatics.