- The paper introduces ChemBERTa, a transformer model pretrained on 77M SMILES strings, and reports an average ROC-AUC gain of 0.110 when pretraining is scaled from 100K to 10M molecules.
- It demonstrates that self-supervised pretraining methods from NLP can be adapted to molecular property prediction, approaching, though not yet surpassing, strong GNN and fingerprint baselines.
- The study releases pre-trained models and curated datasets, paving the way for scalable and cost-effective approaches in molecular representation learning.
An Expert Analysis of "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction"
The research article "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction" investigates the application of transformer models, specifically ChemBERTa, to molecular property prediction—a domain traditionally dominated by Graph Neural Networks (GNNs) and chemical fingerprints. This paper provides a systematic exposition of how transformer architectures, a mainstay in NLP, can be adapted and evaluated in the context of cheminformatics.
Summary of Key Findings
The paper introduces ChemBERTa, a transformer model pretrained on a dataset of 77 million Simplified Molecular Input Line Entry System (SMILES) strings drawn from PubChem. Through extensive experimentation, the authors show that although ChemBERTa does not surpass state-of-the-art baselines such as those provided by Chemprop, its performance improves consistently as the pretraining corpus grows. For instance, the paper reports an average increase of 0.110 in ROC-AUC when pretraining is expanded from 100K to 10M samples, indicating that the model can harness larger data volumes to learn effective molecular representations.
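To make the pretraining setup concrete, the sketch below illustrates RoBERTa-style masked-language-model pretraining on SMILES strings with the HuggingFace `transformers` stack. The tiny corpus, the hyperparameters, and the `seyonec/ChemBERTa-zinc-base-v1` tokenizer checkpoint are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of masked-language-model pretraining on SMILES strings.
# Assumes the HuggingFace `transformers` stack; checkpoint name, corpus,
# and hyperparameters are illustrative placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
)

# A toy corpus standing in for the 77M PubChem SMILES used in the paper.
smiles_corpus = [
    "CCO",
    "c1ccccc1",
    "CC(=O)Oc1ccccc1C(=O)O",
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
]

# Tokenizer: assumed to be a SMILES BPE tokenizer released with ChemBERTa.
tokenizer = RobertaTokenizerFast.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

# A small RoBERTa-style encoder trained from scratch with the MLM objective.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Standard 15% token masking, as in BERT/RoBERTa-style pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
encodings = [tokenizer(s, truncation=True, max_length=128) for s in smiles_corpus]
loader = DataLoader(encodings, batch_size=2, shuffle=True, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(2):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"epoch {epoch} loss {loss.item():.3f}")
```

In the actual study, this kind of objective is applied at the scale of millions of PubChem SMILES rather than a handful of molecules.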
Furthermore, the paper examines several practical aspects of applying transformers in cheminformatics, such as tokenizer selection and the impact of different molecular string representations, including SMILES and SELFIES (SELF-referencing Embedded Strings). Interestingly, the analysis did not find a significant performance difference between SMILES and SELFIES, suggesting that further investigation is warranted.
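The contrast between the two representations and between tokenization schemes can be illustrated briefly. The sketch below converts a SMILES string to SELFIES with the open-source `selfies` package and tokenizes the SMILES with the widely used atom-level regular expression from Schwaller et al.; this illustrates the design space and is not necessarily the exact tokenizer shipped with ChemBERTa.

```python
# Quick comparison of SMILES vs. SELFIES token streams for one molecule.
# The regex is the commonly used atom-level SMILES pattern, assumed here
# as a stand-in for ChemBERTa's own tokenizer choices.
import re
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin

# SELFIES: every string decodes back to a valid molecule by construction.
selfies_str = sf.encoder(smiles)
selfies_tokens = list(sf.split_selfies(selfies_str))

# Atom-level regex tokenization of the raw SMILES string.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)
smiles_tokens = SMILES_REGEX.findall(smiles)

print("SMILES tokens :", smiles_tokens)
print("SELFIES tokens:", selfies_tokens)
print("Round trip    :", sf.decoder(selfies_str))
```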
Theoretical and Practical Implications
This research has several implications for molecular representation learning. The exploratory application of transformers like ChemBERTa to cheminformatics may pave the way for refined pretraining methodologies, potentially reducing reliance on costly labeled data, a significant barrier in this domain. Moreover, given that ChemBERTa approaches the performance of established GNN methods and continues to improve with more pretraining data, the scalability of transformers is a substantive avenue for future exploration.
On the practical side, the release of pre-trained models and curated datasets presents a valuable resource for further research and application development. By providing access to models and data, the authors facilitate community-driven work on transformer architectures for tasks beyond those considered in the original paper.
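As an example of how such released artifacts can be reused, the sketch below fine-tunes a pretrained checkpoint on a toy binary property-prediction task and scores it with ROC-AUC. The checkpoint name, the four-molecule dataset, and the labels are placeholders assumed for illustration; a real experiment would use a MoleculeNet-style benchmark with proper train/validation/test splits.

```python
# Minimal sketch of reusing a released ChemBERTa checkpoint for downstream
# property prediction; checkpoint, data, and labels are placeholders.
import torch
from sklearn.metrics import roc_auc_score
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

checkpoint = "seyonec/ChemBERTa-zinc-base-v1"  # assumed released checkpoint
tokenizer = RobertaTokenizerFast.from_pretrained(checkpoint)
model = RobertaForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy labeled data standing in for a binary task such as BBBP.
train_smiles = ["CCO", "c1ccccc1O", "CCN(CC)CC", "O=C(O)c1ccccc1"]
train_labels = torch.tensor([1, 0, 1, 0])

batch = tokenizer(train_smiles, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few gradient steps; real fine-tuning uses proper splits
    out = model(**batch, labels=train_labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)[:, 1]
print("train ROC-AUC:", roc_auc_score(train_labels.numpy(), probs.numpy()))
```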
Future Directions
The paper opens several pathways for future investigation. Notably, scaling pretraining beyond the initial 77M-molecule dataset to larger collections such as ZINC-15, or adopting alternative pretraining objectives such as ELECTRA's replaced-token detection (sketched below), might improve sample efficiency and predictive performance. Additionally, integrating graph-based inductive biases into transformer models could combine the strengths of GNNs and NLP transformers, potentially improving sample efficiency and model reliability.
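For concreteness, here is a hedged sketch of what an ELECTRA-style objective could look like on SMILES. Random token corruption stands in for the generator network of full ELECTRA, and the tokenizer checkpoint is an assumption; the paper itself does not report ELECTRA experiments.

```python
# Hedged sketch of replaced-token detection (ELECTRA-style) on SMILES.
# Random corruption stands in for a learned generator; the tokenizer
# checkpoint is an assumed stand-in for a SMILES tokenizer.
import torch
from transformers import ElectraConfig, ElectraForPreTraining, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
config = ElectraConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=256,
)
discriminator = ElectraForPreTraining(config)

batch = tokenizer(["CCO", "c1ccccc1"], padding=True, return_tensors="pt")
input_ids = batch["input_ids"].clone()

# Corrupt 15% of non-padding positions with random vocabulary ids; the
# discriminator learns to flag which positions were replaced (label 1).
corrupt_mask = (torch.rand(input_ids.shape) < 0.15) & batch["attention_mask"].bool()
random_ids = torch.randint(0, tokenizer.vocab_size, input_ids.shape)
corrupted = torch.where(corrupt_mask, random_ids, input_ids)
labels = corrupt_mask.long()

out = discriminator(
    input_ids=corrupted,
    attention_mask=batch["attention_mask"],
    labels=labels,
)
print("replaced-token-detection loss:", out.loss.item())
```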
In conclusion, while ChemBERTa in its current form does not yet surpass conventional techniques for molecular property prediction, its scalability and adaptability mark a compelling area for subsequent research. Advanced pretraining strategies and hybrid model architectures may further close the gap in prediction accuracy, opening new avenues for molecular discovery and informatics.