- The paper introduces MolBERT, a BERT-based model pre-trained with domain-relevant self-supervised tasks such as PhysChemPred to refine molecular embeddings for drug discovery.
- MolBERT outperforms traditional descriptors such as ECFP and neural methods such as CDDD on both QSAR and Virtual Screening benchmarks.
- Permuting input SMILES during pre-training, combined with domain-specific tasks, yields more robust and discriminative molecular representations.
Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks
The paper investigates the application of the Transformer architecture, specifically BERT, to learn molecular representations that improve drug discovery tasks such as Quantitative Structure-Activity Relationship (QSAR) modeling and Virtual Screening. The central idea is to add self-supervised pre-training tasks that incorporate domain-relevant auxiliary data, yielding refined and flexible molecular embeddings.
Methodology and Findings
The paper introduces a model named "MolBERT," which adapts BERT to molecule representation learning. Pre-training combines several self-supervised tasks, namely Masked Language Modeling (MaskedLM), SMILES equivalence (SMILES-Eq), and Physicochemical Property Prediction (PhysChemPred), and the choice and combination of these tasks strongly shapes how effective the learned representation is in downstream applications (minimal sketches of the data preparation and of the multi-task setup follow the findings below). Notable findings from the experimental assessments include:
- Task Impact on Performance: Among the self-supervised tasks evaluated, PhysChemPred had the strongest influence on performance across both QSAR and Virtual Screening benchmarks; its integration allowed MolBERT to surpass the previous state of the art. MaskedLM also contributed positively, albeit marginally, when combined with PhysChemPred.
- Evaluation Results: MolBERT performed particularly well in Virtual Screening, outperforming conventional descriptors such as Extended-Connectivity Fingerprints (ECFP) and more recent neural descriptors such as CDDD. MolBERT also excelled on QSAR benchmarks, especially when the pre-trained model was fine-tuned with task-specific heads.
- Role of Input Permutation: Permuting (randomizing) the SMILES strings during pre-training helped the model map different SMILES of the same molecule to similar embeddings, improving discrimination between molecules as well as the consistency and retrieval accuracy of repeated molecular embeddings; see the data-preparation sketch below.
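The sketch below illustrates the two domain-specific ingredients discussed above using RDKit: generating randomized (permuted) SMILES for input augmentation and computing physicochemical descriptors as regression targets for a PhysChemPred-style task. The function names and the small descriptor subset are illustrative assumptions; the paper uses a substantially larger descriptor set.

```python
# Minimal data-preparation sketch, assuming RDKit is available.
from rdkit import Chem
from rdkit.Chem import Descriptors

def randomized_smiles(smiles: str, n: int = 5):
    """Return up to n non-canonical SMILES strings for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants = {Chem.MolToSmiles(mol, doRandom=True, canonical=False) for _ in range(n)}
    return sorted(variants)

def physchem_targets(smiles: str):
    """Compute a few physicochemical descriptors to serve as regression targets."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "TPSA": Descriptors.TPSA(mol),
        "NumHDonors": Descriptors.NumHDonors(mol),
        "NumHAcceptors": Descriptors.NumHAcceptors(mol),
    }

# Example: aspirin
print(randomized_smiles("CC(=O)Oc1ccccc1C(=O)O"))
print(physchem_targets("CC(=O)Oc1ccccc1C(=O)O"))
```

In pre-training, each randomized SMILES variant would be tokenized and paired with the descriptor vector of its parent molecule, so the encoder sees many surface forms of the same underlying target.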
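To make the multi-task setup concrete, here is a minimal PyTorch sketch of a shared Transformer encoder with the three pre-training heads described above. The encoder size, pooling strategy, equal loss weighting, and head layout are assumptions for illustration rather than the authors' exact configuration (MolBERT uses a BERT encoder and tunes the task combination).

```python
# Multi-task pre-training sketch; positional encodings and attention masking
# are omitted for brevity.
import torch
import torch.nn as nn

class MolBertPretrainingSketch(nn.Module):
    def __init__(self, vocab_size=128, hidden=256, n_layers=4, n_heads=8, n_physchem=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, n_heads,
                                           dim_feedforward=4 * hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Task heads: masked-token recovery, SMILES-equivalence classification,
        # and regression onto physicochemical descriptors.
        self.masked_lm_head = nn.Linear(hidden, vocab_size)
        self.smiles_eq_head = nn.Linear(hidden, 2)
        self.physchem_head = nn.Linear(hidden, n_physchem)

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))          # (batch, seq, hidden)
        pooled = h[:, 0]                                  # [CLS]-style summary token
        return {
            "masked_lm": self.masked_lm_head(h),          # per-token vocabulary logits
            "smiles_eq": self.smiles_eq_head(pooled),      # same-molecule vs. different-molecule logits
            "physchem": self.physchem_head(pooled),        # predicted descriptor values
        }

def pretraining_loss(outputs, masked_targets, eq_labels, physchem_targets):
    # Equal task weighting is an assumption; the paper evaluates task combinations.
    lm = nn.functional.cross_entropy(outputs["masked_lm"].transpose(1, 2),
                                     masked_targets, ignore_index=-100)
    eq = nn.functional.cross_entropy(outputs["smiles_eq"], eq_labels)
    pc = nn.functional.mse_loss(outputs["physchem"], physchem_targets)
    return lm + eq + pc
```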
Implications and Future Directions
The results from MolBERT underline the potential of BERT-based architectures in cheminformatics, marking a significant stride toward utilizing transformer models for molecular data representation. Integrating domain-specific auxiliary tasks not only enhances the fidelity of the molecular embeddings but also provides a pathway to improve various drug discovery processes, such as ligand prediction and compound screening.
This approach opens avenues for adapting language models to other molecular entities such as proteins. Future research could explore pre-training strategies that incorporate additional domain-specific tasks, producing embeddings that are more informative and better aligned with biochemical properties. Augmenting MolBERT with multi-modal inputs, such as structural protein data, might also improve predictions of complex bioactivity interactions, pointing toward more comprehensive AI-driven drug design and discovery platforms.