Motif-based Graph Self-Supervised Learning for Molecular Property Prediction
The paper "Motif-based Graph Self-Supervised Learning for Molecular Property Prediction" introduces an innovative framework termed Motif-based Graph Self-Supervised Learning (MGSSL) that enhances the efficacy of Graph Neural Networks (GNNs) in molecular property prediction. The proposed framework leverages the intrinsic structural motifs within molecular graphs, which address the shortcomings of existing self-supervised learning paradigms that predominantly focus on node-level or graph-level tasks. This method underscores the value of capturing the semantic and structural intricacies at the level of subgraphs, particularly the functional groups in molecular chemistry.
Methodology
The authors implement MGSSL through a motif generation framework comprising three key stages:
- Motif Extraction: Molecular graphs are fragmented with BRICS, a retrosynthesis-based algorithm, augmented with additional rules. The resulting fragments define motifs, i.e., recurrent, chemically significant subgraph patterns such as functional groups. This choice of fragmentation targets motifs that are both chemically meaningful and computationally convenient for pre-training a GNN (a minimal sketch of the BRICS step follows this list).
- Generative Pre-Training Framework: A multi-scale generative pre-training mechanism is employed, in which the GNN predicts topology and motif labels in a motif-wise generation process. Two generation orders, breadth-first and depth-first, are explored to enhance the motif-based generative process (the two orders are contrasted in a sketch below).
- Multi-level Self-Supervised Pre-Training: MGSSL combines atom-level and motif-level pre-training tasks so that both scales of structure inform the learned representations. The relative weights of these self-supervised tasks are adapted with the Frank-Wolfe algorithm, harmonizing them during training (see the Frank-Wolfe sketch below).
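To make the extraction step concrete, here is a minimal sketch of BRICS fragmentation using RDKit. It shows only the plain BRICS decomposition; the paper's additional fragmentation rules are omitted, and the aspirin SMILES is merely an illustrative input, not an example from the paper.

```python
# Minimal sketch of BRICS-based motif extraction with RDKit.
# MGSSL layers extra rules on top of BRICS; only the plain
# BRICS step is shown here.
from rdkit import Chem
from rdkit.Chem import BRICS

def extract_motifs(smiles: str):
    """Fragment a molecule with BRICS and return motif SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # BRICSDecompose yields SMILES of the leaf fragments; dummy
    # atoms (e.g., [1*]) mark the broken retrosynthetic bonds.
    return sorted(BRICS.BRICSDecompose(mol))

# Aspirin: acetyl, ester/carboxyl, and aromatic-ring fragments emerge.
print(extract_motifs("CC(=O)Oc1ccccc1C(=O)O"))
```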
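The following sketch contrasts, in plain Python, what breadth-first and depth-first generation orders look like over a motif tree. The tree and its node names are hypothetical, chosen purely to illustrate the two traversals; in the paper's actual framework, a GNN predicts topology and motif labels at each generation step.

```python
# Framework-agnostic sketch of the two motif orderings.
# The motif tree below is hypothetical and purely illustrative.
from collections import deque

def bfs_order(tree, root):
    """Breadth-first motif generation order: expand level by level."""
    order, queue, seen = [], deque([root]), {root}
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in tree[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order

def dfs_order(tree, root, seen=None):
    """Depth-first motif generation order: finish one branch first."""
    if seen is None:
        seen = set()
    seen.add(root)
    order = [root]
    for nbr in tree[root]:
        if nbr not in seen:
            order.extend(dfs_order(tree, nbr, seen))
    return order

# Hypothetical motif tree: a scaffold ring connected to two groups.
motif_tree = {"ring": ["ester", "amide"], "ester": ["methyl"],
              "amide": [], "methyl": []}
print(bfs_order(motif_tree, "ring"))  # ['ring', 'ester', 'amide', 'methyl']
print(dfs_order(motif_tree, "ring"))  # ['ring', 'ester', 'methyl', 'amide']
```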
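The weighting step can be sketched as a Frank-Wolfe procedure of the kind popularized for multi-task learning: find simplex weights minimizing the norm of the weighted sum of per-task gradients. Whether MGSSL uses exactly this objective is an assumption here; the gradients below are random stand-ins, and `frank_wolfe_weights` is an illustrative name, not the paper's code.

```python
# Sketch of Frank-Wolfe task weighting over the probability simplex,
# assuming the common min || sum_i w_i * g_i ||^2 multi-task objective.
import numpy as np

def frank_wolfe_weights(grads, iters=50):
    """Find simplex weights w minimizing || sum_i w_i * g_i ||^2."""
    n = len(grads)
    G = np.stack(grads)          # (n_tasks, n_params)
    M = G @ G.T                  # Gram matrix of task gradients
    w = np.full(n, 1.0 / n)      # start from uniform weights
    for _ in range(iters):
        t = int(np.argmin(M @ w))  # vertex minimizing the linearization
        # Closed-form line search between w and the vertex e_t.
        g_w, g_t = M @ w, M[t]
        denom = w @ g_w - 2 * (w @ g_t) + M[t, t]
        gamma = 0.0 if denom <= 0 else np.clip(
            (w @ g_w - w @ g_t) / denom, 0.0, 1.0)
        w = (1 - gamma) * w + gamma * np.eye(n)[t]
    return w

# Two hypothetical tasks: atom-level and motif-level pre-training losses.
rng = np.random.default_rng(0)
g_atom, g_motif = rng.normal(size=100), rng.normal(size=100)
print(frank_wolfe_weights([g_atom, g_motif]))
```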
Experimental Findings
Extensive experiments on several molecular property prediction benchmarks show that MGSSL outperforms all evaluated state-of-the-art self-supervised learning baselines for GNNs. In particular, models pre-trained with MGSSL achieve higher average ROC-AUC scores across the datasets. The method is effective across different GNN architectures, improving both model performance and convergence speed during training.
Implications and Future Directions
The proposed framework is a notable advance in that it captures rich structural and semantic information from motifs in graph-based data, which is particularly useful in domains requiring molecular property prediction. Practically, MGSSL reduces dependence on scarce labeled data by exploiting abundant unlabeled molecular data.
Theoretically, incorporating motif-level information enriches the representational capacity of GNNs and motivates further research into motif-centric graph learning paradigms. Future work could explore adapting motif-based self-supervised learning to domains beyond molecular chemistry, as well as refining the motif extraction and generation processes to improve model generalization and scalability.