MoCL: Data-driven Molecular Fingerprint via Knowledge-aware Contrastive Learning from Molecular Graph
The paper introduces MoCL, a framework for learning molecular graph representations via knowledge-aware contrastive learning. It addresses two limitations of existing graph contrastive learning methods: general-purpose augmentation strategies that can alter molecular semantics in domains such as biomedicine, and the neglect of global data structure during representation learning.
Methodology
The authors leverage domain-specific knowledge at both the local and global level to improve molecular graph representation learning. The approach involves:
- Local-Level Domain Knowledge:
- A new augmentation strategy, substructure substitution, replaces substructures of a molecular graph with bioisosteres, introducing structural variation while preserving the molecule's biological properties (a minimal sketch follows this list).
- The augmentation is governed by 230 expert-defined transformation rules, ensuring that augmented graphs retain the essential properties of the original molecule. This contrasts with generic perturbations such as node dropping, edge dropping, or subgraph extraction, which can break chemical validity.
- Global-Level Domain Knowledge:
- By incorporating global similarity information into contrastive learning, MoCL moves beyond the traditional focus on local perturbations.
- Extended-Connectivity Fingerprints (ECFPs) supply a structural similarity measure, while drug-target interactions offer complementary biological activity information.
- Two strategies for exploiting this global similarity are explored: a least-squares loss and a contrastive loss, both supervised by the global similarity information (a loss sketch appears after the next paragraph).
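To make the local augmentation concrete, below is a minimal sketch of rule-based substructure substitution using RDKit. The single rule shown (carboxylic acid to tetrazole, a textbook bioisosteric replacement) is an illustrative stand-in, not one of the paper's 230 curated rules, and the helper names are hypothetical.

```python
from rdkit import Chem

# Hypothetical rule table: (query SMARTS, replacement SMILES).
# The COOH -> tetrazole pair is a classic bioisosteric substitution;
# MoCL's actual rule set contains 230 such expert-curated rules.
RULES = [
    ("[CX3](=O)[OX2H1]", "c1nnn[nH]1"),
]

def augment(smiles, rules=RULES):
    """Yield SMILES of molecules produced by applying each matching rule."""
    mol = Chem.MolFromSmiles(smiles)
    for query_smarts, repl_smiles in rules:
        query = Chem.MolFromSmarts(query_smarts)
        if not mol.HasSubstructMatch(query):
            continue  # this rule does not apply to the molecule
        repl = Chem.MolFromSmiles(repl_smiles)
        for product in Chem.ReplaceSubstructs(mol, query, repl, replaceAll=True):
            Chem.SanitizeMol(product)  # validate the augmented molecule
            yield Chem.MolToSmiles(product)

# Benzoic acid -> phenyl tetrazole: a property-preserving augmentation.
print(list(augment("OC(=O)c1ccccc1")))
```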
The framework thus adopts a dual contrastive objective, enriching graph representations with both local structural and global semantic signals.
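The sketch below illustrates one way to structure this dual objective, assuming PyTorch and RDKit; the function names and the weighting term `lam` are illustrative choices rather than the paper's exact formulation. The global target matrix is built from ECFP Tanimoto similarities, in the spirit of the least-squares variant.

```python
import torch
import torch.nn.functional as F
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def ecfp_similarity_matrix(smiles_list, radius=2, n_bits=2048):
    """Pairwise Tanimoto similarity over ECFP (Morgan) fingerprints."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, n_bits)
           for s in smiles_list]
    sim = [[DataStructs.TanimotoSimilarity(a, b) for b in fps] for a in fps]
    return torch.tensor(sim, dtype=torch.float32)

def local_nt_xent(z1, z2, tau=0.5):
    """Local term: NT-Xent between anchors z1 and their augmented views z2."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                            # (B, B) similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on diagonal
    return F.cross_entropy(logits, labels)

def global_least_squares(z, sim):
    """Global term: regress pairwise embedding similarity onto `sim`."""
    z = F.normalize(z, dim=1)
    return F.mse_loss(z @ z.t(), sim)

def dual_objective(z_anchor, z_aug, sim, lam=0.1):
    # Combine local structural and global semantic supervision.
    return local_nt_xent(z_anchor, z_aug) + lam * global_least_squares(z_anchor, sim)
```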
Evaluation
MoCL was evaluated on several datasets, including BACE, BBBP, and MUTAG, among others from the chemistry and toxicology domains. Pretrained models were assessed under linear and semi-supervised evaluation protocols to measure how well the learned representations transfer to downstream tasks. Findings include:
- Linear Evaluation: MoCL-DK, the variant using domain-knowledge augmentations, outperformed generic augmentation methods; combining it with attribute masking often yielded the best results, indicating the benefit of composing augmentations (a minimal protocol sketch follows this list).
- Semi-Supervised Evaluation: MoCL also showed competitive results even with limited label availability, suggesting its utility in data-scarce environments.
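As a rough illustration of the linear protocol, the sketch below freezes a pretrained encoder and fits a linear probe on its embeddings. The `encode` function is a hypothetical stand-in for the pretrained MoCL encoder, and ROC-AUC is assumed as the downstream metric.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def linear_eval(encode, X_train, y_train, X_test, y_test):
    """Fit a linear probe on frozen embeddings and report test ROC-AUC."""
    z_train = encode(X_train)  # (N, d) embeddings from the frozen encoder
    z_test = encode(X_test)
    clf = LogisticRegression(max_iter=1000).fit(z_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(z_test)[:, 1])
```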
MoCL achieved statistically significant improvements on most datasets, particularly when knowledge-driven local augmentations were paired with global similarity supervision.
Implications
The results indicate that MoCL provides a strong foundation for molecular graph representation learning, with implications on two fronts:
- Practical: MoCL offers potential benefits for drug discovery by improving predictive models of molecule functionality and interactions, crucial for developing novel therapeutics.
- Theoretical: The methodological shift towards incorporating domain-specific knowledge in contrastive learning bears relevance for other graph-based learning tasks, suggesting broader applications beyond molecular graphs.
Potential Future Developments
- Integration with Biological Databases: Augmenting MoCL with more extensive biological interaction networks and chemical databases could further improve its predictive accuracy and applicability.
- Cross-Domain Transferability: Investigating MoCL's efficacy across different domains of graph-based data could reveal new avenues for applying knowledge-aware contrastive learning.
In conclusion, MoCL represents a substantive advance in using domain-specific augmentations and global similarity to guide contrastive learning in the molecular domain. Its methodological innovations pave the way for better molecular graph understanding, with promising implications for both practical applications and theoretical developments in AI.