Protein Representation Learning by Geometric Structure Pretraining (2203.06125v5)

Published 11 Mar 2022 in cs.LG

Abstract: Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein LLMs on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.

Citations (181)

View on Semantic Scholar

Summary

The paper demonstrates that incorporating geometric structure via GearNet yields significantly improved protein representations over traditional sequence-based methods.
It introduces GearNet-Edge with an innovative edge message passing mechanism to better capture residue-level interactions and enhance model accuracy.
Experimental results showcase robust performance across tasks such as enzyme classification, GO term prediction, and protein function annotation.

Protein Representation Learning by Geometric Structure Pretraining

The paper entitled "Protein Representation Learning by Geometric Structure Pretraining" introduces a novel method designed to enhance the efficacy of protein representation learning, particularly by emphasizing geometric structure. This research addresses a critical gap in the domain, where existing methodologies predominantly rely on protein sequence data without fully leveraging structural information. The primary contribution of this work lies in its introduction of a structure-based encoder that significantly outperforms traditional sequence-based techniques by utilizing the three-dimensional geometric features of proteins.

The approach centers on a novel model, the Geometry-Aware Relational Graph Neural Network (GearNet), which incorporates geometric structure into protein representations by encoding spatial and chemical information through a residue-level relational graph. This method is expanded in GearNet-Edge, where an edge message passing mechanism is introduced to model interactions between residues more effectively. This mechanism, derived from emerging trends in graph neural networks, demonstrates substantial improvements over existing methods.

A prominent feature of the methodology is its geometric pretraining strategy, which is situated within a self-supervised learning framework. The authors propose five distinct pretraining approaches: multiview contrastive learning and four self-prediction tasks involving residue types, distances, angular constraints, and dihedrals. Multiview contrastive learning is a noteworthy innovation that captures biological correlations within protein structures by contrasting different views derived from structural motifs. The self-prediction tasks enrich the protein representations by compelling the model to infer masked structural features.

The paper presents rigorous experimental evaluations, including benchmarks on enzyme commission (EC) number prediction, Gene Ontology (GO) term prediction, fold classification, and reaction classification tasks. The results indicate that GearNet and its variants consistently outperform state-of-the-art sequence-based models, such as those pretrained on massive datasets like ProtBERT-BFD and ESM-1b. Notably, the GearNet-Edge, even when trained on smaller datasets comprising less than a million protein structures, performs comparably or better than models utilizing extensive sequence data for pretraining.

A significant implication of this research is its demonstration of the value in integrating geometric structural information into protein representation learning. The proposed methods show potential for wide applicability in various biological tasks, including protein function annotation, protein-protein interaction prediction, and protein design. Given the benefits of these methods, future research may explore scaling these models with even larger datasets, such as the continually expanding AlphaFold database, and applying them to other relevant applications in biomedical research.

In conclusion, this paper contributes substantially to the domain of protein representation learning by illustrating that geometric pretraining is an effective strategy for improving model performance across numerous biological tasks. Its approach not only promises enhanced accuracy but also offers a new paradigm for using structural information to improve predictive models in computational biology.

PDF Markdown

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Generate Now

Related Papers

Authors (7)

GitHub

GitHub - DeepGraphLearning/GearNet: GearNet and Geometric Pretraining Methods for Protein Structure Representation Learning, ICLR'2023 (https://arxiv.org/abs/2203.06125) (298 stars)