
OntoProtein: Protein Pretraining With Gene Ontology Embedding (2201.11147v6)

Published 23 Jan 2022 in q-bio.BM, cs.AI, cs.CL, cs.IR, and cs.LG

Abstract: Self-supervised protein LLMs have proved effective at learning protein representations. With increasing computational power, current protein LLMs, pre-trained on millions of diverse sequences, can scale from millions to billions of parameters and achieve remarkable improvements. However, these prevailing approaches rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better protein representations. We argue that informative biological knowledge in KGs can enhance protein representations with external knowledge. In this work, we propose OntoProtein, the first general framework that incorporates the structure of GO (Gene Ontology) into protein pre-training models. We construct a novel large-scale knowledge graph consisting of GO terms and their associated proteins, in which every node is described by gene annotation text or a protein sequence. We propose a novel contrastive learning objective with knowledge-aware negative sampling to jointly optimize the knowledge-graph and protein embeddings during pre-training. Experimental results show that OntoProtein surpasses state-of-the-art pre-trained protein LLMs on the TAPE benchmark and outperforms baselines on protein-protein interaction and protein function prediction. Code and datasets are available at https://github.com/zjunlp/OntoProtein.

Authors (9)
  1. Ningyu Zhang (148 papers)
  2. Zhen Bi (67 papers)
  3. Xiaozhuan Liang (14 papers)
  4. Siyuan Cheng (41 papers)
  5. Haosen Hong (1 paper)
  6. Shumin Deng (65 papers)
  7. Jiazhang Lian (1 paper)
  8. Qiang Zhang (467 papers)
  9. Huajun Chen (199 papers)
Citations (81)

Summary

  • The paper introduces OntoProtein, which pretrains protein models by integrating Gene Ontology knowledge through contrastive learning with knowledge-aware negative sampling.
  • It demonstrates significant improvements in protein-protein interaction and function prediction tasks, outperforming state-of-the-art PLMs on the TAPE benchmark.
  • The paper also provides ProteinKG25, a large-scale dataset of protein and GO annotations that supports further advancements in biological language modeling.

Overview of "OntoProtein: Protein Pretraining With Gene Ontology Embedding"

This paper introduces a novel model named OntoProtein, designed to enhance the pre-training of protein LLMs by incorporating structured knowledge from Gene Ontology (GO). Using a knowledge graph (KG) of GO terms and their associated proteins, the framework leverages external knowledge to enrich protein embeddings. This approach aims to remedy a limitation of current protein LLMs (PLMs), which typically rely solely on sequence data and overlook the potential contributions of comprehensive biological knowledge bases.
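
To make the setup concrete, the sketch below shows one plausible way to represent such a KG in Python, with protein nodes carrying amino-acid sequences and GO term nodes carrying annotation text. The class and field names, as well as the example identifiers, are illustrative assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class ProteinNode:
    """A protein entity, described by its amino-acid sequence."""
    accession: str
    sequence: str

@dataclass(frozen=True)
class GoTermNode:
    """A GO term entity, described by its gene annotation text."""
    go_id: str
    description: str

@dataclass(frozen=True)
class Triple:
    """A (head, relation, tail) fact: either a protein-GO annotation
    or a GO-GO link encoding the ontology structure."""
    head: Union[ProteinNode, GoTermNode]
    relation: str
    tail: GoTermNode

# Hypothetical example: hemoglobin subunit alpha annotated with a
# molecular-function term (sequence truncated, for illustration only).
protein = ProteinNode("P69905", "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMF")
term = GoTermNode("GO:0005344", "oxygen carrier activity")
fact = Triple(protein, "enables", term)
```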

The authors propose a hybrid methodology that combines contrastive learning with knowledge-aware negative sampling to integrate protein sequences and GO knowledge during the pre-training phase. Implemented within the OntoProtein framework, this approach significantly outperforms existing models on tasks such as protein-protein interaction and protein function prediction.
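
As a rough illustration of how such a joint objective can be set up, the PyTorch sketch below combines a masked-language-modeling loss with a margin-based knowledge-embedding loss over corrupted triples. The TransE-style scoring function, the margin formulation, and the weighting hyperparameter `alpha` are simplifying assumptions made for this sketch; the paper's actual objective and its knowledge-aware negative sampler differ in detail.

```python
import torch
import torch.nn.functional as F

def transe_score(h: torch.Tensor, r: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # TransE-style plausibility: higher when h + r lies close to t.
    return -torch.norm(h + r - t, p=2, dim=-1)

def ke_contrastive_loss(h, r, t, neg_t, margin: float = 1.0) -> torch.Tensor:
    """Margin-based contrastive loss for one positive triple and k
    negatives with corrupted tails; a knowledge-aware sampler would
    draw those tails from nearby GO terms so the negatives are hard.

    h, r, t: (d,) embeddings of the positive triple.
    neg_t:   (k, d) embeddings of k corrupted tail entities.
    """
    pos = transe_score(h, r, t)                                # scalar
    neg = transe_score(h.unsqueeze(0), r.unsqueeze(0), neg_t)  # (k,)
    return F.relu(margin - pos + neg).mean()

def joint_loss(mlm_loss: torch.Tensor, ke_loss: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    # Joint pre-training objective: sequence-level MLM loss plus the
    # knowledge-embedding loss, traded off by an assumed weight alpha.
    return mlm_loss + alpha * ke_loss

# Usage with random embeddings: dimension d = 128, k = 8 negatives.
d, k = 128, 8
loss = joint_loss(torch.tensor(2.3),
                  ke_contrastive_loss(torch.randn(d), torch.randn(d),
                                      torch.randn(d), torch.randn(k, d)))
```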

Experimental validation on the TAPE benchmark shows OntoProtein surpassing state-of-the-art PLMs, meaningfully improving performance on tasks that demand comprehensive protein understanding. The KL divergence between the predicted and true label distributions also decreased, indicating improved calibration and more robust performance.

Key strengths of OntoProtein include its ability to use external KG knowledge without modifying the model architecture, allowing it to be integrated easily into existing systems. Furthermore, OntoProtein introduces ProteinKG25, a large-scale dataset aligning proteins with GO annotations, so the paper contributes not only methodological advances but also an extensive resource for future research.

Implications and Future Directions

OntoProtein carries significant implications for both practical applications and theoretical advancements. Practically, its improved accuracy on protein-related predictions can benefit various domains, including drug discovery, genetic research, and synthetic biology. Theoretically, the integration of structured knowledge from KGs into PLMs suggests new paradigms for understanding biological processes and for protein language modeling.

Looking forward, several intriguing possibilities merit exploration:

  1. Extending OntoProtein's framework to incorporate additional biological databases beyond GO, which could introduce broader biological insights and improve understanding across more diverse protein functions.
  2. Investigating methods to address the long-tail distribution of data in biological datasets, refining representation for underrepresented classes within protein datasets.
  3. Expanding beyond sequence-based tasks to potentially encompass generative tasks aligned with protein engineering and synthetic biology applications.

In conclusion, OntoProtein marks a promising evolution in protein pre-training strategies, yielding notable results through the intelligent integration of structured biological knowledge. Its capacity to harmonize protein sequence data with a comprehensive KG paves the way for further exploration of biology-informed language modeling.
