
Accurate and efficient protein embedding using multi-teacher distillation learning (2405.11735v1)

Published 20 May 2024 in q-bio.GN

Abstract: Motivation: Protein embedding, which represents proteins as numerical vectors, is a crucial step in various learning-based protein annotation/classification problems, including gene ontology prediction, protein-protein interaction prediction, and protein structure prediction. However, existing protein embedding methods are often computationally expensive due to their large number of parameters, which can reach millions or even billions. The growing availability of large-scale protein datasets and the need for efficient analysis tools have created a pressing demand for efficient protein embedding methods. Results: We propose a novel protein embedding approach based on multi-teacher distillation learning, which leverages the knowledge of multiple pre-trained protein embedding models to learn a compact and informative representation of proteins. Our method achieves comparable performance to state-of-the-art methods while significantly reducing computational costs and resource requirements. Specifically, our approach reduces computational time by ~70% and maintains almost the same accuracy as the original large models. This makes our method well-suited for large-scale protein analysis and enables the bioinformatics community to perform protein embedding tasks more efficiently.
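The page does not include code, but the core idea of multi-teacher distillation can be illustrated with a minimal sketch. The snippet below assumes PyTorch, a small hypothetical student encoder, and stand-in tensors in place of frozen teacher protein language models; the architecture, dimensions, and loss weights are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StudentEmbedder(nn.Module):
    """Compact student that maps per-residue features to a fixed-size protein embedding."""
    def __init__(self, input_dim: int, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim); mean-pool over residues for a per-protein vector
        return self.encoder(x).mean(dim=1)

def multi_teacher_distillation_loss(student_emb, teacher_embs, projections, weights):
    """Weighted sum of MSE losses between projected student embeddings and each teacher's embedding."""
    loss = torch.tensor(0.0)
    for proj, teacher_emb, w in zip(projections, teacher_embs, weights):
        loss = loss + w * nn.functional.mse_loss(proj(student_emb), teacher_emb)
    return loss

# --- Hypothetical training step: teachers and data are stand-ins, not the paper's setup ---
batch, seq_len, input_dim = 8, 512, 21            # e.g. one-hot amino-acid features
teacher_dims = [1024, 768]                        # e.g. two pre-trained protein language models
student = StudentEmbedder(input_dim, 256, 128)
projections = nn.ModuleList(nn.Linear(128, d) for d in teacher_dims)
optimizer = torch.optim.Adam(
    list(student.parameters()) + list(projections.parameters()), lr=1e-3
)

x = torch.randn(batch, seq_len, input_dim)                     # stand-in for encoded sequences
teacher_embs = [torch.randn(batch, d) for d in teacher_dims]   # stand-in for frozen teacher outputs

student_emb = student(x)
loss = multi_teacher_distillation_loss(student_emb, teacher_embs, projections, [0.5, 0.5])
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this sketch the student only ever sees the teachers' output embeddings, so the large teacher models can be discarded after training, which is where the reported ~70% reduction in computational time would come from.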


