
pLDDT-Predictor: High-speed Protein Screening Using Transformer and ESM2 (2410.21283v3)

Published 11 Oct 2024 in q-bio.BM, cs.AI, and cs.LG

Abstract: Recent advancements in protein structure prediction, particularly AlphaFold2, have revolutionized structural biology by achieving near-experimental accuracy (average RMSD $< 1.5$ Å). However, the computational demands of these models (approximately 30 minutes per protein on an RTX 4090) significantly limit their application in high-throughput protein screening. While LLMs like ESM (Evolutionary Scale Modeling) have shown promise in extracting structural information directly from protein sequences, rapid assessment of protein structure quality for large-scale analyses remains a major challenge. We introduce pLDDT-Predictor, a high-speed protein screening tool that achieves a $250{,}000\times$ speedup compared to AlphaFold2 by leveraging pre-trained ESM2 protein embeddings and a Transformer architecture. Our model predicts AlphaFold2's pLDDT (predicted Local Distance Difference Test) scores with a Pearson correlation of 0.7891 and processes proteins in just 0.007 seconds on average. Using a comprehensive dataset of 1.5 million diverse protein sequences (ranging from 50 to 2048 amino acids), we demonstrate that pLDDT-Predictor accurately classifies high-confidence structures (pLDDT $> 70$) with 91.2% accuracy and achieves an MSE of 84.8142 compared to AlphaFold2's predictions. The source code and pre-trained models are freely available at https://github.com/jw-chae/pLDDT_Predictor, enabling the research community to perform rapid, large-scale protein structure quality assessments.
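The abstract's headline numbers can be sanity-checked with a few lines of arithmetic, and the three reported evaluation metrics (Pearson correlation, MSE, and accuracy at the pLDDT > 70 threshold) are simple to state in code. The following is a minimal sketch, not the authors' evaluation script: the function names and the toy score lists are hypothetical, and only the timings and the threshold of 70 come from the abstract.

```python
import math

# Timings quoted in the abstract.
ALPHAFOLD2_SECONDS = 30 * 60   # "approximately 30 minutes per protein"
PREDICTOR_SECONDS = 0.007      # "0.007 seconds on average"


def speedup(baseline_seconds, fast_seconds):
    """Ratio of baseline runtime to the faster runtime."""
    return baseline_seconds / fast_seconds


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mse(xs, ys):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def threshold_accuracy(predicted, target, threshold=70.0):
    """Fraction of items where both scores fall on the same side of the threshold."""
    hits = sum((p > threshold) == (t > threshold) for p, t in zip(predicted, target))
    return hits / len(predicted)


# Hypothetical toy data: AlphaFold2 pLDDT values and a fast model's estimates.
alphafold_plddt = [90.0, 60.0, 75.0, 40.0]
predicted_plddt = [85.0, 65.0, 72.0, 50.0]

if __name__ == "__main__":
    # 1800 s / 0.007 s is roughly 257,000x, in line with the quoted 250,000x.
    print(f"speedup: {speedup(ALPHAFOLD2_SECONDS, PREDICTOR_SECONDS):,.0f}x")
    print(f"pearson: {pearson(predicted_plddt, alphafold_plddt):.4f}")
    print(f"mse: {mse(predicted_plddt, alphafold_plddt):.2f}")
    print(f"accuracy at pLDDT > 70: {threshold_accuracy(predicted_plddt, alphafold_plddt):.2f}")
```

Since pLDDT is on a 0 to 100 scale, the reported MSE of 84.8142 corresponds to a root-mean-square error of about 9.2 pLDDT points, which helps put the 91.2% threshold accuracy in context.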

