pLDDT-Predictor: High-speed Protein Screening Using Transformer and ESM2 (2410.21283v3)
Abstract: Recent advances in protein structure prediction, particularly AlphaFold2, have revolutionized structural biology by achieving near-experimental accuracy ($\text{average RMSD} < 1.5\,\text{\AA}$). However, the computational cost of these models (approximately 30 minutes per protein on an RTX 4090) significantly limits their use in high-throughput protein screening. Large language models (LLMs) such as ESM (Evolutionary Scale Modeling) have shown promise in extracting structural information directly from protein sequences, but rapidly assessing protein structure quality in large-scale analyses remains a major challenge. We introduce pLDDT-Predictor, a high-speed protein screening tool that achieves a $250{,}000\times$ speedup over AlphaFold2 by combining pre-trained ESM2 protein embeddings with a Transformer architecture. Our model predicts AlphaFold2's pLDDT (predicted Local Distance Difference Test) scores with a Pearson correlation of 0.7891 and processes each protein in 0.007 seconds on average. On a dataset of 1.5 million diverse protein sequences (ranging from 50 to 2048 amino acids), pLDDT-Predictor classifies high-confidence structures (pLDDT $>$ 70) with 91.2\% accuracy and achieves an MSE of 84.8142 against AlphaFold2's pLDDT scores. The source code and pre-trained models are freely available at https://github.com/jw-chae/pLDDT_Predictor, enabling the research community to perform rapid, large-scale protein structure quality assessments.
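The abstract reports three evaluation quantities (Pearson correlation, MSE, and accuracy at the pLDDT $>$ 70 threshold) together with a speedup figure. The following is a minimal sketch of how such quantities could be computed, assuming one scalar pLDDT score per protein. The function names and toy scores are hypothetical and are not taken from the pLDDT_Predictor repository.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def mse(predicted, reference):
    """Mean squared error between predicted and reference scores."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def threshold_accuracy(predicted, reference, threshold=70.0):
    """Fraction of proteins placed on the same side of the threshold
    (high-confidence means score > threshold) by both score sources."""
    hits = sum((p > threshold) == (r > threshold) for p, r in zip(predicted, reference))
    return hits / len(predicted)


def speedup(baseline_seconds, fast_seconds):
    """Ratio of per-protein runtimes."""
    return baseline_seconds / fast_seconds


# Toy, made-up scores for illustration only (not data from the paper).
alphafold_plddt = [92.1, 55.3, 78.4, 40.2, 85.0]
predicted_plddt = [88.0, 60.1, 69.5, 45.7, 90.2]

print("Pearson r:", pearson(predicted_plddt, alphafold_plddt))
print("MSE:", mse(predicted_plddt, alphafold_plddt))
print("Accuracy at pLDDT > 70:", threshold_accuracy(predicted_plddt, alphafold_plddt))

# Runtimes quoted in the abstract: about 30 minutes versus 0.007 seconds per protein.
print("Speedup:", speedup(30 * 60, 0.007))
```

With the runtimes quoted in the abstract, the ratio comes to about 257,000, which is consistent with the reported $250{,}000\times$ figure.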