Beyond ESM2: Graph-Enhanced Protein Sequence Modeling with Efficient Clustering (2404.15805v1)

Published 24 Apr 2024 in q-bio.BM and cs.LG

Abstract: Proteins are essential to life's processes, underpinning evolution and diversity. Advances in sequencing technology have revealed millions of proteins, underscoring the need for sophisticated pre-trained protein models for biological analysis and AI development. Facebook's ESM2, the most advanced protein language model to date, leverages a masked prediction task for unsupervised learning, crafting amino acid representations with notable biochemical accuracy. Yet it falls short in delivering functional protein insights, signaling an opportunity for enhancing representation quality. Our study addresses this gap by incorporating protein family classification into ESM2's training. This approach, augmented with a Community Propagation-Based Clustering Algorithm, improves global protein representations, while a contextual prediction task fine-tunes local amino acid accuracy. Significantly, our model achieved state-of-the-art results in several downstream experiments, demonstrating the power of combining global and local methodologies to substantially boost protein representation quality.
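The abstract pairs two complementary objectives: a global, sequence-level protein-family classification signal layered on top of ESM2's masked prediction, and a local, per-residue contextual prediction task. As a rough illustration only, the PyTorch sketch below shows one way such a joint objective can be wired up; the encoder size, mean pooling, head shapes, and loss weight `alpha` are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only (not the authors' released code): a joint objective
# combining ESM2-style masked-residue prediction (local) with a pooled
# protein-family classification head (global), as the abstract describes.
# Encoder size, mean pooling, head shapes, and `alpha` are assumptions.
import torch
import torch.nn as nn

class JointProteinModel(nn.Module):
    def __init__(self, vocab_size=33, d_model=320, n_heads=8,
                 n_layers=6, n_families=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)     # local: per-residue logits
        self.family_head = nn.Linear(d_model, n_families)  # global: per-sequence logits

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))               # (B, L, d_model)
        return self.mlm_head(h), self.family_head(h.mean(dim=1))

def joint_loss(mlm_logits, fam_logits, mlm_targets, fam_targets, alpha=0.5):
    """Masked-LM loss plus a weighted family-classification loss.
    In real training, mlm_targets holds -100 at unmasked positions so the
    loss is computed only on masked residues."""
    l_mlm = nn.functional.cross_entropy(mlm_logits.transpose(1, 2),
                                        mlm_targets, ignore_index=-100)
    l_fam = nn.functional.cross_entropy(fam_logits, fam_targets)
    return l_mlm + alpha * l_fam

# Toy usage: random tokens stand in for masked protein sequences.
tokens = torch.randint(0, 33, (2, 128))
families = torch.randint(0, 1000, (2,))
model = JointProteinModel()
mlm_logits, fam_logits = model(tokens)
loss = joint_loss(mlm_logits, fam_logits, tokens, families)
```

Per the abstract, the family-level labels are informed by the Community Propagation-Based Clustering Algorithm; how the clustering output feeds the classification head is not specified there, so the sketch simply takes family targets as given.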
