
Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction (2405.06729v1)

Published 10 May 2024 in q-bio.GN and cs.LG

Abstract: Protein Language Models (PLMs) have emerged as performant and scalable tools for predicting the functional impact and clinical significance of protein-coding variants, but they still lag experimental accuracy. Here, we present a novel fine-tuning approach to improve the performance of PLMs with experimental maps of variant effects from Deep Mutational Scanning (DMS) assays using a Normalised Log-odds Ratio (NLR) head. We find consistent improvements in a held-out protein test set, and on independent DMS and clinical variant annotation benchmarks from ProteinGym and ClinVar. These findings demonstrate that DMS is a promising source of sequence diversity and supervised training data for improving the performance of PLMs for variant effect prediction.
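
The abstract does not define the NLR head, so the following is only an illustrative sketch of the general idea behind log-odds variant scoring: a variant is scored as the log-probability of the mutant residue minus that of the wild-type residue at the same position. The function names, the toy logits, and the z-score normalisation step are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def log_odds_ratio(logits, wt_idx, mut_idx):
    """Score one substitution as log p(mutant) - log p(wild type).

    `logits` holds a model's raw per-residue scores at a single position
    (hypothetical input; any PLM output head could supply it).
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax over the residue vocabulary.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return log_probs[mut_idx] - log_probs[wt_idx]

def normalise(scores, eps=1e-8):
    """Z-score a set of variant scores for one protein.

    This normalisation is an assumed stand-in; the paper's actual
    normalisation is not described in the abstract.
    """
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + eps)

# Toy example: three-residue vocabulary, wild type at index 0.
toy_logits = [2.0, 1.0, 0.0]
raw = [log_odds_ratio(toy_logits, wt_idx=0, mut_idx=m) for m in (1, 2)]
scaled = normalise(raw)
```

In a fine-tuning setting, scores of this kind could be regressed against DMS measurements, which is why putting them on a comparable scale across proteins matters.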

