Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction (2405.06729v1)
Abstract: Protein Language Models (PLMs) have emerged as performant and scalable tools for predicting the functional impact and clinical significance of protein-coding variants, but they still lag behind experimental accuracy. Here, we present a novel fine-tuning approach that improves the performance of PLMs using experimental maps of variant effects from Deep Mutational Scanning (DMS) assays, via a Normalised Log-odds Ratio (NLR) head. We find consistent improvements on a held-out protein test set and on independent DMS and clinical variant annotation benchmarks from ProteinGym and ClinVar. These findings demonstrate that DMS is a promising source of sequence diversity and supervised training data for improving the performance of PLMs for variant effect prediction.
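The abstract does not spell out how the NLR head is computed, so the following is only an illustrative sketch under assumed definitions. It scores each substitution by the log-odds ratio between the mutant and wild-type residue, and then applies a per-protein z-score as a stand-in for the normalisation step. The function names, the z-score normalisation, and the random logits (used here in place of real PLM outputs) are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def log_odds_ratios(logits, wild_type):
    """Return an (L, 20) array of log p(mutant) - log p(wild type).

    logits: (L, 20) per-position amino-acid logits (hypothetical PLM output).
    wild_type: wild-type sequence of length L.
    """
    log_probs = log_softmax(logits)
    wt_idx = np.array([AMINO_ACIDS.index(a) for a in wild_type])
    wt_log_prob = log_probs[np.arange(len(wild_type)), wt_idx]
    return log_probs - wt_log_prob[:, None]


def normalise(scores, eps=1e-8):
    """Assumed normalisation: z-score over all variants of one protein."""
    return (scores - scores.mean()) / (scores.std() + eps)


# Toy example: random logits stand in for a real model's predictions.
rng = np.random.default_rng(0)
wild_type = "MKTAYIAK"
logits = rng.normal(size=(len(wild_type), len(AMINO_ACIDS)))

llr = log_odds_ratios(logits, wild_type)  # raw log-odds ratios
nlr = normalise(llr)                      # normalised scores
```

In a fine-tuning setting, scores of this kind would be compared against DMS measurements through a supervised loss. Which loss, and how the normalisation is actually defined, should be taken from the paper itself rather than from this sketch.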