Training on test proteins improves fitness, structure, and function prediction (2411.02109v1)
Abstract: Data scarcity and distribution shifts often hinder the ability of machine learning models to generalize when applied to proteins and other biological data. Self-supervised pre-training on large datasets is a common method to enhance generalization. However, striving to perform well on all possible proteins can limit a model's capacity to excel on any specific one, even though practitioners are often most interested in accurate predictions for the individual protein they study. To address this limitation, we propose an orthogonal approach to achieve generalization. Building on the prevalence of self-supervised pre-training, we introduce a method for self-supervised fine-tuning at test time, allowing models to adapt to the test protein of interest on the fly, without requiring any additional data. We study our test-time training (TTT) method through the lens of perplexity minimization and show that it consistently enhances generalization across different models, their scales, and datasets. Notably, our method leads to new state-of-the-art results on the standard benchmark for protein fitness prediction, improves protein structure prediction for challenging targets, and enhances function prediction accuracy.
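The core idea — adapting a pre-trained model to the single test protein by minimizing its perplexity on that protein's own sequence, with no extra data — can be illustrated with a deliberately minimal toy. The sketch below is not the paper's procedure (which fine-tunes a masked protein language model); it uses a hypothetical unigram amino-acid model and a simple interpolation step standing in for the gradient-based test-time update, just to show why perplexity on a distribution-shifted test sequence drops after adaptation.

```python
import math
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def fit(seqs, pseudo=1.0):
    """Estimate a unigram distribution over amino acids with add-one smoothing."""
    counts = {a: pseudo for a in AA}
    for s in seqs:
        for a in s:
            counts[a] += 1
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def perplexity(model, seq):
    """exp(mean negative log-likelihood) of the sequence under the unigram model."""
    nll = -sum(math.log(model[a]) for a in seq) / len(seq)
    return math.exp(nll)

random.seed(0)

# "Pre-training" corpus: sequences with broad amino-acid composition.
train = ["".join(random.choice(AA) for _ in range(200)) for _ in range(50)]

# Test protein with a skewed composition, i.e. a distribution shift.
test = "".join(random.choice("ACDG") for _ in range(200))

base = fit(train)

# Test-time training step: adapt the model using only the test sequence itself.
# Interpolating toward the test protein's own statistics plays the role of a
# few self-supervised gradient steps in the actual method.
alpha = 0.5
test_stats = fit([test])
adapted = {a: (1 - alpha) * base[a] + alpha * test_stats[a] for a in AA}

ppl_base = perplexity(base, test)
ppl_adapted = perplexity(adapted, test)
print(f"perplexity before TTT: {ppl_base:.2f}, after TTT: {ppl_adapted:.2f}")
```

In the toy, adaptation uses no labels and no sequences beyond the test protein itself, mirroring the self-supervised, data-free nature of the TTT step; with a real masked language model the update would instead be a handful of masked-token gradient steps on the test sequence.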