Training on test proteins improves fitness, structure, and function prediction (2411.02109v1)

Published 4 Nov 2024 in cs.LG and q-bio.BM

Abstract: Data scarcity and distribution shifts often hinder the ability of machine learning models to generalize when applied to proteins and other biological data. Self-supervised pre-training on large datasets is a common method to enhance generalization. However, striving to perform well on all possible proteins can limit a model's capacity to excel on any specific one, even though practitioners are often most interested in accurate predictions for the individual protein they study. To address this limitation, we propose an orthogonal approach to achieve generalization. Building on the prevalence of self-supervised pre-training, we introduce a method for self-supervised fine-tuning at test time, allowing models to adapt to the test protein of interest on the fly and without requiring any additional data. We study our test-time training (TTT) method through the lens of perplexity minimization and show that it consistently enhances generalization across different models, their scales, and datasets. Notably, our method leads to new state-of-the-art results on the standard benchmark for protein fitness prediction, improves protein structure prediction for challenging targets, and enhances function prediction accuracy.


Summary

  • The paper introduces Test-Time Training, a method that fine-tunes pre-trained protein models on individual proteins to boost prediction performance.
  • It employs a self-supervised masked language modeling objective to adapt the backbone model at test time, reducing perplexity on the protein of interest.
  • The technique significantly improves predictions in fitness, structure, and function, especially for proteins with limited training data.

Evaluation of Test-Time Training in Protein Prediction Models

The paper presents a novel approach to improving protein prediction tasks by applying Test-Time Training (TTT). The technique performs self-supervised fine-tuning of protein models at test time, focusing on the single protein of interest. The method aims to enhance model generalization and yields state-of-the-art results across various protein-related prediction tasks, including fitness, structure, and function.

Methodology

Traditional protein prediction models, while powerful, often struggle to be accurate for individual proteins, largely because of data scarcity and distribution shifts relative to their large training datasets. The paper proposes a complementary strategy: using TTT to adapt pre-trained protein models to a specific protein at test time, bridging the gap between broad, dataset-wide optimization and precise, protein-specific predictions.

TTT leverages the prevalent use of masked language modeling (MLM) in protein machine learning, employing it as the objective for self-supervised fine-tuning. Specifically, during TTT the backbone of the model (f) is adapted to reduce perplexity on the given protein sequence while the task-specific head (h) remains fixed, preserving task-specific priors while exploiting the improved representations learned by f.
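
As a concrete illustration of this adaptation loop, the minimal sketch below masks residues of the single test protein and updates the model with the MLM loss. It assumes an ESM-2 checkpoint served through HuggingFace transformers; the checkpoint name, masking ratio, learning rate, and step count are illustrative placeholders rather than the paper's settings, and the task-specific head h is omitted.

```python
# Minimal test-time training (TTT) sketch on a single protein.
# Assumptions: ESM-2 via HuggingFace transformers, ~15% masking, 30 steps,
# lr=1e-5 -- illustrative values, not the paper's configuration.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

name = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmForMaskedLM.from_pretrained(name)
model.train()

test_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical test sequence
batch = tokenizer(test_protein, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for step in range(30):
    input_ids = batch["input_ids"].clone()
    labels = input_ids.clone()
    # BERT-style masking of ~15% of residue positions, excluding special tokens.
    special = (input_ids == tokenizer.cls_token_id) | (input_ids == tokenizer.eos_token_id)
    mask = (torch.rand(input_ids.shape) < 0.15) & ~special
    labels[~mask] = -100                        # compute loss only on masked positions
    input_ids[mask] = tokenizer.mask_token_id   # replace masked residues with <mask>

    loss = model(input_ids=input_ids,
                 attention_mask=batch["attention_mask"],
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# A frozen task-specific head h (not shown) would then consume the adapted representations.
```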

Results

The application of TTT to various models demonstrated consistent improvement across multiple protein-related tasks:

  1. Protein Fitness Prediction: Applying TTT to models such as ESM2 and SaProt improved their performance on ProteinGym and MaveDB and set new state-of-the-art results, notably for phenotypes such as organismal fitness and binding. The gains were largest for proteins with low representation in the training data, underscoring TTT's utility under data scarcity (a minimal scoring sketch follows this list).
  2. Protein Structure Prediction: On challenging targets from CAMEO, TTT significantly improved the predictions of models such as ESMFold and ESM3, outperforming baselines based on alternative strategies such as masked prediction and chain-of-thought decoding.
  3. Protein Function Prediction: The method improved classification accuracy in tasks involving terpene synthase substrates and subcellular localization, emphasizing the broad applicability of TTT across different classification settings.
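
As an illustration of the fitness results above, the routine below is one common way to score a point mutation with the (optionally TTT-adapted) masked language model: the masked-marginal score log p(mutant) - log p(wild type) at the mutated position. This is a generic zero-shot scoring sketch, not the paper's exact protocol; the sequence and mutation are hypothetical, and model and tokenizer are assumed to be those from the earlier sketch.

```python
# Hypothetical example: scoring a single substitution with the adapted MLM.
import torch

@torch.no_grad()
def masked_marginal_score(model, tokenizer, sequence, position, wt_aa, mut_aa):
    """Score wt_aa -> mut_aa at 0-based `position` as log p(mut) - log p(wt)."""
    model.eval()
    enc = tokenizer(sequence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    input_ids[0, position + 1] = tokenizer.mask_token_id   # +1 skips the <cls> token
    logits = model(input_ids=input_ids, attention_mask=enc["attention_mask"]).logits
    log_probs = torch.log_softmax(logits[0, position + 1], dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wt_aa)
    mut_id = tokenizer.convert_tokens_to_ids(mut_aa)
    return (log_probs[mut_id] - log_probs[wt_id]).item()

# e.g. an A -> V substitution at position 3 of a hypothetical sequence:
# masked_marginal_score(model, tokenizer, "MKTAYIAKQRQISFVK", 3, "A", "V")
```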

Theoretical and Practical Implications

The paper establishes a link between minimizing perplexity on a single protein and improved downstream performance. This insight both helps explain TTT's effectiveness and informs future work applying TTT to other domains. Practically, the ability to fine-tune complex models on the fly can be invaluable in real-world settings where a specific protein of interest must be analyzed without abundant related data.
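
To make the perplexity notion concrete, the helper below computes the standard MLM pseudo-perplexity of a single protein: each position is masked in turn, the negative log-likelihood of the true residue is accumulated, and the mean is exponentiated. This is a plain O(L) forward-pass sketch under the same assumed model and tokenizer as above, not an optimized or paper-specific implementation.

```python
# Pseudo-perplexity sketch: the quantity TTT can be viewed as minimizing on the test protein.
import math
import torch

@torch.no_grad()
def pseudo_perplexity(model, tokenizer, sequence):
    model.eval()
    enc = tokenizer(sequence, return_tensors="pt")
    ids = enc["input_ids"]
    nll, count = 0.0, 0
    for i in range(1, ids.shape[1] - 1):          # skip <cls> and <eos>
        masked = ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nll -= log_probs[ids[0, i]].item()        # accumulate -log p(true residue)
        count += 1
    return math.exp(nll / count)                  # lower values indicate a better fit to the protein
```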

Future Directions

The research opens several avenues for future work, such as developing a deeper understanding of TTT's success and failure modes and extending the technique to more complex foundation models. In addition, combining TTT with related adaptation methods, such as domain adaptation and adaptive risk minimization, could further enhance the generalization and adaptation capabilities of protein models.

In summary, the paper makes a strong case for TTT as a way to sharpen machine learning predictions for individual proteins, addressing the limits of current models' generalization and charting a path toward more targeted and efficient protein analysis.