Scaling and Data Saturation in Protein Language Models (2507.22210v1)
Abstract: Data in biology is redundant, noisy, and sparse. How do the type and scale of available data impact model performance? In this work, we investigate how protein language models (pLMs) scale with increasing pretraining data by measuring protein function prediction performance across a suite of pLMs pretrained on yearly snapshots of UniRef100 from 2011 to 2024. We find no evidence of model saturation on this task: performance improves with added data, though not monotonically, and this trend differs between unsupervised and supervised experiments. Using a well-characterized Beta-Lactamase protein from E. coli, we find that unsupervised model predictions improve year over year, though they do not yet consistently outperform the supervised baseline. Our results underscore the need for targeted data acquisition and deeper study of data scaling in protein modeling. All training, inference, analysis, and visualization code is available at: https://github.com/Align-to-Innovate/data-saturation-and-scaling.
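To make the "unsupervised model predictions" concrete: a common way to score variants with a pLM, without any labels, is the masked-marginal log-likelihood ratio between the mutant and wild-type residue at a masked position. The abstract does not specify the paper's exact scoring procedure, so the sketch below is an assumption, using a small ESM-2 checkpoint from HuggingFace as a stand-in model; the sequence and substitution are placeholders for illustration.

```python
# Hedged sketch of zero-shot (unsupervised) variant-effect scoring with a
# masked protein language model. Model choice (ESM-2 via HuggingFace) and the
# masked-marginal score are assumptions, not the paper's confirmed method.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

MODEL = "facebook/esm2_t6_8M_UR50D"  # small assumed stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = EsmForMaskedLM.from_pretrained(MODEL).eval()

def masked_marginal_score(sequence: str, pos: int, wt: str, mut: str) -> float:
    """Return log p(mut) - log p(wt) at `pos` (0-based), with the site masked.

    Higher scores mean the model finds the mutant residue more plausible
    than the wild type at that position.
    """
    assert sequence[pos] == wt, "wild-type residue mismatch"
    masked = sequence[:pos] + tokenizer.mask_token + sequence[pos + 1:]
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    # +1 offset: token position 0 is the <cls> token prepended by the tokenizer.
    log_probs = torch.log_softmax(logits[pos + 1], dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mut_id = tokenizer.convert_tokens_to_ids(mut)
    return (log_probs[mut_id] - log_probs[wt_id]).item()

# Example: score a Q->K substitution at position 3 of a placeholder fragment
# (not real Beta-Lactamase numbering).
seq = "MSIQHFRVALIPFFAAFCLPVFA"
print(masked_marginal_score(seq, 3, "Q", "K"))
```

Scoring each model in the yearly-snapshot suite this way, and correlating scores against measured fitness (e.g., by Spearman rank correlation), would yield the kind of year-over-year unsupervised performance curve the abstract describes; the supervised baseline would instead be trained directly on labeled variant data.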