Data Scaling in Protein Modeling

Updated 31 July 2025
  • The paper reveals that added protein sequence data yields nonmonotonic improvements in model performance due to redundancy, noise, and diversity.
  • Data scaling is key to both unsupervised and supervised settings, with supervised fine-tuning showing more stable and pronounced gains.
  • The research underscores the need for targeted data acquisition to fill diversity gaps rather than relying solely on larger volume.

Data scaling in protein modeling refers to how the amount, composition, and diversity of training data—most commonly protein sequences—influence the capabilities and performance of protein language models (pLMs), especially as models are pretrained on ever-larger datasets. Unlike natural language data, biological sequence data is typically highly redundant, noisy, and sparse, raising the question of when, if ever, additional data ceases to yield significant improvements on downstream tasks such as protein function prediction. Major work in this area has focused on empirical studies that train pLMs on longitudinal UniRef100 snapshots and analyze their supervised and unsupervised performance on protein function prediction benchmarks, including deep mutational scanning (DMS) datasets. The following sections systematically describe the key findings and principles from recent research (Spinner et al., 29 Jul 2025).

1. Impact of Data Scaling on Model Performance

The effect of increasing pretraining data was measured by training suites of pLMs—AMPLIFY models—on yearly versions of UniRef100 (2011–2024) and evaluating zero-shot performance on DMS function prediction tasks from ProteinGym. With each yearly snapshot, the number of available protein sequences increases by tens to hundreds of millions. Despite the monotonically growing corpus, zero-shot performance (assessed by Spearman correlation between pLM log-likelihoods and experimental fitness scores) does not improve monotonically: in some years, major data increases are accompanied by stagnation or even a decline in performance, highlighting the nonuniform effect of data addition. For example, the addition of an extra billion sequences between 2018 and 2021 did not always yield higher Spearman correlations across targets, with occasional decreases observed.
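
In practice, this evaluation reduces to scoring each DMS variant with a pLM and rank-correlating the scores with measured fitness. A minimal sketch in Python is shown below; `score_sequence` stands in for whatever log-likelihood interface a given pLM exposes and is an assumption, not the paper's actual code.

```python
from typing import Callable, Sequence

from scipy.stats import spearmanr

def zero_shot_spearman(
    variants: Sequence[str],                 # mutated protein sequences
    fitness: Sequence[float],                # matching DMS fitness values
    score_sequence: Callable[[str], float],  # pLM log-likelihood scorer
) -> float:
    """Spearman correlation between pLM scores and experimental fitness."""
    scores = [score_sequence(seq) for seq in variants]
    rho, _ = spearmanr(scores, fitness)
    return rho
```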

This sensitivity is attributed to the composition of added data, notably redundancy, noise, and diversity, rather than volume per se. Sequence redundancy is a prevalent feature of biological datasets, as new sequences often belong to overrepresented protein families. The consequence is that, while more data generally trends toward better performance, the marginal benefit is variable and sometimes negative depending on which sequences are added.

2. Evidence (or Lack Thereof) for Model Saturation

Model saturation refers to the empirical or theoretical point at which additional pretraining data fails to yield further meaningful performance improvements for a given task or metric. In the studied models and tasks (Spinner et al., 29 Jul 2025), no evidence of saturation was found in protein function prediction. Performance as measured by Spearman correlation between zero-shot model predictions and experimental DMS scores continued to improve across the UniRef100 snapshots, though with occasionally nonmonotonic steps.

The continued, if fluctuating, gains indicate that pLMs have not yet reached a state where further data is fully redundant or uninformative. This holds even as UniRef100 approaches nearly 3 billion sequences, suggesting that the functional space explored by pLMs remains insufficiently covered and that additional data still contributes the complexity and diversity needed for optimal predictive performance. The lack of saturation contrasts with expectations carried over from NLP or small-molecule domains, emphasizing the distinctive characteristics of biological data.

3. Comparison of Unsupervised and Supervised Regimes

Performance trends differ between the unsupervised (pure pretraining) and supervised (downstream fine-tuning) paradigms. In the unsupervised, zero-shot setting, the AMPLIFY models’ predictions—computed as sequence log-likelihoods—show a gradual, generally positive improvement with more pretraining data. For instance, in the E. coli β-Lactamase case study, unsupervised Spearman correlation with DMS scores rose from ~0.25 (earliest snapshots) to over 0.6 (latest snapshots).

In supervised experiments, where sequence embeddings from the pretrained models are used in ridge regression with varying amounts of labeled DMS data, performance improvements are more pronounced and more stable. With just 10% of labeled data, performance rises from an average unsupervised value of ~0.38 to 0.52; with 80% labeled data, it further increases to ~0.675. Moreover, supervised models can outperform unsupervised predictions at all but the most mature pretraining snapshots. This underscores the continued importance of experimental supervision, even as pretraining corpora expand.
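
A hedged sketch of this supervised protocol follows: ridge regression on frozen pLM embeddings, trained on a fraction of the labeled DMS data. The precomputed embedding matrix and the regularization strength are assumptions for illustration.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def supervised_spearman(X: np.ndarray, y: np.ndarray,
                        label_fraction: float, alpha: float = 1.0) -> float:
    """Fit ridge regression on a labeled subset; evaluate on the rest."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=label_fraction, random_state=0)
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rho, _ = spearmanr(model.predict(X_test), y_test)
    return rho
```

Comparing `supervised_spearman(X, y, 0.1)` against `supervised_spearman(X, y, 0.8)` reproduces the kind of 10% versus 80% label-fraction contrast described above.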

4. Case Study: β-Lactamase Deep Mutational Scanning

The β-Lactamase protein from E. coli, with its four DMS datasets collected between 2012 and 2015, serves as a high-quality testbed for quantifying year-over-year improvements under data scaling. In the unsupervised scenario, the model’s functional predictions show a steady rise each year, closely tracking the expansion and diversification of the reference database. A supervised variant using ridge regression achieves high, stable performance almost regardless of the pretraining snapshot. The combination of DMS datasets in supervised training further boosts and stabilizes prediction accuracy. This demonstrates the complementary effects of pretraining data scale and targeted experimental supervision.

5. Role of Data Quality: Redundancy, Noise, and Diversity

Fluctuations in performance with added data are not attributable to random variation but reflect underlying issues in the nature of biological data. Much of the sequence expansion in databases like UniRef100 is redundant: large numbers of similar or nearly identical sequences are added for already well-characterized protein families. Noisy data—incorrectly sequenced, misannotated, or fragmented entries—can introduce irrelevant or even misleading signals. Diversity, particularly in underrepresented protein classes or evolutionary clades, is more likely to provide new, informative functional constraints.
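
To make the redundancy point concrete, the toy sketch below (not from the paper) greedily filters near-duplicates by k-mer Jaccard similarity; production pipelines typically use dedicated tools such as MMseqs2 or CD-HIT instead.

```python
def kmers(seq: str, k: int = 5) -> set:
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_dedup(sequences, max_jaccard: float = 0.8, k: int = 5):
    """Keep a sequence only if it is dissimilar to everything kept so far."""
    kept, kept_kmers = [], []
    for seq in sequences:
        km = kmers(seq, k)
        if not km:  # skip fragments shorter than k residues
            continue
        if all(len(km & other) / len(km | other) <= max_jaccard
               for other in kept_kmers):
            kept.append(seq)
            kept_kmers.append(km)
    return kept
```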

The nonmonotonic improvement trend suggests that data composition, not only corpus size, must be critically considered. This finding motivates a shift toward “targeted data acquisition” where filling biological sequence diversity gaps outperforms mere aggregate volume increases (Spinner et al., 29 Jul 2025). Such a strategy is consistent with lessons in other computational biology domains where coverage of novel sequence or structure space is more beneficial than additional sampling of oversaturated groups.

6. Targeted Data Acquisition and Future Directions

One of the principal conclusions is the need to move beyond indiscriminate bulk data collection. The observed scaling behavior suggests that continued investments in more intelligently curated and diverse data—prioritizing underrepresented protein folds, families, or functions—should yield greater marginal improvements than expansions limited to already well-represented sequence space.
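
One simple way to operationalize such targeted acquisition (purely illustrative, not the paper's method) is greedy farthest-point selection in an embedding space: each newly acquired sequence is the candidate farthest from everything already covered, so every addition fills a diversity gap rather than re-sampling dense regions.

```python
import numpy as np

def select_diverse(candidates: np.ndarray, corpus: np.ndarray,
                   n_select: int) -> list:
    """Pick candidate indices that maximize coverage of embedding space."""
    # distance from each candidate to its nearest already-covered point
    dist = np.linalg.norm(
        candidates[:, None, :] - corpus[None, :, :], axis=-1).min(axis=1)
    selected = []
    for _ in range(n_select):
        i = int(np.argmax(dist))  # farthest from current coverage
        selected.append(i)
        # the newly selected point now also counts as covered
        dist = np.minimum(
            dist, np.linalg.norm(candidates - candidates[i], axis=-1))
    return selected
```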

The paper recommends further directions, such as clustering-based data splits, exploring other pLM architectures or training paradigms, and investigating transfer learning between different experimental datasets. There is also an emphasis on systematically evaluating the impact of data redundancy and quality, which can be assisted by metrics quantifying sequence diversity, coverage of functional space, and removal of low-quality or ambiguous entries.
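
As a sketch of the clustering-based split idea (KMeans over embeddings is an assumed, illustrative choice, not the paper's procedure), whole clusters can be assigned to train or test so that near-duplicate sequences never straddle the split:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_split(embeddings: np.ndarray, test_fraction: float = 0.2,
                  n_clusters: int = 50, seed: int = 0):
    """Assign whole clusters to train or test; return index arrays."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_clusters)
    test_clusters = set(shuffled[:int(n_clusters * test_fraction)].tolist())
    test_mask = np.isin(labels, list(test_clusters))
    return np.where(~test_mask)[0], np.where(test_mask)[0]
```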

7. Technical Metrics and Quantitative Results

Zero-shot fitness prediction is operationalized as the log-likelihood $L$ of a protein sequence $x$ of length $T$:

$$L = \sum_{i=1}^{T} \log p(x_i \mid x_1, \ldots, x_{i-1})$$

where $p(\cdot)$ denotes the model's softmax-normalized output distribution.
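
A direct transcription of this quantity for an autoregressive pLM might look as follows in PyTorch; the `model` interface (token ids in, next-token logits out) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_log_likelihood(model, token_ids: torch.Tensor) -> float:
    """L = sum_i log p(x_i | x_1..x_{i-1}) for a 1-D tensor of token ids."""
    logits = model(token_ids[:-1])             # next-token logits per prefix
    log_probs = F.log_softmax(logits, dim=-1)  # normalize to log-probs
    # log-probability the model assigned to each actual next token
    ll = log_probs[torch.arange(len(token_ids) - 1), token_ids[1:]]
    return ll.sum().item()
```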

Supervised regression employs ridge regression on sequence embeddings $X$ and experimental fitness targets $y$:

$$\hat{\beta} = \arg\min_{\beta} \|X\beta - y\|^2 + \lambda\|\beta\|^2$$

where $\lambda$ is a regularization parameter.
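
This objective has the familiar closed-form solution $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$, which a few lines of NumPy make explicit:

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge solution: (X^T X + lambda I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```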

Model evaluations use the Spearman correlation coefficient between model outputs (log-likelihoods or regression predictions) and DMS-measured fitness.

| Scenario | Effect of more data | Monotonic gains? | Saturation observed? | Key additional influence |
|---|---|---|---|---|
| Unsupervised (zero-shot) | Gradual, sometimes nonmonotonic rise | No | No | Sensitive to data composition |
| Supervised (with labels) | Pronounced, stable improvement | Generally yes | No | Even limited labels yield large benefits |
| Function prediction (e.g., β-Lactamase) | Year-over-year improvement; unsupervised slower than supervised | No | No | Combining datasets stabilizes performance |

Differences in data quality (redundancy, noise, diversity) are a major determinant of whether scaling provides clear benefit or instead a plateau or noise-induced stagnation. No modeling regime studied to date has exhibited unequivocal saturation in sequence modeling for protein function prediction (Spinner et al., 29 Jul 2025).

Concluding Perspective

Data scaling in protein language models for function prediction reveals that while increased data drives progress, the effect is both nonmonotonic and strongly modulated by data composition. Saturation on function tasks has not been reached, and distinct benefits persist for supervised approaches, even with moderate amounts of labeled data. These findings underscore the importance of targeted data acquisition, balancing unsupervised scale with the continued integration of experimental measurements, and deeper systematic study of scaling phenomena in protein modeling.

References
  • Spinner et al., 29 Jul 2025.