ProteinGym: Protein Fitness Benchmark

Updated 17 November 2025

ProteinGym is a benchmark suite that curates and normalizes over 2.5 million variant measurements from deep mutational scanning experiments.
It supports diverse modeling approaches—including sequence, structure, and multimodal methods—and evaluates models using metrics like Spearman’s rank correlation.
Its stratified protocols and public datasets drive advances in protein representation learning, zero-shot prediction, and generative design.

ProteinGym is a standardized, large-scale benchmark suite for evaluating protein fitness prediction models, both unsupervised and supervised, against deep mutational scanning (DMS) data spanning a wide range of protein families, variant types, and biological functions. By curating and normalizing millions of experimental measurements of protein sequence variation (substitutions, insertions, deletions), ProteinGym enables rigorous, stratified assessment across function modality, taxonomic origin, and mutation complexity. Its public datasets, established protocols, and benchmarking conventions have catalyzed methodological advances in protein representation learning, zero-shot prediction, variant effect analysis, and generative design.

1. Composition and Dataset Structure

ProteinGym consolidates >2.5 million variant measurements from deep mutational scanning experiments encompassing 217 substitution-focused assays and 66 indel (insertion/deletion) assays. These assays cover five principal functional readouts: organismal fitness, enzymatic activity, stability, binding affinity, and expression levels; each assay provides fitness proxies for a wild-type sequence $S^{wt}$ and an exhaustive set of mutant variants $S^{mt}$ (Notin et al., 2022, Fan et al., 8 Oct 2025). Sequence coverage spans 200+ UniProt IDs, representing human, eukaryotic, prokaryotic, and viral proteins.

Variant types tracked include single-residue substitutions, multi-residue substitutions (double, triple, and higher-order), and single/double/triple indels. ProteinGym’s taxonomic coverage includes humans (33 proteins), prokaryotes, other eukaryotes, and viruses, supporting stratified benchmarking by phylogeny. Protein-specific multiple sequence alignments (MSAs; jackhmmer/UniRef100), AlphaFold2-predicted structures, and full experimental measurement tables are provided, enabling modality-specific model evaluation.

2. Data Acquisition, Curation, and Normalization

The suite was assembled via comprehensive literature and database mining of >130 published DMS studies, filtering for sufficient dynamic range, target specificity (proteins only), and publicly available raw data (Notin et al., 2022). Silent variants are omitted and duplicate measurements are averaged. For ROC/AUC calculations, fitness scores are binarized with a threshold set manually where the score distribution is bimodal and at the median otherwise. For MSAs, columns with >30% gaps and sequences with >50% gaps are pruned; experimental measurements are filtered correspondingly so that they stay aligned with the retained positions.
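The MSA pruning rule above can be sketched in a few lines. This is a minimal illustration, not ProteinGym's own pipeline: the function name `filter_msa`, the choice to drop sequences before columns, and the treatment of both `-` and `.` as gap characters are assumptions made for the example.

```python
def filter_msa(seqs, max_seq_gap=0.5, max_col_gap=0.3, gap_chars="-."):
    """Drop gappy sequences, then gappy columns.

    Returns (filtered_sequences, kept_column_indices). The order of the two
    steps is an assumption; the thresholds are the ones quoted in the text.
    """
    if not seqs:
        return [], []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    def gap_fraction(chars):
        return sum(c in gap_chars for c in chars) / len(chars)

    # Step 1: remove sequences with more than 50% gaps.
    kept = [s for s in seqs if gap_fraction(s) <= max_seq_gap]
    if not kept:
        return [], []

    # Step 2: remove columns with more than 30% gaps among the kept sequences.
    cols = [j for j in range(length)
            if gap_fraction([s[j] for s in kept]) <= max_col_gap]
    return ["".join(s[j] for j in cols) for s in kept], cols


msa = ["ACDEF", "ACDE-", "AC-EF", "----F"]
filtered, kept_cols = filter_msa(msa)
```

The returned column indices can then be used to drop experimental variants whose mutated position falls in a pruned column, which keeps measurements and alignment in register.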

Variants are scored at the UniProt-ID level, with averages calculated per assay, per protein, per functional class, and per taxon. Stratification by MSA depth (Low: $N_{eff}/L < 1$, Medium: $1 \le N_{eff}/L \le 100$, High: $N_{eff}/L > 100$), mutation depth (single, double, triple), and sequence similarity enables granular performance breakdowns and out-of-distribution testing.
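As a rough sketch of the stratification and averaging described above, the following assigns an MSA-depth bin from $N_{eff}/L$ and averages assay-level scores first per UniProt ID and then per group. The record fields (`uniprot_id`, `function`, `spearman`) and both function names are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from statistics import mean


def msa_depth_bin(neff_over_l):
    """Bin N_eff/L with the thresholds quoted above: <1, 1-100, >100."""
    if neff_over_l < 1:
        return "Low"
    if neff_over_l <= 100:
        return "Medium"
    return "High"


def hierarchical_mean(assays, group_key):
    """Average assay scores per protein, then average proteins per group.

    Averaging per UniProt ID first keeps proteins with many assays from
    dominating the group-level number.
    """
    per_protein = defaultdict(list)
    for a in assays:
        per_protein[(a[group_key], a["uniprot_id"])].append(a["spearman"])
    per_group = defaultdict(list)
    for (group, _uid), scores in per_protein.items():
        per_group[group].append(mean(scores))
    return {group: mean(vals) for group, vals in per_group.items()}


# Toy assay-level results (made-up numbers for illustration only).
assays = [
    {"uniprot_id": "P1", "function": "Stability", "spearman": 0.4},
    {"uniprot_id": "P1", "function": "Stability", "spearman": 0.6},
    {"uniprot_id": "P2", "function": "Stability", "spearman": 0.2},
    {"uniprot_id": "P3", "function": "Binding", "spearman": 0.3},
]
by_function = hierarchical_mean(assays, "function")
```

In the toy data, P1's two assays collapse to one protein-level value before being averaged with P2, so the Stability mean weights both proteins equally. Adding a `depth` field computed with `msa_depth_bin` and passing `"depth"` as `group_key` gives the MSA-depth breakdown in the same way.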

3. Evaluation Protocols and Benchmarking Metrics

For zero-shot evaluation, ProteinGym standardizes model comparison by requiring that models are not fine-tuned or supervised on the assay data used for evaluation (Fan et al., 8 Oct 2025, Kantroo et al., 2024). Fitness scoring for a model parameterized by $\theta$ typically adopts one of two conventions:

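Likelihood Ratio (Autoregressive): $\hat F(S^{mt},S^{wt}) = \log \frac{P_\theta(S^{mt})}{P_\theta(S^{wt})}$
Log-Odds (Masked-LM): $\hat F(S^{mt},S^{wt}) = \sum_{i \in \mathcal{M}} \left[\log P_\theta(s_i^{mt} \mid S_{\setminus \mathcal{M}}) - \log P_\theta(s_i^{wt} \mid S_{\setminus \mathcal{M}})\right]$, where $\mathcal{M}$ is the set of mutated positions and $S_{\setminus \mathcal{M}}$ is the sequence with those positions masked.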
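The two scoring conventions can be illustrated with toy stand-ins for the underlying models. In the sketch below, `masked_log_probs` and `sequence_log_prob` are hypothetical callbacks standing in for a masked language model and an autoregressive model; the `PROFILE` and `UNIGRAM` tables are made-up probabilities, not outputs of any real model.

```python
import math

MASK = "<mask>"  # placeholder token; real tokenizers define their own


def log_odds_score(wt, mt, masked_log_probs):
    """Masked-LM log-odds: mask every mutated site, then sum
    log P(mutant residue) - log P(wild-type residue) over those sites."""
    if len(wt) != len(mt):
        raise ValueError("log-odds scoring assumes substitutions only")
    sites = [i for i, (a, b) in enumerate(zip(wt, mt)) if a != b]
    context = [MASK if i in sites else aa for i, aa in enumerate(wt)]
    total = 0.0
    for i in sites:
        log_p = masked_log_probs(context, i)  # dict: residue -> log-probability
        total += log_p[mt[i]] - log_p[wt[i]]
    return total


def likelihood_ratio_score(wt, mt, sequence_log_prob):
    """Autoregressive convention: log P(mutant) - log P(wild type)."""
    return sequence_log_prob(mt) - sequence_log_prob(wt)


# Toy stand-ins for real models, for illustration only.
PROFILE = [{"A": 0.9, "C": 0.1}, {"A": 0.2, "C": 0.8}]
UNIGRAM = {"A": 0.5, "C": 0.5}


def toy_masked_log_probs(context, i):
    # Ignores the context; a real masked LM would condition on it.
    return {aa: math.log(p) for aa, p in PROFILE[i].items()}


def toy_sequence_log_prob(seq):
    # Site-independent unigram model; a real model would be autoregressive.
    return sum(math.log(UNIGRAM[aa]) for aa in seq)


sub_score = log_odds_score("AC", "AA", toy_masked_log_probs)
indel_score = likelihood_ratio_score("AC", "ACA", toy_sequence_log_prob)
```

Because the likelihood ratio compares whole-sequence probabilities, it applies to mutants of a different length than the wild type, whereas the per-site log-odds form requires aligned positions.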

The primary performance metric is Spearman’s rank correlation ($\rho$) between predicted and experimental fitness, adjusted so that higher scores consistently reflect improved fitness. Secondary metrics include:

AUC: Area under the ROC curve for binary beneficial/deleterious classification (binarized by median/manual threshold).
Matthews Correlation Coefficient (MCC): Assessed on binarized labels.
Normalized Discounted Cumulative Gain (NDCG@k): Evaluates top-$k$ ranking accuracy.
Top-10% Recall: Measures the fraction of true top-10% experimental variants among the predicted top-10% (Sharma et al., 23 Apr 2025).
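Several of the metrics above can be written compactly in plain Python. The sketch below covers Spearman's $\rho$, median binarization, ROC AUC, MCC, and top-fraction recall. It is an illustrative implementation rather than the benchmark's reference code; all function names are hypothetical, and binarizing predictions at their own median for MCC is an assumption. NDCG@k is left out because the gain definition is not specified in the text above.

```python
import math
from statistics import median


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            out[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return out


def spearman(pred, true):
    """Pearson correlation of the two rank vectors (undefined for constant input)."""
    rp, rt = ranks(pred), ranks(true)
    mp, mt = sum(rp) / len(rp), sum(rt) / len(rt)
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / math.sqrt(var_p * var_t)


def binarize(values, threshold=None):
    """1 if strictly above the threshold (median by default), else 0."""
    cut = median(values) if threshold is None else threshold
    return [1 if v > cut else 0 for v in values]


def roc_auc(pred, labels):
    """AUC via the rank-sum (Mann-Whitney U) identity; needs both classes present."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks(pred), labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mcc(pred_labels, labels):
    """Matthews correlation coefficient on 0/1 labels; 0.0 if undefined."""
    pairs = list(zip(pred_labels, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    tn = sum(p == 0 and y == 0 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def top_fraction_recall(pred, true, fraction=0.1):
    """Share of the true top fraction that also appears in the predicted top fraction."""
    k = max(1, int(len(true) * fraction))

    def top(values):
        return set(sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k])

    return len(top(pred) & top(true)) / k


# Toy example (made-up numbers).
pred = [0.1, 0.4, 0.35, 0.8]
true = [1.0, 2.0, 3.0, 4.0]
labels = binarize(true)  # median threshold
```

On real assays, each metric would be computed per assay and then aggregated per protein, as described in the protocol.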

Benchmarked protocols specify variant preprocessing, aggregation over proteins, and stratified reporting, allowing direct, reproducible head-to-head comparisons.

4. Modalities and Model Families Benchmarked

ProteinGym supports sequence-based, structure-based, and multi-modal/ensemble methods:

Sequence/Likelihood Models: ESM (PLM family, 650M–15B parameters), Tranception, GEMME, VESPA, DeepSequence. Masked language models (MLMs) are scored with pseudo-perplexity and log-odds (e.g., –PPPL, NLR).
Structure-Based Models: ESM-IF1 (inverse folding), ProteinMPNN, S2F, S3F, ProtSSN, SaProt, SSEmb; typically leverage AlphaFold2-predicted monomer structures.
Ensembles/Multi-modal: TranceptEVE, Metalic (meta-learning over context), EvoIF (profile fusion), combining sequence, structure, MSA, and evolutionary signals.
Diffusion Indel Models: SCISOR, a discrete-time diffusion model whose forward process only inserts residues and whose reverse process plans deletions (Baron et al., 10 Nov 2025).
Matrix VAEs: matVAE, integrating sequence, structural priors, and supervised DMS fitting (Honoré et al., 3 Jul 2025).

Specialized architectures are evaluated for robustness across shallow MSAs, multi-mutant/epistatic landscapes, and IDR (intrinsically disordered region)-rich sequences.

5. Empirical Performance and Methodological Impact

ProteinGym has framed the current state of the art for protein fitness prediction. Zero-shot sequence-only baselines (ESM-2 650M) achieve Spearman $\rho \approx 0.414$; sequence+structure hybrids (S3F) reach $\rho = 0.470$; ensembles (TranceptEVE, EvoIF-MSA) set the top marks, up to $\rho = 0.518$, without further training or additional labels (Fan et al., 8 Oct 2025, Zhang et al., 2024). On indel prediction, SCISOR attains $\rho \approx 0.573$, exceeding prior generative and autoregressive baselines (Baron et al., 10 Nov 2025).

Function-type stratification reveals that structure-aware methods notably excel on stability assays (e.g., ESM-IF1 stability $\rho = 0.624$), while multi-modal ensembles outperform sequence-only or structure-only models across binding, expression, and organismal fitness. DMS-based fine-tuning (NLR, matENC-DMS + AF) yields double-digit relative gains over zero-shot models without increasing model complexity (Lafita et al., 2024, Honoré et al., 3 Jul 2025).

Table: Example Spearman $\rho$ Across Model Families (Substitution and Indel Benchmarks)

Model/Modality	Mean $\rho$	Stability	Binding
ESM2 OFS-PP (Indels)	0.574	0.582	≈0.53
S3F (Seq+Str+Surf)	0.470	–	–
EvoIF-MSA (Ensemble)	0.518	–	–
SCISOR (Indels)	0.573	–	–
ESM-1v NLR-tuned	0.396	–	–

6. Practical Guidelines and Benchmarking Conventions

ProteinGym codifies protocol best practices for reproducible benchmarking:

Use UniProt-ID as the aggregation level; average per protein, not per individual variant, to avoid overweighting large-assay proteins.
Preprocess as specified (no silent variants, duplicates averaged, variants without measurement dropped).
Stratify evaluations by MSA depth, mutation depth, and taxonomic kingdom to expose performance dependencies.
For sequence retrieval methods, run jackhmmer over UniRef100 and filter hits by coverage and identity (≤20% sequence identity is recommended for out-of-sample assessment).
For structure-based evaluation, prefer AlphaFold2 monomer predictions unless experimental structures in a confirmed functional state are available; mask or exclude low-confidence (low-pLDDT) regions.
Report both zero-shot and fine-tuned metrics, keeping hyperparameters (e.g., hybrid weights α) consistent across evaluations.
Modality fusion (sequence/structure/MSA), ensemble aggregation, and model ablations are encouraged to show which components drive robustness.
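The preprocessing guideline above (no silent variants, duplicates averaged, variants without measurement dropped) might look like the following in practice. This is a sketch under assumptions: variants are written as colon-separated substitutions such as `A12G:C15T`, a variant counts as silent only if every listed substitution leaves the residue unchanged, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def canonical(mutant):
    """Sort a multi-mutant's substitutions by position so duplicates match."""
    parts = mutant.split(":")
    return ":".join(sorted(parts, key=lambda m: int(m[1:-1])))


def is_silent(mutant):
    """True if every substitution keeps the residue unchanged (e.g. 'A12A')."""
    return all(m[0] == m[-1] for m in mutant.split(":"))


def preprocess(records):
    """records: iterable of (mutant, score) pairs. Returns {mutant: mean score}."""
    grouped = defaultdict(list)
    for mutant, score in records:
        if score is None or math.isnan(score):
            continue  # drop variants without a measurement
        if is_silent(mutant):
            continue  # drop silent variants
        grouped[canonical(mutant)].append(score)
    # Average duplicate measurements of the same variant.
    return {m: mean(scores) for m, scores in grouped.items()}


# Toy records (made-up values).
records = [
    ("A12G", 1.0),
    ("A12G", 3.0),           # duplicate, averaged with the line above
    ("A12A", 0.5),           # silent, dropped
    ("C15T", None),          # no measurement, dropped
    ("C15T:A12G", 2.0),
    ("A12G:C15T", 4.0),      # same double mutant in a different order
    ("D20E", float("nan")),  # no measurement, dropped
]
clean = preprocess(records)
```

Canonicalizing the order of substitutions before grouping is a design choice made here so that the same multi-mutant reported in two orders is treated as a duplicate.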

7. Limitations, Extensions, and Prospective Directions

ProteinGym’s reference data favor substitution and single-site mutation landscapes; multi-mutant and indel assays remain underrepresented relative to biological diversity. Coverage gaps occur for proteins lacking complete structural, MSA, or DMS annotation, and structure-based methods lose predictive accuracy on IDRs in particular. Automated structure-to-assay matching may select a structure in the wrong functional state (e.g., active versus inactive conformations), warranting manual curation or metadata harmonization (Sharma et al., 23 Apr 2025, Fan et al., 8 Oct 2025).

Emerging research signals several extension opportunities:

DMS datasets are poised to supplant MSAs for many variant effect prediction tasks, especially under low evolutionary pressure scenarios (e.g., pharmacogenomics) (Honoré et al., 3 Jul 2025).
Hybrid models integrating unsupervised sequence/structure learning, supervised DMS fitting, and explicit geometric priors are achieving parity with expensive large models.
Diffusion-based generative approaches (e.g., SCISOR) address indel effect estimation and open efficient pathways for rational protein shrinkage and motif preservation.
The benchmark motivates new architectures able to handle conformational heterogeneity in IDRs, epistatic interactions, and task-driven generative design under function-specific constraints.

A plausible implication is that standardized, high-diversity benchmarks such as ProteinGym will increasingly define progress in generalizable protein fitness modeling and functional sequence design, aligning predictive architectures with empirical discovery workflows.

References (8)

Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval (2022)

Evolutionary Profiles for Protein Fitness Prediction (2025)

Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation (2024)

Exploring zero-shot structure-based protein fitness prediction (2025)

A Diffusion Model to Shrink Proteins While Maintaining Their Function (2025)

A Matrix Variational Auto-Encoder for Variant Effect Prediction in Pharmacogenes (2025)

Multi-Scale Representation Learning for Protein Fitness Prediction (2024)

Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction (2024)
