ProteinGym: Protein Fitness Benchmark
- ProteinGym is a benchmark suite that curates and normalizes over 2.5 million variant measurements from deep mutational scanning experiments.
- It supports diverse modeling approaches—including sequence, structure, and multimodal methods—and evaluates models using metrics like Spearman’s rank correlation.
- Its stratified protocols and public datasets drive advances in protein representation learning, zero-shot prediction, and generative design.
ProteinGym is a standardized, large-scale benchmark suite for evaluating protein fitness prediction models, both unsupervised and supervised, against deep mutational scanning (DMS) data spanning a wide range of protein families, variant types, and biological functions. By curating and normalizing millions of experimental measurements of protein sequence variation (substitutions, insertions, deletions), ProteinGym enables rigorous, stratified assessment across function modality, taxonomic origin, and mutation complexity. Its public datasets, established protocols, and benchmarking conventions have catalyzed methodological advances in protein representation learning, zero-shot prediction, variant effect analysis, and generative design.
1. Composition and Dataset Structure
ProteinGym consolidates >2.5 million variant measurements from deep mutational scanning experiments encompassing 217 substitution-focused assays and 66 indel (insertion/deletion) assays. These assays cover five principal functional readouts: organismal fitness, enzymatic activity, stability, binding affinity, and expression levels; each assay provides fitness proxies for a wild-type sequence and an exhaustive set of mutant variants (Notin et al., 2022, Fan et al., 8 Oct 2025). Sequence coverage spans 200+ UniProt IDs, representing human, eukaryotic, prokaryotic, and viral proteins.
Variant types tracked include single-residue substitutions, multi-residue substitutions (double, triple, and higher-order), and single/double/triple indels. ProteinGym’s taxonomic coverage includes humans (33 proteins), prokaryotes, other eukaryotes, and viruses, supporting stratified benchmarking by phylogeny. Protein-specific multiple sequence alignments (MSAs; jackhmmer/UniRef100), AlphaFold2-predicted structures, and full experimental measurement tables are provided, enabling modality-specific model evaluation.
2. Data Acquisition, Curation, and Normalization
The suite was assembled via comprehensive literature and database mining of >130 published DMS studies, filtering for sufficient dynamic range, target specificity (proteins only), and publicly available raw data (Notin et al., 2022). Silent variants are omitted and duplicates are averaged. For ROC/AUC calculations, scores are binarized with a threshold set manually (for example, between the modes of a bimodal score distribution) or at the median. For MSAs, columns with >30 % gaps and sequences with >50 % gaps are pruned; experimental measurements are filtered correspondingly so that they stay aligned with the retained positions.
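The MSA gap pruning described above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function names, the choice to filter sequences before columns, and the rule that the query (first) sequence is always kept are assumptions made for the example.

```python
GAP_CHARS = {"-", "."}


def gap_fraction(chars):
    """Fraction of gap characters in an iterable of alignment characters."""
    chars = list(chars)
    return sum(c in GAP_CHARS for c in chars) / len(chars)


def prune_msa(seqs, max_seq_gap=0.5, max_col_gap=0.3):
    """Prune an alignment given as equal-length strings; seqs[0] is the query.

    Assumed order: drop gappy sequences first, then gappy columns.
    Returns the pruned sequences and the indices of the retained columns.
    """
    if not seqs:
        return [], []
    # 1) Drop sequences with more than max_seq_gap gaps (query always kept).
    kept = [seqs[0]] + [s for s in seqs[1:] if gap_fraction(s) <= max_seq_gap]
    # 2) Drop columns with more than max_col_gap gaps among the kept sequences.
    n_cols = len(kept[0])
    kept_cols = [
        j for j in range(n_cols)
        if gap_fraction(s[j] for s in kept) <= max_col_gap
    ]
    pruned = ["".join(s[j] for j in kept_cols) for s in kept]
    return pruned, kept_cols


msa = ["MKTAY", "MK-AY", "M---Y", "----Y", "MKT-Y"]
pruned, kept_cols = prune_msa(msa)
```

The returned `kept_cols` list is what allows experimental measurements to be filtered to the same positions: a variant whose mutated site is not in `kept_cols` would be dropped from the evaluation table.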
Variants are scored at the UniProt-ID level, with averages calculated per assay, per protein, per functional class, and per taxon. Stratification by MSA depth (Low: $N_{\text{eff}}/L < 1$, Medium: $1 \le N_{\text{eff}}/L \le 100$, High: $N_{\text{eff}}/L > 100$), mutation depth (single, double, triple), and sequence similarity enables granular performance breakdowns and out-of-distribution testing.
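A small sketch of the stratification and hierarchical averaging may help make this concrete. The record layout (`uniprot_id`, `function`, `spearman`) is hypothetical, and the depth cut-offs assume the Low/Medium/High bins are defined on $N_{\text{eff}}/L$ at 1 and 100.

```python
from collections import defaultdict
from statistics import mean


def depth_bin(neff_over_l):
    """Assign an MSA depth bin (assumed cut-offs: Neff/L at 1 and 100)."""
    if neff_over_l < 1:
        return "Low"
    if neff_over_l <= 100:
        return "Medium"
    return "High"


def aggregate(assays):
    """Average per assay -> per protein -> per functional class -> overall.

    assays: list of dicts with keys 'uniprot_id', 'function', 'spearman'
    (one dict per assay; the keys are illustrative).
    """
    per_protein = defaultdict(list)
    for a in assays:
        per_protein[(a["function"], a["uniprot_id"])].append(a["spearman"])
    per_function = defaultdict(list)
    for (function, _), values in per_protein.items():
        # Each protein contributes one number, however many assays it has.
        per_function[function].append(mean(values))
    function_means = {f: mean(v) for f, v in per_function.items()}
    overall = mean(list(function_means.values()))
    return function_means, overall


assays = [
    {"uniprot_id": "P1", "function": "Stability", "spearman": 0.75},
    {"uniprot_id": "P1", "function": "Stability", "spearman": 0.25},
    {"uniprot_id": "P2", "function": "Stability", "spearman": 0.25},
    {"uniprot_id": "P3", "function": "Binding", "spearman": 0.125},
]
function_means, overall = aggregate(assays)
```

Averaging at the protein level before averaging across functional classes keeps a protein with many assays from dominating the summary score.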
3. Evaluation Protocols and Benchmarking Metrics
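ProteinGym standardizes model comparison by enforcing a strict zero-shot setting: models are not fine-tuned or supervised on the assay data used for evaluation (Fan et al., 8 Oct 2025, Kantroo et al., 9 Jul 2024). Fitness scoring for a model parameterized by $\theta$ typically adopts one of two conventions, written here in their standard form for a wild-type sequence $x^{\text{wt}}$, a mutant $x^{\text{mut}}$, and the set $M$ of mutated positions:
- Likelihood Ratio (Autoregressive): $s_\theta(x^{\text{mut}}) = \log p_\theta(x^{\text{mut}}) - \log p_\theta(x^{\text{wt}})$, where $\log p_\theta(x) = \sum_i \log p_\theta(x_i \mid x_{<i})$.
- Log-Odds (Masked-LM): $s_\theta(x^{\text{mut}}) = \sum_{i \in M} \left[ \log p_\theta(x_i = x_i^{\text{mut}} \mid x_{\setminus M}) - \log p_\theta(x_i = x_i^{\text{wt}} \mid x_{\setminus M}) \right]$.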
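The two scoring conventions can be illustrated with a toy model. In this sketch the likelihood ratio is taken as the log-probability of the mutant minus that of the wild type, and the masked-LM log-odds as the sum over mutated positions of the log-probability difference between mutant and wild-type residues. The callables `seq_log_prob` and `masked_log_probs` are hypothetical stand-ins for a real model, and the toy residue frequencies are invented for the example.

```python
import math


def likelihood_ratio_score(seq_log_prob, wt, mut):
    """Autoregressive convention: log p(mut) - log p(wt) over full sequences."""
    return seq_log_prob(mut) - seq_log_prob(wt)


def masked_log_odds_score(masked_log_probs, wt, mut):
    """Masked-LM convention for substitutions (wt and mut have equal length)."""
    if len(wt) != len(mut):
        raise ValueError("masked log-odds scoring assumes substitutions only")
    positions = [i for i, (a, b) in enumerate(zip(wt, mut)) if a != b]
    # Hypothetical interface: mask `positions` in wt, return {pos: {aa: log p}}.
    log_p = masked_log_probs(wt, positions)
    return sum(log_p[i][mut[i]] - log_p[i][wt[i]] for i in positions)


# Toy stand-in for a real model: independent residue frequencies (invented).
TOY_FREQS = {"A": 0.5, "G": 0.3, "V": 0.2}


def toy_seq_log_prob(seq):
    return sum(math.log(TOY_FREQS[c]) for c in seq)


def toy_masked_log_probs(seq, positions):
    table = {aa: math.log(p) for aa, p in TOY_FREQS.items()}
    return {i: table for i in positions}


wt, mut = "AAG", "AVG"
lr = likelihood_ratio_score(toy_seq_log_prob, wt, mut)
lo = masked_log_odds_score(toy_masked_log_probs, wt, mut)
```

Under the toy model, where positions are independent, both conventions reduce to the same value, log(0.2/0.5). With a real language model they generally differ, because the autoregressive score accounts for context effects downstream of the mutation.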
The primary performance metric is Spearman’s rank correlation ($\rho$) between predicted and experimental fitness, adjusted so that higher scores consistently reflect improved fitness. Secondary metrics include:
- AUC: Area under the ROC curve for binary beneficial/deleterious classification (binarized by median/manual threshold).
- Matthews Correlation Coefficient (MCC): Assessed on binarized labels.
- Normalized Discounted Cumulative Gain (NDCG@k): Evaluates top-$k$ ranking accuracy.
- Top-10 % Recall: Measures the fraction of true top-10 % experimental variants among the predicted top-10 % (Sharma et al., 23 Apr 2025).
Benchmarked protocols specify variant preprocessing, aggregation over proteins, and stratified reporting, allowing direct, reproducible head-to-head comparisons.
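Four of the metrics listed above can be sketched in plain Python as follows. This is an illustrative implementation rather than the benchmark's reference code: the function names are invented, ties are handled with average ranks, and AUC is computed through the rank-sum (Mann-Whitney) identity.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(pred, true):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(pred), average_ranks(true))


def auc(pred, labels):
    """ROC AUC via rank sums; labels are 0/1 with both classes present."""
    ranks = average_ranks(pred)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, l in zip(ranks, labels) if l)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mcc(pred_labels, labels):
    """Matthews correlation coefficient on 0/1 labels (0.0 if undefined)."""
    tp = sum(1 for p, l in zip(pred_labels, labels) if p and l)
    tn = sum(1 for p, l in zip(pred_labels, labels) if not p and not l)
    fp = sum(1 for p, l in zip(pred_labels, labels) if p and not l)
    fn = sum(1 for p, l in zip(pred_labels, labels) if not p and l)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def top_fraction_recall(pred, true, fraction=0.1):
    """Share of the true top-`fraction` variants found in the predicted top."""
    k = max(1, round(len(true) * fraction))
    idx = range(len(true))
    top_true = set(sorted(idx, key=lambda i: true[i], reverse=True)[:k])
    top_pred = set(sorted(idx, key=lambda i: pred[i], reverse=True)[:k])
    return len(top_true & top_pred) / k


true = [0.1, 0.9, 0.5, 0.7, 0.3]   # experimental fitness (toy values)
pred = [0.2, 0.8, 0.9, 0.6, 0.1]   # model scores (toy values)
labels = [int(t > 0.5) for t in true]        # binarized at the median of true
pred_labels = [int(p > 0.6) for p in pred]   # binarized at the median of pred
rho = spearman(pred, true)
```

In practice a tested library routine such as `scipy.stats.spearmanr` would normally be used instead; the hand-written version is shown only to make the definitions explicit.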
4. Modalities and Model Families Benchmarked
ProteinGym supports sequence-based, structure-based, and multi-modal/ensemble methods:
- Sequence/Likelihood Models: ESM (PLM family, 650 M–15 B params), Tranception, GEMME, VESPA, DeepSequence. Masked language models (MLMs) use pseudo-perplexity and log-odds (e.g., –PPPL, NLR).
- Structure-Based Models: ESM-IF1 (inverse folding), ProteinMPNN, S2F, S3F, ProtSSN, SaProt, SSEmb; typically leverage AlphaFold2-predicted monomer structures.
- Ensembles/Multi-modal: TranceptEVE, Metalic (meta-learning over context), EvoIF (profile fusion), combining sequence, structure, MSA, and evolutionary signals.
- Diffusion Indel Models: SCISOR, a discrete-time diffusion model with an insertion-only forward process and reverse-time deletion planning (Baron et al., 10 Nov 2025).
- Matrix VAEs: matVAE, integrating sequence, structural priors, and supervised DMS fitting (Honoré et al., 3 Jul 2025).
Specialized architectures are evaluated for robustness across shallow MSAs, multi-mutant/epistatic landscapes, and IDR (intrinsically disordered region)-rich sequences.
5. Empirical Performance and Methodological Impact
ProteinGym has framed the current state of the art for protein fitness prediction. Zero-shot sequence-only baselines (ESM-2 650 M) serve as the reference point; sequence+structure hybrids (S3F) reach a mean Spearman $\rho$ of 0.470; and ensembles (TranceptEVE, EvoIF-MSA) set the top marks at up to 0.518 without further training or additional labels (Fan et al., 8 Oct 2025, Zhang et al., 2 Dec 2024). On indel prediction, SCISOR attains $\rho = 0.573$, exceeding prior generative and autoregressive baselines (Baron et al., 10 Nov 2025).
Function-type stratification reveals that structure-aware methods (e.g., ESM-IF1) notably excel on stability assays, while multi-modal ensembles outperform sequence-only or structure-only models across binding, expression, and organismal fitness. DMS-based fine-tuning (NLR, matENC-DMS + AF) yields double-digit relative gains over zero-shot models without increasing model complexity (Lafita et al., 10 May 2024, Honoré et al., 3 Jul 2025).
Table: Example Spearman $\rho$ Across Model Families (Substitution and Indel Benchmarks)
| Model/Modality | Mean $\rho$ | Stability $\rho$ | Binding $\rho$ |
|---|---|---|---|
| ESM2 OFS-PP (Indels) | 0.574 | 0.582 | ≈0.53 |
| S3F (Seq+Str+Surf) | 0.470 | – | – |
| EvoIF-MSA (Ensemble) | 0.518 | – | – |
| SCISOR (Indels) | 0.573 | – | – |
| ESM-1v NLR-tuned | 0.396 | – | – |
6. Practical Guidelines and Benchmarking Conventions
ProteinGym codifies protocol best practices for reproducible benchmarking:
- Use UniProt-ID as the aggregation level; average per protein, not per individual variant, to avoid overweighting large-assay proteins.
- Preprocess as specified (no silent variants, duplicates averaged, variants without measurement dropped).
- Stratify evaluations by MSA depth, mutation depth, and taxonomic kingdom to expose performance dependencies.
- For sequence retrieval methods, employ jackhmmer over UniRef100, filter by coverage and identity (≤20 % id recommended for out-of-sample assessment).
- For structure-based assays, prefer AlphaFold2 monomer predictions unless confirmed, functional experimental structures are available; mask or exclude low-confidence pLDDT regions.
- Report both zero-shot and fine-tuned metrics, keeping hyperparameters (e.g., hybrid weights α) consistent across evaluations.
- Modality fusion (sequence/structure/MSA), ensemble aggregation, and model ablation are encouraged for robustness breakdowns.
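The preprocessing bullet above (no silent variants, duplicates averaged, unmeasured variants dropped) can be sketched as below, together with median binarization for classification metrics. The record layout and function names are assumptions for illustration, and a "silent" variant is taken here to mean one whose protein sequence equals the wild type.

```python
import math
from collections import defaultdict
from statistics import mean, median


def is_missing(score):
    """True for None or NaN scores."""
    return score is None or (isinstance(score, float) and math.isnan(score))


def preprocess(records, wild_type):
    """Clean raw DMS records given as (mutant_sequence, score) pairs.

    Drops variants without a measurement, drops silent variants
    (sequence identical to wild type), and averages duplicates.
    """
    grouped = defaultdict(list)
    for seq, score in records:
        if is_missing(score) or seq == wild_type:
            continue
        grouped[seq].append(score)
    return {seq: mean(values) for seq, values in grouped.items()}


def binarize(scores, threshold=None):
    """Label variants 1 if above the threshold (median when none is given)."""
    cut = median(scores.values()) if threshold is None else threshold
    return {seq: int(s > cut) for seq, s in scores.items()}


records = [
    ("MKT", 0.0),             # silent: identical to wild type
    ("MAT", None),            # no measurement
    ("MKV", 0.25),
    ("MKV", 0.75),            # duplicate of the previous variant
    ("MAT", 1.0),
    ("MKA", float("nan")),    # no measurement
    ("MKA", -0.5),
]
clean = preprocess(records, wild_type="MKT")
labels = binarize(clean)
```

Passing an explicit `threshold` to `binarize` corresponds to a manually chosen cut-off; leaving it unset falls back to the median.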
7. Limitations, Extensions, and Prospective Directions
ProteinGym’s reference data favor substitution and single-site mutation landscapes; multi-mutant and indel assays remain underrepresented relative to biological diversity. Coverage gaps occur for proteins lacking complete structural, MSA, or DMS annotation, and structure-based methods lose predictive accuracy particularly for IDRs. Automated structure-to-assay matching may misalign with functional states (e.g., active/inactive conformations), warranting manual curation or metadata harmonization (Sharma et al., 23 Apr 2025, Fan et al., 8 Oct 2025).
Emerging research signals several extension opportunities:
- DMS datasets are poised to supplant MSAs for many variant effect prediction tasks, especially under low evolutionary pressure scenarios (e.g., pharmacogenomics) (Honoré et al., 3 Jul 2025).
- Hybrid models integrating unsupervised sequence/structure learning, supervised DMS fitting, and explicit geometric priors are achieving parity with expensive large models.
- Diffusion-based generative approaches (e.g., SCISOR) address indel effect estimation and open efficient pathways for rational protein shrinkage and motif preservation.
- The benchmark motivates new architectures able to handle conformational heterogeneity in IDRs, epistatic interactions, and task-driven generative design under function-specific constraints.
A plausible implication is that standardized, high-diversity benchmarks such as ProteinGym will increasingly define progress in generalizable protein fitness modeling and functional sequence design, aligning predictive architectures with empirical discovery workflows.