ProteinGym Benchmark: Protein Mutation Evaluation

Updated 6 July 2025

ProteinGym Benchmark is a comprehensive evaluation framework that systematically assesses computational models for predicting protein mutation effects using deep mutational scanning assays.
It uses metrics such as Spearman correlation and Top 10 Recall to benchmark zero-shot predictions on more than 2 million mutants from 217 assays.
Innovative techniques such as OFS pseudo-perplexity and multi-scale integration enhance both evaluation efficiency and predictive accuracy for protein engineering applications.

The ProteinGym Benchmark is a large-scale, standardized evaluation framework designed to assess computational methods for predicting the effects of mutations on protein fitness. It has become a central reference for the protein modeling and machine learning communities, supporting the development and comparison of models that address protein engineering, variant effect prediction, and foundational research in computational protein science.

1. Concept and Dataset Composition

ProteinGym is built around the principle of systematic, high-throughput evaluation of protein mutation effect predictors. At its core, the benchmark centers on assays drawn from deep mutational scanning (DMS) experiments. These assays quantify the effects of amino acid substitutions and, in more recent iterations, insertions and deletions (indels) on protein activity, stability, binding, expression, or organismal fitness.

The primary ProteinGym dataset comprises over 2 million mutants collected across 217 unique DMS assays. Each assay is mapped to a reference sequence (usually identified by a UniProt ID) and, where possible, to a matching experimental or predicted three-dimensional structure. The selected proteins span a wide variety of biological functions and topologies, providing the coverage and diversity needed for model benchmarking (Tan et al., 28 Oct 2024, Zhang et al., 2 Dec 2024, Sharma et al., 23 Apr 2025).
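To make the mapping between an assay's reference sequence and its mutants concrete, the sketch below applies substitution strings to a reference sequence. It assumes a notation such as `K2R:A4G` (wild-type residue, 1-based position, mutant residue, with multiple substitutions joined by a colon); the function names `parse_mutant` and `apply_mutant` are hypothetical and not part of any ProteinGym tooling.

```python
import re
from typing import List, Tuple

# One substitution token: wild-type letter, 1-based position, mutant letter.
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutant(mutant: str, sep: str = ":") -> List[Tuple[str, int, str]]:
    """Parse a string such as 'K2R:A4G' into (wild_type, position, mutant) tuples."""
    parsed = []
    for token in mutant.split(sep):
        match = _MUTATION.match(token)
        if match is None:
            raise ValueError(f"Unrecognized mutation token: {token!r}")
        wild_type, position, new_residue = match.groups()
        parsed.append((wild_type, int(position), new_residue))
    return parsed


def apply_mutant(reference: str, mutant: str) -> str:
    """Return the mutated sequence, checking each wild-type residue against the reference."""
    residues = list(reference)
    for wild_type, position, new_residue in parse_mutant(mutant):
        if not 1 <= position <= len(reference):
            raise ValueError(f"Position {position} is outside the reference sequence")
        if reference[position - 1] != wild_type:
            raise ValueError(
                f"Reference has {reference[position - 1]!r} at {position}, not {wild_type!r}"
            )
        residues[position - 1] = new_residue
    return "".join(residues)


# Toy reference sequence, not a real protein.
print(apply_mutant("MKTAY", "K2R:A4G"))
```

Validating the wild-type residue against the reference catches off-by-one errors that arise when an assay's numbering is offset from the stored sequence.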

2. Evaluation Protocol and Metrics

The benchmark employs zero-shot prediction as its main evaluation paradigm. Models are required to predict the fitness consequences of mutations without task-specific retraining, reflecting real-world scenarios such as variant effect interpretation in clinical and engineering contexts.

Performance is primarily measured using the Spearman rank correlation coefficient ($\rho$) between model-predicted scores and experimentally measured fitness values for a given DMS assay. For a model evaluated on $N$ mutants with no tied ranks, this is formalized as:

$\rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$

where $d_i$ is the difference between the predicted and measured ranks of the $i$-th mutant. When ties are present, $\rho$ is computed as the Pearson correlation of the (average) ranks.
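As an illustration of the metric, the following sketch computes the Spearman coefficient for one assay in two ways: as the Pearson correlation of average ranks (which tolerates ties) and via the closed-form rank-difference formula (valid only without ties). The function names and toy scores are illustrative assumptions; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman correlation as the Pearson correlation of average ranks."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("at least two mutants are required")
    rank_p, rank_m = average_ranks(predicted), average_ranks(measured)
    mean_rank = (n + 1) / 2.0  # mean of average ranks, with or without ties
    cov = sum((a - mean_rank) * (b - mean_rank) for a, b in zip(rank_p, rank_m))
    var_p = sum((a - mean_rank) ** 2 for a in rank_p)
    var_m = sum((b - mean_rank) ** 2 for b in rank_m)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_p * var_m) ** 0.5


def spearman_rho_no_ties(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Closed-form rho = 1 - 6 * sum(d_i^2) / (N (N^2 - 1)); assumes no tied values."""
    n = len(predicted)
    rank_p, rank_m = average_ranks(predicted), average_ranks(measured)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_p, rank_m))
    return 1.0 - 6.0 * sum_d2 / (n * (n ** 2 - 1))


# Toy model scores and assay measurements for four mutants.
model_scores = [0.1, 0.4, 0.3, 0.9]
dms_scores = [1.0, 2.0, 3.0, 4.0]
print(spearman_rho(model_scores, dms_scores))
print(spearman_rho_no_ties(model_scores, dms_scores))
```

Because only ranks matter, any monotonic rescaling of the model's scores leaves the result unchanged, which is why raw log-likelihoods can be compared directly with assay measurements.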

A secondary metric, Top 10 Recall, quantifies how many of the true high-fitness variants appear among those ranked highest by the model. It is particularly relevant in protein engineering, where the goal is to screen large variant libraries for beneficial mutations (Sharma et al., 23 Apr 2025).
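One way to compute a recall of this kind is sketched below. It assumes the cutoff is the top 10% of variants by score, applied to both the model ranking and the measured fitness; that cutoff, the rounding rule, and the function name `top_fraction_recall` are assumptions made for illustration rather than details confirmed by the text above.

```python
from typing import Sequence


def top_fraction_recall(
    predicted: Sequence[float],
    measured: Sequence[float],
    fraction: float = 0.10,
) -> float:
    """Fraction of the truly top-ranked variants that the model also ranks in its top set."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    n = len(predicted)
    if n == 0:
        raise ValueError("at least one variant is required")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(n * fraction))  # size of the top set (assumed rounding rule)
    top_predicted = set(sorted(range(n), key=lambda i: predicted[i], reverse=True)[:k])
    top_measured = set(sorted(range(n), key=lambda i: measured[i], reverse=True)[:k])
    return len(top_predicted & top_measured) / k


# Toy library of 20 variants: the model recovers one of the two best variants.
dms_scores = [float(i) for i in range(20)]
model_scores = list(dms_scores)
model_scores[0] = 100.0   # a poor variant scored too highly
model_scores[18] = -1.0   # a good variant missed by the model
print(top_fraction_recall(model_scores, dms_scores))
```

Unlike a rank correlation, this measure ignores how the model orders the bulk of neutral or deleterious variants and looks only at the head of the ranking.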

The most recent benchmarks support evaluation on substitution and indel DMS datasets, enabling assessment of both classic substitution effect predictors and models equipped to handle more complex mutational events (Kantroo et al., 9 Jul 2024).

3. Modeling Paradigms and Comparative Analysis

ProteinGym facilitates comparison among a spectrum of model classes, including:

Protein language models (PLMs): Unsupervised models trained on large protein sequence corpora (e.g., ESM-2, CARP), typically evaluated using pseudo-perplexity or masked-token log-likelihoods (Kantroo et al., 9 Jul 2024, Zhang et al., 2 Dec 2024).
Structure-based models: Approaches like ESM-IF1 that condition residue recovery likelihoods explicitly on three-dimensional backbone structure (predicted via AlphaFold2 or experimentally obtained) (Sharma et al., 23 Apr 2025).
MSA-based and evolutionary models: Approaches such as GEMME that leverage multiple sequence alignments or co-evolution statistics.
Multi-modal and ensemble methods: Models such as TranceptEVE or S3F that integrate sequence, structure, evolutionary, and surface features, or simply average the outputs of uni-modal models (Zhang et al., 2 Dec 2024, Sharma et al., 23 Apr 2025).
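For likelihood-based models, a common zero-shot heuristic scores a variant by the log-likelihood ratio between the mutant and wild-type residues at each mutated position. The sketch below assumes per-position log-probabilities are already available from some model (the `log_probs` table here is toy data), and the function name `log_likelihood_ratio_score` is hypothetical; individual models cited above may score variants differently.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (wild-type residue, 1-based position, mutant residue)
Mutation = Tuple[str, int, str]


def log_likelihood_ratio_score(
    log_probs: Sequence[Dict[str, float]],
    mutations: Sequence[Mutation],
) -> float:
    """Sum of log p(mutant) - log p(wild type) over mutated positions.

    log_probs[i] maps each amino acid to its log-probability at position i + 1.
    Higher scores mean the model finds the variant more plausible than the wild type.
    """
    score = 0.0
    for wild_type, position, new_residue in mutations:
        if not 1 <= position <= len(log_probs):
            raise ValueError(f"Position {position} is outside the scored sequence")
        distribution = log_probs[position - 1]
        score += distribution[new_residue] - distribution[wild_type]
    return score


# Toy log-probabilities for a two-residue sequence over a reduced alphabet.
toy_log_probs: List[Dict[str, float]] = [
    {"A": math.log(0.50), "G": math.log(0.25), "K": math.log(0.25)},
    {"A": math.log(0.10), "G": math.log(0.10), "K": math.log(0.80)},
]
print(log_likelihood_ratio_score(toy_log_probs, [("A", 1, "G")]))
print(log_likelihood_ratio_score(toy_log_probs, [("A", 1, "G"), ("K", 2, "A")]))
```

Summing over positions treats mutations as independent, so this simple form does not capture epistasis between sites.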

ProteinGym has established that multi-modal ensembles, which combine probabilities or scores from diverse representations, can yield state-of-the-art performance, outperforming individual uni-modal models (Sharma et al., 23 Apr 2025).
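A minimal sketch of score-level ensembling follows. It standardizes each model's scores within an assay before averaging, so that models with different output scales contribute equally; this normalization choice, the model labels, and the function names are illustrative assumptions, not the procedure of any specific published ensemble.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def zscore(scores: Sequence[float]) -> List[float]:
    """Standardize scores to zero mean and unit variance (all zeros if constant)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def ensemble_scores(model_scores: Dict[str, Sequence[float]]) -> List[float]:
    """Average standardized per-mutant scores across models for one assay."""
    if not model_scores:
        raise ValueError("at least one model is required")
    lengths = {len(scores) for scores in model_scores.values()}
    if len(lengths) != 1:
        raise ValueError("all models must score the same set of mutants")
    standardized = [zscore(scores) for scores in model_scores.values()]
    return [mean(per_mutant) for per_mutant in zip(*standardized)]


# Toy scores from two hypothetical models on very different scales.
scores = {
    "sequence_model": [-3.2, -0.5, -1.1, 0.4],
    "structure_model": [12.0, 30.0, 18.0, 25.0],
}
print(ensemble_scores(scores))
```

Averaging ranks instead of z-scores is a comparable alternative that is less sensitive to outlier scores.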

Table: Example Model Types in ProteinGym Benchmark

Model Class	Input Modalities	Example Models
Sequence-only	Sequence	ESM-2, CARP
Structure-based	Sequence + Structure	ESM-IF1, SaProt
MSA/Evolution	Sequence + MSA	GEMME
Ensemble/Multi-modal	Sequence + Structure + Evolution + Surface	TranceptEVE, S3F, ProtREM

4. Technical Developments and Innovations

Several technical advances have been integrated and validated through ProteinGym:

One Fell Swoop (OFS) Pseudo-perplexity: This approach allows the estimation of position-wise masked probabilities in a single forward pass through a PLM by mapping unmasked embeddings via small multi-layer perceptrons. OFS reduces computational load substantially and enables efficient evaluation across large mutant libraries and for indels (Kantroo et al., 9 Jul 2024).
Multi-scale Integration: Approaches such as S3F combine sequence embeddings, backbone structure graphs (processed using Geometric Vector Perceptrons), and fine-grained surface representations (extracted by methods like dMaSIF) to improve prediction accuracy for properties influenced by local geometry and epistatic interactions (Zhang et al., 2 Dec 2024).
Retrieval-enhanced modeling: Models like ProtREM integrate sequence, structure, and evolutionary information via a disentangled multi-head cross-attention mechanism, using explicit tokenization of structure and evolutionary logits retrieved from homology databases (Tan et al., 28 Oct 2024).
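For reference, pseudo-perplexity is the exponential of the negative mean log-probability that a model assigns to each native residue given the rest of the sequence. The sketch below computes it from a list of such per-position log-probabilities; how those values are obtained (one masked forward pass per position, or a single-pass estimate as in OFS) is deliberately left outside the code, and the function name is an assumption.

```python
import math
from typing import Sequence


def pseudo_perplexity(native_log_probs: Sequence[float]) -> float:
    """exp(-(1/L) * sum_i log p(x_i | rest of sequence)) for a length-L sequence.

    native_log_probs[i] is the natural-log probability assigned to the residue
    actually present at position i. Lower values mean the model finds the
    sequence more plausible.
    """
    if not native_log_probs:
        raise ValueError("at least one position is required")
    return math.exp(-sum(native_log_probs) / len(native_log_probs))


# Toy values: a model that gives every native residue probability 0.25
# is as uncertain as a uniform choice among four options.
print(pseudo_perplexity([math.log(0.25)] * 8))

# A more confident model yields a lower pseudo-perplexity.
print(pseudo_perplexity([math.log(0.9), math.log(0.8), math.log(0.95)]))
```

With conventional masking, a length-L sequence needs L forward passes to fill this list, which is the cost that single-pass estimation aims to avoid.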

Such advances, enabled by ProteinGym’s comprehensive assay coverage and rigorous evaluation, have clarified where and how new modalities and architectures make a tangible difference to mutation effect prediction.

5. Data Quality, Structural Alignment, and Disordered Regions

ProteinGym systematically aligns sequence, mutation, and structure data; however, several challenges are noted:

The dependency on predicted structures (most notably from AlphaFold2) introduces issues in intrinsically disordered regions (IDRs), where structure-based models may be misled by arbitrary or unreliable coordinate predictions. Fitness prediction accuracy in these regions is typically diminished across all model classes (Sharma et al., 23 Apr 2025).
Models mitigate unreliable coordinates in disordered regions by masking low-confidence positions (e.g., based on pLDDT scores) or by using ensemble predictions from multiple structures. Matching the assay’s functional context (e.g., monomeric vs. complexed state) to the input structure is also critical for maintaining predictive validity.
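The masking idea can be sketched as follows: residues whose pLDDT falls below a threshold have their coordinates replaced by a missing-value marker before being passed to a structure-based model. The threshold of 70, the use of `None` as the mask, and the function name `mask_low_confidence` are assumptions for illustration; a real model would use whatever masking convention its input pipeline defines.

```python
from typing import List, Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def mask_low_confidence(
    coords: Sequence[Coord],
    plddt: Sequence[float],
    threshold: float = 70.0,  # assumed cutoff, not prescribed by the benchmark
) -> List[Optional[Coord]]:
    """Replace coordinates of residues with pLDDT below the threshold by None."""
    if len(coords) != len(plddt):
        raise ValueError("coords and plddt must have one entry per residue")
    return [xyz if score >= threshold else None for xyz, score in zip(coords, plddt)]


def masked_fraction(masked: Sequence[Optional[Coord]]) -> float:
    """Share of residues whose coordinates were masked."""
    if not masked:
        raise ValueError("at least one residue is required")
    return sum(1 for xyz in masked if xyz is None) / len(masked)


# Toy C-alpha coordinates and confidence scores for five residues.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0), (15.2, 0.0, 0.0)]
confidence = [92.1, 88.4, 45.0, 38.7, 71.3]
masked = mask_low_confidence(ca_coords, confidence)
print(masked)
print(masked_fraction(masked))
```

Reporting the masked fraction per protein helps flag assays where most of the structure is low-confidence and sequence-only scoring may be the safer choice.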

The accurate mapping of fitness data to appropriate structural models remains an area of ongoing refinement in the benchmark’s development.

6. Biological and Engineering Insights

Beyond serving as a technical yardstick, ProteinGym has yielded biological insights:

It has highlighted that ancestral protein sequences reconstructed from phylogenetic data are predicted to be more stable (lower pseudo-perplexity) than extant sequences, quantitatively confirming earlier theoretical and experimental claims (Kantroo et al., 9 Jul 2024).
By comparing models on diverse assay types (e.g., stability, activity, binding), the benchmark reveals that the added value of structure or evolutionary information depends on the property being predicted and the underlying data quality.
The engineering-oriented Top 10 Recall metric, together with case studies on real-world protein engineering (e.g., VHH antibody and DNA polymerase design), demonstrates direct applications to enzyme optimization and variant screening (Tan et al., 28 Oct 2024).

7. Relationship to Other Benchmarks and Future Directions

ProteinGym focuses on mutation effect prediction, primarily in a zero-shot regime, and is now recognized as only one part of a broader ecosystem. Recent benchmarks such as PFMBench (Gao et al., 1 Jun 2025), ProteinBench (Ye et al., 10 Sep 2024), Protap (Yan et al., 1 Jun 2025), and DeepProtein (Xie et al., 2 Oct 2024) expand the scope to dozens of downstream tasks: protein–protein or protein–ligand interaction, annotation, functional and structural property prediction, enzyme catalysis, and targeted degradation.

Comparative studies have found that while ProteinGym is central for zero-shot generalization benchmarking, its rankings and performance metrics do not always correlate well with those from fine-tuning or multi-task evaluations. This suggests that full protein model assessment now requires a suite of benchmarks to capture the generalization and specialization required for future protein science applications (Gao et al., 1 Jun 2025).

Likely directions for ProteinGym and its successors include continued expansion of DMS coverage (including more complex mutational landscapes), more robust handling of structure and disorder, and alignment with evaluation frameworks that account for user-specific objectives and diverse application scenarios.

In summary, the ProteinGym Benchmark is a foundational component in the evaluation and development of computational protein fitness predictors, from unsupervised protein language models to advanced multi-modal and structure-aware approaches. Its ongoing evolution and integration with broader benchmarks support progress in computational protein design, interpretation of variation, and the practical deployment of next-generation protein models.