
Kermut: GPR for Protein Variant Effects

Updated 29 May 2026
  • Kermut is a Gaussian process regression model that leverages a composite kernel integrating pre-trained sequence embeddings and local structural features to predict protein variant effects.
  • It employs an analytic GP framework to compute predictive means and uncertainties, achieving state-of-the-art performance on deep mutational scanning assays.
  • Ablation studies and empirical results highlight the importance of combining sequence and structure data for reliable protein engineering and uncertainty quantification.

Kermut is a Gaussian process regression (GPR) model specifically constructed for supervised prediction of protein variant effects, with particular emphasis on providing both state-of-the-art predictive performance and robust uncertainty quantification. It introduces a composite kernel that integrates pre-trained sequence embeddings with explicit local structural environment information, achieving best-in-class accuracy on deep mutational scanning (DMS) assays while delivering rigorous, well-calibrated uncertainty estimates through its GP posterior (Groth et al., 2024).

1. Composite Kernel Construction

Kermut's core methodological innovation lies in its composite kernel, which combines a structure-based kernel and a sequence-based kernel to model the similarity between protein variants. For two variants $\mathbf{x}$ and $\mathbf{x}'$, the full kernel is defined as:

$$k(\mathbf{x},\mathbf{x}') = \pi\,k_{\rm struct}(\mathbf{x},\mathbf{x}') + (1 - \pi)\,k_{\rm seq}(\mathbf{x},\mathbf{x}')$$

where $\pi\in[0,1]$ is a learned mixing weight.

  • Structure kernel $k_{\rm struct}$: Quantifies similarity of mutated sites' local 3D environments. It evaluates all pairs $(i \in M,\,j \in M')$ of mutated sites by multiplying three component kernels, then sums the products over all individual mutation pairings:
    • $k_H$, a Hellinger kernel over per-site amino-acid distributions computed from inverse-folding (ProteinMPNN),
    • $k_p$, a kernel comparing ProteinMPNN-assigned mutation log-probabilities,
    • $k_d$, an exponential decay in Euclidean (Cα) distance between mutant sites in the wildtype structure.
  • Sequence kernel $k_{\rm seq}$: A squared-exponential (Gaussian) kernel applied to mean-pooled ESM-2 embeddings of each sequence, capturing broader context and epistatic information.

This construction allows simultaneous modeling of both local structural context and long-range sequence dependencies, leveraging state-of-the-art representation learning.
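The kernel construction described in this section can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the exact functional forms of the three structure components (exponentials of a Hellinger distance, an absolute log-probability difference, and a Euclidean distance), the hyperparameter names `g_h`, `g_p`, `g_d`, `length_scale`, and the toy inputs are all assumptions chosen to match the verbal description.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def k_struct(sites_a, sites_b, site_dist, log_prob, coords,
             g_h=1.0, g_p=1.0, g_d=1.0):
    """Sum over all mutated-site pairs of k_H * k_p * k_d (assumed forms).

    sites_a, sites_b: lists of (position, amino_acid_index) mutations.
    site_dist: (L, 20) per-site amino-acid distributions.
    log_prob:  (L, 20) per-site log-probabilities of each amino acid.
    coords:    (L, 3) C-alpha coordinates of the wildtype structure.
    """
    total = 0.0
    for i, aa_i in sites_a:
        for j, aa_j in sites_b:
            k_h = np.exp(-g_h * hellinger(site_dist[i], site_dist[j]))
            k_p = np.exp(-g_p * abs(log_prob[i, aa_i] - log_prob[j, aa_j]))
            k_d = np.exp(-g_d * np.linalg.norm(coords[i] - coords[j]))
            total += k_h * k_p * k_d
    return total

def k_seq(emb_a, emb_b, length_scale=1.0):
    """Squared-exponential kernel on (mean-pooled) sequence embeddings."""
    sq_dist = np.sum((emb_a - emb_b) ** 2)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def kermut_like_kernel(var_a, var_b, site_dist, log_prob, coords, pi=0.5):
    """Convex combination pi * k_struct + (1 - pi) * k_seq."""
    ks = k_struct(var_a["sites"], var_b["sites"], site_dist, log_prob, coords)
    kq = k_seq(var_a["emb"], var_b["emb"])
    return pi * ks + (1.0 - pi) * kq

# Toy inputs (random stand-ins for ProteinMPNN outputs and ESM-2 embeddings).
rng = np.random.default_rng(0)
seq_len, n_aa, emb_dim = 6, 20, 8
site_dist = rng.dirichlet(np.ones(n_aa), size=seq_len)
log_prob = np.log(site_dist)
coords = rng.normal(size=(seq_len, 3))

v1 = {"sites": [(1, 3)], "emb": rng.normal(size=emb_dim)}
v2 = {"sites": [(1, 3), (4, 7)], "emb": rng.normal(size=emb_dim)}

k12 = kermut_like_kernel(v1, v2, site_dist, log_prob, coords)
k21 = kermut_like_kernel(v2, v1, site_dist, log_prob, coords)
k11 = kermut_like_kernel(v1, v1, site_dist, log_prob, coords)
```

Because the structure term sums over site pairs, a double mutant compared with a single mutant accumulates one contribution per pairing, which is how the sketch extends from single to multiple mutations.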

2. Gaussian Process Regression Framework

Kermut imposes a zero-mean GP prior over the unknown fitness function:

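$$f \sim \mathcal{GP}\big(0,\ k(\mathbf{x},\mathbf{x}')\big)$$

Given training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ and Gaussian observation noise with variance $\sigma_n^2$, the GP posterior allows analytic computation of predictive mean $\mu_*$ and variance $\sigma_*^2$ at a test variant $\mathbf{x}_*$:

$$\mu_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{y}, \qquad \sigma_*^2 = k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{k}_*$$

where $K$ is the training kernel matrix, $\mathbf{k}_*$ the vector of kernel similarities between the training variants and the test variant, and $\mathbf{y}$ the vector of training targets.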

This closed-form framework provides the basis for both prediction and uncertainty quantification.
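As a minimal sketch of this closed-form computation, the following NumPy code evaluates the zero-mean GP predictive mean and variance through a Cholesky factorization. The function name `gp_posterior`, the one-dimensional squared-exponential toy kernel `rbf`, and the toy data are illustrative assumptions; in Kermut the kernel matrices would come from the composite kernel instead.

```python
import numpy as np

def gp_posterior(K, k_star, k_ss, y, noise_var):
    """Predictive mean and latent variance of a zero-mean GP.

    K:      (N, N) kernel matrix between training points.
    k_star: (N, M) kernel values between training and test points.
    k_ss:   (M,)   prior variances k(x*, x*) at the test points.
    y:      (N,)   training targets.
    """
    n = K.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    # alpha = (K + noise_var * I)^{-1} y via two triangular systems.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = k_ss - np.sum(v ** 2, axis=0)
    return mean, var

def rbf(a, b, ell=1.0):
    """Toy squared-exponential kernel on scalar inputs (assumption)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

x_train = np.array([-2.0, -1.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.array([0.0, 10.0])  # one training location, one far away

mean, var = gp_posterior(
    rbf(x_train, x_train), rbf(x_train, x_test),
    np.ones(len(x_test)), y_train, noise_var=1e-6,
)
```

At the test point that coincides with a training point the variance collapses towards zero, while far from the data the prediction reverts to the prior (mean 0, variance 1). That reversion is the behaviour that makes the posterior variance usable as an uncertainty estimate.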

3. Uncertainty Quantification and Calibration

The model's posterior variance explicitly quantifies uncertainty for each prediction. Calibration is evaluated using the expected calibration error (ECE), the expected normalized calibration error (ENCE), and the coefficient of variation of the predicted uncertainties.

Kermut achieves low ECEs and, in aggregate, a positive correlation between predicted uncertainties and observed errors, indicating reliable overall uncertainty quantification. At the level of individual predictions, however, the uncertainties are noisier, as revealed by the ENCE and coefficient-of-variation metrics. For comparison, MC-dropout as an uncertainty estimator in deep neural architectures (e.g., ProteinNPT) yields over-confident, poorly calibrated uncertainties, highlighting an advantage of GPR posteriors: they provide usable uncertainty estimates without post-hoc calibration.
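To make the calibration metrics concrete, here is a hedged sketch of two common regression definitions: an interval-coverage ECE and a binned ENCE. These follow widely used formulations and are assumptions about the general shape of the metrics, not a reproduction of the exact evaluation protocol behind the reported numbers. The synthetic data and the choice of nine confidence levels and five bins are also illustrative.

```python
import numpy as np
from statistics import NormalDist

def regression_ece(y, mu, sigma, n_levels=9):
    """Mean absolute gap between nominal and observed coverage
    of central Gaussian prediction intervals."""
    gaps = []
    for p in np.linspace(0.1, 0.9, n_levels):
        z = NormalDist().inv_cdf(0.5 + float(p) / 2.0)
        covered = np.mean(np.abs(y - mu) <= z * sigma)
        gaps.append(abs(covered - p))
    return float(np.mean(gaps))

def ence(y, mu, sigma, n_bins=5):
    """Bin predictions by predicted uncertainty and compare the
    root-mean predicted variance with the empirical RMSE per bin."""
    order = np.argsort(sigma)
    terms = []
    for idx in np.array_split(order, n_bins):
        rmv = np.sqrt(np.mean(sigma[idx] ** 2))
        rmse = np.sqrt(np.mean((y[idx] - mu[idx]) ** 2))
        terms.append(abs(rmv - rmse) / rmv)
    return float(np.mean(terms))

# Synthetic check: errors drawn with exactly the predicted standard deviation.
rng = np.random.default_rng(0)
n = 5000
mu = rng.normal(size=n)
sigma = rng.uniform(0.5, 2.0, size=n)
y = mu + sigma * rng.normal(size=n)

ece_ok, ence_ok = regression_ece(y, mu, sigma), ence(y, mu, sigma)
# Shrinking sigma by 4x mimics an over-confident model.
ece_over, ence_over = regression_ece(y, mu, sigma / 4), ence(y, mu, sigma / 4)
```

Both metrics stay small for the calibrated predictions and grow sharply once the predicted standard deviations are artificially shrunk, which is the pattern an over-confident estimator would show.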

4. Implementation Details and Practical Workflow

Kermut involves the following pre-processing and modelling pipeline:

  • Variants are described relative to a wildtype sequence, with mutated sites one-hot encoded and structurally encoded via inverse-folding (ProteinMPNN) to yield per-site amino-acid distributions and mutation log-probabilities.
  • The wildtype structure (e.g., from PDB or AlphaFold2) is used to calculate Cα distances.
  • Sequence-level embeddings are extracted via mean-pooling the final hidden layer of ESM-2 (1280 dimensions).
  • Key hyperparameters include the kernel mixing weight ($\pi$), a structure-kernel scale, three structure-kernel length-scales (one for each of $k_H$, $k_p$, and $k_d$), the sequence-kernel width, the GP noise variance, and optional zero-shot mean parameters.
  • All hyperparameters are tuned by maximizing the GP marginal likelihood (type II maximum likelihood).
  • Posterior inference is carried out by Cholesky factorization in $\mathcal{O}(N^3)$ time, where $N$ is the number of training variants. This is typically tractable for DMS experiments, whose training sets reach several thousand variants. For larger datasets, sparse or inducing-point approximations are suggested as possible extensions.
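The type II maximum-likelihood step can be illustrated with a small sketch that evaluates the GP log marginal likelihood and picks the mixing weight that maximizes it. Everything specific here is an assumption made for illustration: the two toy kernel matrices `K_a` and `K_b` stand in for the structure and sequence kernels, the noise variance is fixed, and a grid search replaces whatever optimizer the authors actually use.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """Log marginal likelihood of a zero-mean GP with Gaussian noise."""
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log|K + noise_var * I| = 2 * sum(log(diag(L)))
    return float(-0.5 * y @ alpha
                 - np.sum(np.log(np.diag(L)))
                 - 0.5 * n * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
n = 40
x = rng.uniform(-3.0, 3.0, size=n)
d2 = (x[:, None] - x[None, :]) ** 2
K_a = np.exp(-0.5 * d2)          # smooth toy kernel (stand-in for k_struct)
K_b = np.exp(-0.5 * d2 / 0.01)   # short-range toy kernel (stand-in for k_seq)
noise_var = 0.01

# Draw targets from a GP whose kernel mixes the two components.
K_true = 0.8 * K_a + 0.2 * K_b
y = np.linalg.cholesky(K_true + noise_var * np.eye(n)) @ rng.normal(size=n)

# Grid search over the mixing weight (a simplification of gradient-based tuning).
grid = np.linspace(0.0, 1.0, 21)
scores = [log_marginal_likelihood(p * K_a + (1.0 - p) * K_b, y, noise_var)
          for p in grid]
best_pi = float(grid[int(np.argmax(scores))])
```

The same Cholesky factor serves both the marginal likelihood and the posterior prediction, so the cubic cost is paid once per hyperparameter setting.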

5. Empirical Performance and Ablation Analysis

Kermut is benchmarked on 217 DMS assays from the ProteinGym dataset, across three train/test splitting strategies: random, modulo (every 5th position withheld), and contiguous (contiguous slice withheld). Performance, as measured by median Spearman correlation and MSE, is as follows (mean over splits):

| Method | Contiguous | Modulo | Random | Average (Spearman) |
|---|---|---|---|---|
| Kermut | 0.610 | 0.633 | 0.744 | 0.662 |
| ProteinNPT | 0.547 | 0.564 | 0.730 | 0.613 |

Compared with the tested baselines (One-Hot, ESM-1v, DeepSequence, MSAT, TranceptEVE, ProteinNPT, ESM-1v/MSAT+MLP), Kermut attains the highest Spearman correlation on all three splits, and its MSE is correspondingly lower.
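The three splitting strategies used in the benchmark (random, modulo, contiguous) can be illustrated with a short sketch. It assumes single-mutant variants identified by a 0-based mutated position and five folds. The function `assign_folds` and its fold-assignment rules are a plausible reading of the description, not the benchmark's actual code.

```python
import random

def assign_folds(positions, seq_len, scheme, n_folds=5, seed=0):
    """Return a fold index for each variant, given its mutated position."""
    if scheme == "random":
        rng = random.Random(seed)
        return [rng.randrange(n_folds) for _ in positions]
    if scheme == "modulo":
        # Every n_folds-th position lands in the same fold.
        return [p % n_folds for p in positions]
    if scheme == "contiguous":
        # Split the sequence into n_folds contiguous blocks.
        block = -(-seq_len // n_folds)  # ceiling division
        return [p // block for p in positions]
    raise ValueError(f"unknown scheme: {scheme}")

positions = list(range(20))  # one toy variant per position
modulo_folds = assign_folds(positions, 20, "modulo")
contig_folds = assign_folds(positions, 20, "contiguous")
random_folds = assign_folds(positions, 20, "random")

# Positions withheld when fold 0 is the test fold.
held_out_modulo = [p for p, f in zip(positions, modulo_folds) if f == 0]
held_out_contig = [p for p, f in zip(positions, contig_folds) if f == 0]
```

In the modulo and contiguous schemes every test variant sits at a position never seen in training, which is why those splits are harder than the random one and why local structural similarity between sites becomes more useful there.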

Ablation studies quantify kernel component contributions. Removing the structure kernel reduces Spearman correlation by approximately 0.065, removing the sequence kernel by 0.045, omitting inter-residue distance ($k_d$) by 0.035, and various other kernel subcomponents by smaller but still substantial margins. The structure kernel is particularly important for challenging splits, while the sequence kernel consistently improves performance. The zero-shot mean yields marginal, but positive, benefits.

6. Strengths, Limitations, and Practical Guidance

Strengths

  • Composite modeling: Integrates sequence-level (pretrained protein language model) and site-specific structure information in a principled, modular kernel.
  • Analytic uncertainty: Provides closed-form uncertainty quantification that is better calibrated than MC-dropout-based estimates.
  • Empirical accuracy: Achieves state-of-the-art performance on large, diverse DMS datasets under multiple train/test splits.
  • Efficiency: Training and inference are tractable for several thousand variants, requiring minutes rather than the hours associated with fine-tuning deep models.

Limitations

  • Structural prerequisites: Requires a fixed wildtype structure; not applicable to insertions/deletions.
  • Epistasis modeling: Limited to additive pairwise kernel components; does not capture higher-order epistatic interactions. Extrapolation to highly mutated variants may be degraded.
  • Computational scaling: Cubic in data size ($\mathcal{O}(N^3)$ for $N$ training variants); mitigated for modest DMS datasets but potentially limiting for very large panels.

Recommendations

  • Use Kermut as a baseline in supervised DMS prediction where both sequence and structure are available.
  • Employ the GP posterior uncertainty in Bayesian experimental design and active learning to prioritize protein variant assays.
  • Optimize hyperparameters via marginal likelihood; alternatives such as domain-specific priors (e.g., half-t on length-scales) may improve performance or interpretability in some cases.
  • Extend or adapt kernel components as new pretrained architectures or physics-based features become available.
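As one hedged illustration of the active-learning recommendation, the sketch below ranks candidate variants with an upper confidence bound (UCB) built from the GP predictive mean and variance. UCB is a common acquisition rule chosen here for illustration; the source does not prescribe a specific acquisition function, and the name `select_batch`, the trade-off parameter `beta`, and the toy numbers are assumptions.

```python
import numpy as np

def select_batch(mean, var, batch_size=3, beta=1.0):
    """Return indices of the candidates with the highest UCB score,
    where score = predictive mean + beta * predictive std."""
    ucb = mean + beta * np.sqrt(np.maximum(var, 0.0))
    return np.argsort(-ucb)[:batch_size]

# Toy predictive means and variances for five candidate variants.
mean = np.array([0.2, 0.9, 0.5, 0.1, 0.7])
var = np.array([0.01, 0.01, 0.64, 1.00, 0.04])

explore = select_batch(mean, var, batch_size=2, beta=1.0)
exploit = select_batch(mean, var, batch_size=2, beta=0.0)
```

With `beta=1.0` the rule favours the uncertain candidates at indices 2 and 3, whereas `beta=0.0` reduces to picking the highest predicted fitness (indices 1 and 4). This is why calibrated uncertainties matter when prioritizing assays.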

7. Significance and Future Directions

Kermut demonstrates that a carefully constructed Gaussian process with a composite kernel can simultaneously deliver accurate predictions and robust, interpretable uncertainty estimates in protein variant effect prediction. The method's modular kernel design facilitates principled ablation, interpretation, and future extension. As advanced pretrained protein language and structure models continue to emerge, analogous composite kernels are expected to further enhance model expressivity and practical impact in protein engineering and computational biology (Groth et al., 2024).
