Kermut: GPR for Protein Variant Effects

Updated 29 May 2026

Kermut is a Gaussian process regression model that leverages a composite kernel integrating pre-trained sequence embeddings and local structural features to predict protein variant effects.
It employs an analytic GP framework to compute predictive means and uncertainties, achieving state-of-the-art performance on deep mutational scanning assays.
Ablation studies and empirical results highlight the importance of combining sequence and structure data for reliable protein engineering and uncertainty quantification.

Kermut is a Gaussian process regression (GPR) model specifically constructed for supervised prediction of protein variant effects, with particular emphasis on providing both state-of-the-art predictive performance and robust uncertainty quantification. It introduces a composite kernel that integrates pre-trained sequence embeddings with explicit local structural environment information, achieving best-in-class accuracy on deep mutational scanning (DMS) assays while delivering rigorous, well-calibrated uncertainty estimates through its GP posterior (Groth et al., 2024).

1. Composite Kernel Construction

Kermut's core methodological innovation lies in its composite kernel, which combines a structure-based kernel and a sequence-based kernel to model the similarity between protein variants. For two variants $\mathbf{x}$ and $\mathbf{x}'$, the full kernel is defined as:

$k(\mathbf{x},\mathbf{x}') = \pi\,k_{\rm struct}(\mathbf{x},\mathbf{x}') + (1 - \pi)\,k_{\rm seq}(\mathbf{x},\mathbf{x}')$

where $\pi\in[0,1]$ is a learned mixing weight.

Structure kernel $k_{\rm struct}$: Quantifies the similarity of the mutated sites’ local 3D environments. It evaluates all pairs $(i \in M,\,j \in M')$ of mutated sites, where $M$ and $M'$ denote the sets of mutated sites in $\mathbf{x}$ and $\mathbf{x}'$, by multiplying:
- $k_H$, a Hellinger kernel over per-site amino-acid distributions computed from inverse folding (ProteinMPNN),
- $k_p$, a kernel comparing ProteinMPNN-assigned mutation log-probabilities,
- $k_d$, an exponential decay in Euclidean (Cα) distance between mutant sites in the wildtype structure.

The total structure kernel sums these products over all mutation pairings.

Sequence kernel $k_{\rm seq}$: A squared-exponential (Gaussian) kernel applied to mean-pooled ESM-2 embeddings of each sequence, capturing broader context and epistatic information.

This construction allows simultaneous modeling of both local structural context and long-range sequence dependencies, leveraging state-of-the-art representation learning.
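The construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the functional forms of $k_H$ and $k_p$ (exponentials of a scaled distance), the parameter names `gamma_h`, `gamma_p`, `gamma_d`, and `lengthscale`, the data layout (a variant as a list of `(site, amino_acid)` mutations plus an embedding), and the random toy inputs standing in for ProteinMPNN and ESM-2 outputs are all assumptions made for the example.

```python
import numpy as np

def hellinger_kernel(p, q, gamma_h=1.0):
    """Assumed form: exp(-gamma_h * Hellinger distance) between two distributions."""
    d_h = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    return np.exp(-gamma_h * d_h)

def structure_kernel(muts_a, muts_b, site_probs, ca_dist,
                     gamma_h=1.0, gamma_p=1.0, gamma_d=0.1):
    """Sum of k_H * k_p * k_d over all pairs of mutated sites (i in M, j in M')."""
    total = 0.0
    for i, aa_i in muts_a:
        for j, aa_j in muts_b:
            k_h = hellinger_kernel(site_probs[i], site_probs[j], gamma_h)
            # Assumed form of k_p: compare the log-probabilities of the two mutations.
            diff = np.log(site_probs[i, aa_i]) - np.log(site_probs[j, aa_j])
            k_p = np.exp(-gamma_p * abs(diff))
            # Exponential decay in C-alpha distance between the two sites.
            k_d = np.exp(-gamma_d * ca_dist[i, j])
            total += k_h * k_p * k_d
    return total

def sequence_kernel(emb_a, emb_b, lengthscale=1.0):
    """Squared-exponential kernel on (mean-pooled) sequence embeddings."""
    sq_dist = np.sum((emb_a - emb_b) ** 2)
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

def composite_kernel(var_a, var_b, site_probs, ca_dist, pi=0.5):
    """k = pi * k_struct + (1 - pi) * k_seq for two (mutations, embedding) variants."""
    muts_a, emb_a = var_a
    muts_b, emb_b = var_b
    k_struct = structure_kernel(muts_a, muts_b, site_probs, ca_dist)
    k_seq = sequence_kernel(emb_a, emb_b)
    return pi * k_struct + (1.0 - pi) * k_seq

# Toy inputs (random stand-ins for inverse-folding and language-model outputs).
rng = np.random.default_rng(0)
n_sites = 5
site_probs = rng.dirichlet(np.ones(20), size=n_sites)   # per-site amino-acid distributions
coords = rng.normal(size=(n_sites, 3))                  # fake C-alpha coordinates
ca_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
emb = rng.normal(size=(2, 8))                           # fake sequence embeddings

variant_a = ([(0, 3)], emb[0])            # single mutant at site 0
variant_b = ([(1, 7), (2, 4)], emb[1])    # double mutant at sites 1 and 2

k_ab = composite_kernel(variant_a, variant_b, site_probs, ca_dist, pi=0.3)
k_ba = composite_kernel(variant_b, variant_a, site_probs, ca_dist, pi=0.3)
k_aa = composite_kernel(variant_a, variant_a, site_probs, ca_dist, pi=0.3)
```

With these assumed forms, a single mutant compared with itself gives a kernel value of 1 for any mixing weight, since each factor and the sequence kernel evaluate to 1 at zero distance.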

2. Gaussian Process Regression Framework

Kermut imposes a zero-mean GP prior over the unknown fitness function:

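$f \sim \mathcal{GP}\big(0,\, k(\mathbf{x},\mathbf{x}')\big)$

Given training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ and Gaussian observation noise with variance $\sigma_n^2$, the GP posterior allows analytic computation of the predictive mean $\mu_*$ and variance $\sigma_*^2$ at a test variant $\mathbf{x}_*$:

$\mu_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{y}, \qquad \sigma_*^2 = k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{k}_*$

where $K$ is the training kernel matrix, $\mathbf{k}_*$ the vector of kernel similarities between the training variants and the test variant, and $\mathbf{y}$ the vector of training targets.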

This closed-form framework provides the basis for both prediction and uncertainty quantification.
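As an illustration, the standard closed-form GP predictive mean and variance can be computed with a Cholesky factorization as sketched below. The function name `gp_posterior`, the one-dimensional toy inputs, and the stand-in squared-exponential kernel are assumptions for the example; in Kermut the kernel matrices would come from the composite kernel instead.

```python
import numpy as np

def gp_posterior(K, k_star, k_star_star, y, noise_var):
    """Zero-mean GP predictive mean and variance.

    K:            (N, N) kernel matrix between training points
    k_star:       (N, M) kernel values between training and test points
    k_star_star:  (M,)   prior variances k(x*, x*) at the test points
    y:            (N,)   training targets
    noise_var:    Gaussian observation noise variance
    """
    n = K.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    # alpha = (K + noise_var * I)^{-1} y via two solves against the Cholesky factor.
    # (scipy.linalg.solve_triangular would exploit the triangular structure.)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = k_star_star - np.sum(v ** 2, axis=0)
    return mean, var

def rbf(a, b, lengthscale=1.0):
    """Stand-in squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

x_train = np.array([0.0, 1.0, 2.5])
y_train = np.sin(x_train)
x_test = np.array([0.5, 2.0])

K = rbf(x_train, x_train)
k_star = rbf(x_train, x_test)
mean, var = gp_posterior(K, k_star, np.ones(len(x_test)), y_train, noise_var=1e-2)
```

The returned variance is that of the latent function; adding `noise_var` gives the predictive variance of a noisy observation.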

3. Uncertainty Quantification and Calibration

The model's posterior variance explicitly quantifies uncertainty for each prediction. Calibration is evaluated using:

Expected calibration error (ECE): Assesses coverage accuracy of computed confidence intervals via reliability diagrams.
Expected normalized calibration error (ENCE): Directly compares empirical errors and predicted uncertainties.

In aggregate, Kermut achieves low ECEs and a positive correlation between predicted uncertainties and observed errors, indicating reliable overall uncertainty quantification. At the level of individual predictions, however, the uncertainties are noisier, as revealed by the ENCE and coefficient-of-variation metrics. For comparison, MC-dropout as an uncertainty estimator in deep neural architectures (e.g., ProteinNPT) yields over-confident, poorly calibrated uncertainties, highlighting an advantage of GP posteriors: they provide usable uncertainty estimates without post-hoc calibration.
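Both metrics can be sketched for Gaussian predictive distributions as follows. This is one common formulation and not necessarily the exact protocol used for Kermut: the confidence-level grid, the equal-size binning by predicted uncertainty, and the function names `regression_ece` and `ence` are assumptions for the example, and the data are synthetic.

```python
import math
import numpy as np

def regression_ece(y, mu, sigma, n_levels=10):
    """Mean absolute gap between nominal and observed coverage of central intervals."""
    z = np.abs(y - mu) / sigma
    # Smallest central-interval level that still covers each observation:
    # P(|Z| <= z) = erf(z / sqrt(2)) for a standard normal Z.
    level = np.array([math.erf(v / math.sqrt(2.0)) for v in z])
    expected = np.linspace(0.1, 1.0, n_levels)
    observed = np.array([(level <= p).mean() for p in expected])
    return float(np.mean(np.abs(observed - expected)))

def ence(y, mu, sigma, n_bins=5):
    """Bin by predicted std; compare root-mean-variance with RMSE in each bin."""
    order = np.argsort(sigma)
    terms = []
    for idx in np.array_split(order, n_bins):
        rmv = np.sqrt(np.mean(sigma[idx] ** 2))
        rmse = np.sqrt(np.mean((y[idx] - mu[idx]) ** 2))
        terms.append(abs(rmv - rmse) / rmv)
    return float(np.mean(terms))

# Synthetic check: errors drawn with the stated sigma are well calibrated;
# reporting sigma / 3 instead mimics an over-confident model.
rng = np.random.default_rng(0)
n = 20000
mu = rng.normal(size=n)
sigma = rng.uniform(0.5, 2.0, size=n)
y = mu + sigma * rng.normal(size=n)

ece_good, ence_good = regression_ece(y, mu, sigma), ence(y, mu, sigma)
ece_over, ence_over = regression_ece(y, mu, sigma / 3.0), ence(y, mu, sigma / 3.0)
```

On the synthetic data, both metrics are close to zero for the calibrated case and clearly larger for the over-confident one.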

4. Implementation Details and Practical Workflow

Kermut involves the following pre-processing and modelling pipeline:

Variants are described relative to a wildtype sequence, with mutated sites one-hot encoded and structurally encoded via inverse-folding (ProteinMPNN) to yield per-site amino-acid distributions and mutation log-probabilities.
The wildtype structure (e.g., from PDB or AlphaFold2) is used to calculate Cα distances.
Sequence-level embeddings are extracted via mean-pooling the final hidden layer of ESM-2 (1280 dimensions).
Key hyperparameters include the kernel mixing weight ($\pi$), the structure kernel scale, three structure kernel length-scales (one each for $k_H$, $k_p$, and $k_d$), the sequence kernel width, the GP noise variance, and optional zero-shot mean parameters.
All hyperparameters are tuned by maximizing the GP marginal likelihood (type II maximum likelihood).
Posterior inference is carried out by Cholesky factorization in $\mathcal{O}(N^3)$ time, where $N$ is the number of training variants; this is typically tractable for DMS experiments, where $N$ is modest. For larger datasets, sparse or inducing-point approximations are suggested as possible extensions.
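The marginal-likelihood objective behind type II maximum likelihood can be sketched as below. To keep the example short, only the mixing weight $\pi$ is selected, by a simple grid search, which is a simplification swapped in for illustration; the source does not specify the optimizer. The toy kernels `K_struct_toy` and `K_seq_toy`, the fixed noise variance, and the function names are assumptions for the example.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """log p(y | hyperparameters) for a zero-mean GP with Gaussian noise."""
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))   # equals -0.5 * log det(K + noise_var * I)
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return float(data_fit + complexity + constant)

def rbf(x, lengthscale):
    """Stand-in squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale ** 2)

# Toy setup: two stand-in kernels mixed with a "true" weight of 0.7.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 40)
K_struct_toy, K_seq_toy = rbf(x, 0.3), rbf(x, 2.0)
noise_var = 0.1
K_true = 0.7 * K_struct_toy + 0.3 * K_seq_toy + noise_var * np.eye(len(x))
y = np.linalg.cholesky(K_true) @ rng.normal(size=len(x))

# Grid search over the mixing weight; other hyperparameters are held fixed.
pi_grid = np.linspace(0.0, 1.0, 11)
scores = [
    log_marginal_likelihood(p * K_struct_toy + (1.0 - p) * K_seq_toy, y, noise_var)
    for p in pi_grid
]
best_pi = float(pi_grid[int(np.argmax(scores))])
```

In practice all kernel hyperparameters would be optimized jointly, typically with gradients of this objective; the Cholesky factor computed here is the $\mathcal{O}(N^3)$ step in each evaluation.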

5. Empirical Performance and Ablation Analysis

Kermut is benchmarked on 217 DMS assays from the ProteinGym dataset, across three train/test splitting strategies: random, modulo (every 5th position withheld), and contiguous (a contiguous slice of positions withheld). Performance is measured by median Spearman correlation and MSE. The Spearman correlations for each split, together with their average over the three splits, are as follows:

Method	Contiguous	Modulo	Random	Average (Spearman)
Kermut	0.610	0.633	0.744	0.662
ProteinNPT	0.547	0.564	0.730	0.613

Among the tested baselines (One-Hot, ESM-1v, DeepSequence, MSAT, TranceptEVE, ProteinNPT, ESM-1v/MSAT+MLP), Kermut achieves the best accuracy on all three splits. MSE is similarly improved.
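The three splitting strategies can be illustrated with a small fold-assignment helper. This is a sketch under assumptions: five folds, single-site variants identified by their mutated position, and equal-width position blocks for the contiguous scheme. The function `assign_folds` is hypothetical and is not the ProteinGym implementation.

```python
import numpy as np

def assign_folds(positions, scheme, n_folds=5, seed=0):
    """Assign each variant (given its mutated position) to a cross-validation fold."""
    positions = np.asarray(positions)
    if scheme == "random":
        # Folds are drawn independently of position.
        rng = np.random.default_rng(seed)
        return rng.integers(0, n_folds, size=len(positions))
    if scheme == "modulo":
        # Every n_folds-th position lands in the same fold.
        return positions % n_folds
    if scheme == "contiguous":
        # Split the span of positions into n_folds equal-width blocks.
        lo, hi = positions.min(), positions.max()
        edges = np.linspace(lo, hi + 1, n_folds + 1)
        return np.searchsorted(edges, positions, side="right") - 1
    raise ValueError(f"unknown scheme: {scheme}")

positions = np.arange(1, 101)   # toy protein with one variant per position 1..100
folds_random = assign_folds(positions, "random")
folds_modulo = assign_folds(positions, "modulo")
folds_contig = assign_folds(positions, "contiguous")

# Hold out fold 0 as the test set under the contiguous scheme.
test_mask = folds_contig == 0
train_positions, test_positions = positions[~test_mask], positions[test_mask]
```

Under the modulo and contiguous schemes the held-out positions never appear in training, which is why these splits probe generalization to unseen sites more strongly than the random split.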

Ablation studies quantify kernel component contributions. Removing the structure kernel reduces Spearman correlation by approximately 0.065, removing the sequence kernel by 0.045, and omitting the inter-residue distance kernel ($k_d$) by 0.035; removing other kernel subcomponents yields smaller but still substantial reductions. The structure kernel is particularly important for challenging splits, while the sequence kernel consistently improves performance. The zero-shot mean yields marginal, but positive, benefits.

6. Strengths, Limitations, and Practical Guidance

Strengths

Composite modeling: Integrates sequence-level information (from a pretrained protein language model) and site-specific structure information in a principled, modular kernel.
Analytic uncertainty: Provides closed-form uncertainty quantification, outperforming post-hoc UQ methods (e.g., MC-dropout) in calibration.
Empirical accuracy: Achieves state-of-the-art performance on large, diverse DMS datasets under multiple train/test splits.
Efficiency: Training and inference are tractable for several thousand variants, requiring minutes rather than hours associated with deep model fine-tuning.

Limitations

Structural prerequisites: Requires a fixed wildtype structure; not applicable to insertions/deletions.
Epistasis modeling: Limited to additive pairwise kernel components; does not capture higher-order epistatic interactions. Extrapolation to highly mutated variants may be degraded.
Computational scaling: Cubic in data size ($\mathcal{O}(N^3)$); mitigated for modest DMS datasets but potentially limiting for very large panels.

Recommendations

Use Kermut as a baseline in supervised DMS prediction where both sequence and structure are available.
Employ the GP posterior uncertainty in Bayesian experimental design and active learning to prioritize protein variant assays.
Optimize hyperparameters via marginal likelihood; alternatives such as domain-specific priors (e.g., half-t on length-scales) may improve performance or interpretability in some cases.
Extend or adapt kernel components as new pretrained architectures or physics-based features become available.
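As one illustration of using the GP posterior to prioritize assays, the sketch below ranks candidate variants with an upper-confidence-bound (UCB) score. UCB is a standard acquisition rule chosen here for the example; the source does not prescribe a specific acquisition function, and the function name `select_batch_ucb`, the weight `beta`, and the toy numbers are assumptions.

```python
import numpy as np

def select_batch_ucb(mean, var, batch_size=2, beta=1.0):
    """Rank candidates by predicted mean plus beta times predicted std; return top indices."""
    score = mean + beta * np.sqrt(np.maximum(var, 0.0))
    return np.argsort(-score)[:batch_size]

# Toy posterior over four candidate variants.
mean = np.array([0.2, 0.5, 0.1, 0.4])
var = np.array([0.01, 0.0, 0.64, 0.04])

explore = select_batch_ucb(mean, var, batch_size=2, beta=1.0)   # rewards high uncertainty
exploit = select_batch_ucb(mean, var, batch_size=2, beta=0.0)   # ranks by mean only
```

Larger values of `beta` favor uncertain candidates (exploration), while `beta = 0` reduces to greedy selection on the predicted mean.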

7. Significance and Future Directions

Kermut demonstrates that a carefully constructed Gaussian process with a composite kernel can simultaneously deliver accurate predictions and robust, interpretable uncertainty estimates in protein variant effect prediction. The method's modular kernel design facilitates principled ablation, interpretation, and future extension. As advanced pretrained protein language and structure models continue to emerge, analogous composite kernels are expected to further enhance model expressivity and practical impact in protein engineering and computational biology (Groth et al., 2024).

References (1)

Kermut: Composite kernel regression for protein variant effects (2024)
