Kermut: GPR for Protein Variant Effects
- Kermut is a Gaussian process regression model that leverages a composite kernel integrating pre-trained sequence embeddings and local structural features to predict protein variant effects.
- It employs an analytic GP framework to compute predictive means and uncertainties, achieving state-of-the-art performance on deep mutational scanning assays.
- Ablation studies and empirical results highlight the importance of combining sequence and structure data for reliable protein engineering and uncertainty quantification.
Kermut is a Gaussian process regression (GPR) model for supervised prediction of protein variant effects, designed to combine state-of-the-art predictive accuracy with calibrated uncertainty quantification. Its composite kernel integrates pre-trained sequence embeddings with explicit local structural environment information, and its GP posterior supplies the uncertainty estimates. On deep mutational scanning (DMS) assays it outperforms the baselines it was compared against (Groth et al., 2024).
1. Composite Kernel Construction
Kermut's core methodological innovation lies in its composite kernel, which combines a structure-based kernel and a sequence-based kernel to model the similarity between protein variants. For two variants $x$ and $x'$, the full kernel is defined as:

$$k(x, x') = \pi\, k_{\text{struct}}(x, x') + (1 - \pi)\, k_{\text{seq}}(x, x'),$$

where $\pi \in [0, 1]$ is a learned mixing weight.
- Structure kernel $k_{\text{struct}}$: Quantifies the similarity of the local 3D environments of mutated sites. For every pair of mutated sites, one from each variant, it multiplies three terms:
  - $k_H$, a Hellinger kernel over per-site amino-acid distributions computed by an inverse-folding model (ProteinMPNN),
  - $k_p$, a kernel comparing ProteinMPNN-assigned mutation log-probabilities,
  - $k_d$, an exponential decay in Euclidean (Cα) distance between the mutated sites in the wildtype structure.

  The total structure kernel sums these products over all pairs of mutated sites.
- Sequence kernel $k_{\text{seq}}$: A squared-exponential (Gaussian) kernel applied to mean-pooled ESM-2 embeddings of each sequence, capturing broader context and epistatic information.
This construction allows simultaneous modeling of both local structural context and long-range sequence dependencies, leveraging state-of-the-art representation learning.
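The construction above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact functional forms (exponentials of a negative distance for each structure term), the parameter names `pi`, `lam`, `g1`, `g2`, `g3` and `length_scale`, and the representation of a variant as a list of `(site_index, amino_acid_index)` pairs are all hypothetical choices. Random arrays stand in for ProteinMPNN distributions, Cα coordinates and ESM-2 embeddings.

```python
import numpy as np

def hellinger_kernel(p, q, gamma):
    """exp(-gamma * Hellinger distance) between two discrete distributions."""
    d_h = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    return np.exp(-gamma * d_h)

def structure_kernel(var_a, var_b, site_probs, ca_coords, lam, g1, g2, g3):
    """Sum k_H * k_p * k_d over all pairs of mutated sites, scaled by lam.

    var_a, var_b: lists of (site_index, amino_acid_index) mutations.
    site_probs:   (L, 20) per-site amino-acid distributions (stand-in for ProteinMPNN).
    ca_coords:    (L, 3) C-alpha coordinates of the wildtype structure.
    """
    total = 0.0
    for i, aa_i in var_a:
        for j, aa_j in var_b:
            k_h = hellinger_kernel(site_probs[i], site_probs[j], g1)
            # Assumed form: exponential decay in the mutation log-probability gap.
            logp_gap = abs(np.log(site_probs[i, aa_i]) - np.log(site_probs[j, aa_j]))
            k_p = np.exp(-g2 * logp_gap)
            # Exponential decay in Euclidean distance between the two sites.
            k_d = np.exp(-g3 * np.linalg.norm(ca_coords[i] - ca_coords[j]))
            total += k_h * k_p * k_d
    return lam * total

def sequence_kernel(emb_a, emb_b, length_scale):
    """Squared-exponential kernel on mean-pooled sequence embeddings."""
    return np.exp(-np.sum((emb_a - emb_b) ** 2) / (2.0 * length_scale ** 2))

def composite_kernel(var_a, var_b, emb_a, emb_b, site_probs, ca_coords,
                     pi=0.5, lam=1.0, g1=1.0, g2=1.0, g3=0.1, length_scale=1.0):
    """Convex combination of the structure and sequence kernels."""
    k_struct = structure_kernel(var_a, var_b, site_probs, ca_coords, lam, g1, g2, g3)
    k_seq = sequence_kernel(emb_a, emb_b, length_scale)
    return pi * k_struct + (1.0 - pi) * k_seq

# Toy data: random stand-ins for model outputs and structure.
rng = np.random.default_rng(0)
seq_len = 30
site_probs = rng.dirichlet(np.ones(20), size=seq_len)
ca_coords = rng.normal(scale=10.0, size=(seq_len, 3))
emb_a, emb_b = rng.normal(size=8), rng.normal(size=8)

single = [(3, 7)]             # one mutation at site 3
double = [(3, 7), (12, 1)]    # two mutations
k_ab = composite_kernel(single, double, emb_a, emb_b, site_probs, ca_coords)
k_ba = composite_kernel(double, single, emb_b, emb_a, site_probs, ca_coords)
k_aa = composite_kernel(single, single, emb_a, emb_a, site_probs, ca_coords)
print(k_ab, k_ba, k_aa)
```

Because the structure term is a sum over site pairs, its value grows with the number of mutations in each variant, whereas the sequence term is bounded by 1. That difference is one reason a learned scale and mixing weight are useful.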
2. Gaussian Process Regression Framework
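Kermut imposes a zero-mean GP prior over the unknown fitness function:

$$f(x) \sim \mathcal{GP}\big(0,\, k(x, x')\big).$$

Given training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ and Gaussian observation noise with variance $\sigma_n^2$, the GP posterior allows analytic computation of the predictive mean $\mu_*$ and variance $\sigma_*^2$ at a test variant $x_*$:

$$\mu_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad \sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*,$$

where $K$ is the $N \times N$ training kernel matrix, $\mathbf{k}_*$ is the vector of kernel similarities between the training variants and $x_*$, and $\mathbf{y}$ is the vector of training labels.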
This closed-form framework provides the basis for both prediction and uncertainty quantification.
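The closed-form posterior can be sketched in NumPy using a Cholesky factorization. This is a generic zero-mean GP example, not Kermut's code: the squared-exponential kernel on random 3-dimensional inputs is a stand-in for the composite kernel, and the function names are hypothetical.

```python
import numpy as np

def se_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between rows of A and rows of B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * length_scale ** 2))

def gp_posterior(K, K_star, k_star_diag, y, noise_var):
    """Predictive mean and variance of a zero-mean GP.

    K:           (N, N) training kernel matrix.
    K_star:      (N, M) kernel values between training and test points.
    k_star_diag: (M,) prior variances k(x_*, x_*) of the test points.
    """
    n = K.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    # alpha = (K + noise_var * I)^{-1} y via two solves against the Cholesky factor.
    # np.linalg.solve is a general solver; a triangular solver would be faster.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    V = np.linalg.solve(L, K_star)                 # (N, M)
    mean = K_star.T @ alpha
    var = k_star_diag - np.sum(V ** 2, axis=0)     # variance of the latent function
    return mean, var

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
y_train = np.sin(X_train[:, 0]) + 0.05 * rng.normal(size=40)
X_test = rng.normal(size=(5, 3))
noise_var = 0.01

K = se_kernel(X_train, X_train)
K_star = se_kernel(X_train, X_test)
k_ss = np.ones(len(X_test))    # for this kernel, k(x, x) = 1
mean, var = gp_posterior(K, K_star, k_ss, y_train, noise_var)
print(mean, var)
```

The returned variance describes the latent function; adding `noise_var` gives the predictive variance of a noisy observation.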
3. Uncertainty Quantification and Calibration
The model's posterior variance explicitly quantifies uncertainty for each prediction. Calibration is evaluated using:
- Expected calibration error (ECE): Assesses coverage accuracy of computed confidence intervals via reliability diagrams.
- Expected normalized calibration error (ENCE): Directly compares empirical errors and predicted uncertainties.
Kermut achieves low ECE and, in aggregate, a positive correlation between predicted uncertainties and observed errors, indicating reliable overall uncertainty quantification. Instance-level uncertainty is less reliable: ENCE and coefficient-of-variation metrics show that per-variant uncertainty estimates are noisy. For comparison, MC-dropout applied to deep neural architectures (e.g., ProteinNPT) yields over-confident, poorly calibrated uncertainties, which highlights an advantage of GP posteriors: they provide usable uncertainty estimates without post-hoc calibration.
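The two calibration metrics can be illustrated with a short NumPy sketch. The definitions used here are common formulations for Gaussian regression and are assumptions about the details: ECE is taken as the mean absolute gap between nominal and empirical coverage of central confidence intervals, and ENCE as the mean relative gap between root-mean-variance and RMSE over bins sorted by predicted uncertainty. The binning, confidence levels and function names are hypothetical and may differ from those used in the paper.

```python
import math
import numpy as np

def ece_regression(y, mean, std, levels=np.linspace(0.1, 0.9, 9)):
    """Mean |empirical coverage - nominal level| over central Gaussian intervals."""
    z = np.abs(y - mean) / std
    # Smallest central-interval confidence level that still contains each y.
    central_prob = np.array([math.erf(v / math.sqrt(2.0)) for v in z])
    coverage = np.array([(central_prob <= p).mean() for p in levels])
    return float(np.mean(np.abs(coverage - levels)))

def ence(y, mean, std, n_bins=10):
    """Mean relative gap between predicted (RMV) and observed (RMSE) error per bin."""
    order = np.argsort(std)            # bin by predicted uncertainty
    terms = []
    for idx in np.array_split(order, n_bins):
        rmv = np.sqrt(np.mean(std[idx] ** 2))
        rmse = np.sqrt(np.mean((y[idx] - mean[idx]) ** 2))
        terms.append(abs(rmv - rmse) / rmv)
    return float(np.mean(terms))

# Synthetic check: errors drawn exactly from the predicted Gaussians.
rng = np.random.default_rng(0)
n = 5000
mean = rng.normal(size=n)
std = rng.uniform(0.5, 2.0, size=n)
y = mean + std * rng.normal(size=n)

ece_good, ence_good = ece_regression(y, mean, std), ence(y, mean, std)
# Over-confident model: same predictions, but reported std is five times too small.
ece_over, ence_over = ece_regression(y, mean, 0.2 * std), ence(y, mean, 0.2 * std)
print(ece_good, ece_over, ence_good, ence_over)
```

On the synthetic data, both metrics are close to zero for the well-specified predictions and much larger when the reported standard deviations are shrunk, which is the over-confidence pattern described above.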
4. Implementation Details and Practical Workflow
Kermut involves the following pre-processing and modelling pipeline:
- Variants are described relative to a wildtype sequence, with mutated sites one-hot encoded and structurally encoded via inverse-folding (ProteinMPNN) to yield per-site amino-acid distributions and mutation log-probabilities.
- The wildtype structure (e.g., from PDB or AlphaFold2) is used to calculate Cα distances.
- Sequence-level embeddings are extracted via mean-pooling the final hidden layer of ESM-2 (1280 dimensions).
- Key hyperparameters include the kernel mixing weight ($\pi$), the structure kernel scale ($\lambda$), three structure kernel length-scales ($\gamma_1, \gamma_2, \gamma_3$), the sequence kernel width ($\ell$), the GP noise variance, and optional zero-shot mean parameters.
- All hyperparameters are tuned by maximizing the GP marginal likelihood (type II maximum likelihood).
- Posterior inference is carried out by Cholesky factorization in $O(N^3)$ time for $N$ training variants, which is tractable at typical DMS dataset sizes (up to several thousand variants). For larger datasets, sparse or inducing-point approximations are suggested as possible extensions.
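
The type II maximum likelihood step in the pipeline above can be sketched as follows. This is a simplified illustration, not the authors' training code: it uses a single squared-exponential kernel on random inputs as a stand-in for the composite kernel, and it selects one length-scale by grid search, whereas a gradient-based optimiser over all hyperparameters would normally be used. All names are hypothetical.

```python
import numpy as np

def neg_log_marginal_likelihood(K, y, noise_var):
    """Negative log marginal likelihood of a zero-mean GP with Gaussian noise."""
    n = y.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(n))     # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + noise*I)^{-1} y
    data_fit = 0.5 * y @ alpha
    complexity = np.sum(np.log(np.diag(L)))               # 0.5 * log det(K + noise*I)
    return data_fit + complexity + 0.5 * n * np.log(2.0 * np.pi)

def se_kernel(X, length_scale):
    """Squared-exponential kernel matrix for the rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * length_scale ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                 # stand-in for variant features
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
noise_var = 0.01

# Grid search over one hyperparameter; the smallest objective wins.
grid = [0.1, 0.3, 1.0, 3.0, 10.0]
nll = [neg_log_marginal_likelihood(se_kernel(X, ls), y, noise_var) for ls in grid]
best = grid[int(np.argmin(nll))]
print(dict(zip(grid, nll)), best)
```

The objective balances a data-fit term against a complexity penalty (the log-determinant), which is why maximizing the marginal likelihood does not simply pick the most flexible kernel.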
5. Empirical Performance and Ablation Analysis
Kermut is benchmarked on 217 DMS assays from the ProteinGym dataset, across three train/test splitting strategies: random, modulo (every 5th position withheld), and contiguous (contiguous slice withheld). Performance is measured by median Spearman correlation and MSE. The Spearman results per split are shown below; the final column is the mean over the three splits:
| Method | Contiguous | Modulo | Random | Average (Spearman) |
|---|---|---|---|---|
| Kermut | 0.610 | 0.633 | 0.744 | 0.662 |
| ProteinNPT | 0.547 | 0.564 | 0.730 | 0.613 |
Compared with the tested baselines (One-Hot, ESM-1v, DeepSequence, MSAT, TranceptEVE, ProteinNPT, ESM-1v/MSAT+MLP), Kermut achieves the highest Spearman correlation on all three splits. MSE improves similarly.
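The three splitting strategies and the Spearman metric can be sketched as follows. This is an illustrative reading of the descriptions above, assuming single-mutant variants indexed by their mutated position and five folds. The function names are hypothetical, the exact fold construction in ProteinGym may differ, and the Spearman helper ignores ties.

```python
import numpy as np

def random_folds(positions, n_folds=5, seed=0):
    """Assign each variant to a fold uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_folds, size=len(positions))

def modulo_folds(positions, n_folds=5):
    """Every n_folds-th position shares a fold, so held-out sites are interleaved."""
    return np.asarray(positions) % n_folds

def contiguous_folds(positions, seq_len, n_folds=5):
    """Cut the sequence into n_folds contiguous blocks of positions."""
    positions = np.asarray(positions)
    return np.minimum(positions * n_folds // seq_len, n_folds - 1)

def spearman_no_ties(a, b):
    """Spearman correlation as the Pearson correlation of ranks (no tie handling)."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

positions = np.arange(10)          # mutated position of each single-mutant variant
mod = modulo_folds(positions)
contig = contiguous_folds(positions, seq_len=10)
rand = random_folds(positions)
test_mask = contig == 0            # hold out the first contiguous block
print(mod, contig, rand, positions[test_mask])
```

The modulo and contiguous schemes withhold every variant at a held-out position, so the model must extrapolate to unseen sites; this is why they are harder than the random split.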
Ablation studies quantify kernel component contributions. Removing the structure kernel reduces Spearman correlation by approximately 0.065, removing the sequence kernel by 0.045, omitting the inter-residue distance term ($k_d$) by 0.035, and removing various other kernel subcomponents by smaller but still substantial margins. The structure kernel is particularly important for challenging splits, while the sequence kernel consistently improves performance. The zero-shot mean yields marginal, but positive, benefits.
6. Strengths, Limitations, and Practical Guidance
Strengths
- Composite modeling: Integrates sequence-level information (from a pretrained protein language model) and site-specific structure information in a principled, modular kernel.
- Analytic uncertainty: Provides closed-form uncertainty quantification that is better calibrated than MC-dropout estimates from deep models.
- Empirical accuracy: Achieves state-of-the-art performance on large, diverse DMS datasets under multiple train/test splits.
- Efficiency: Training and inference are tractable for several thousand variants, requiring minutes rather than the hours associated with fine-tuning deep models.
Limitations
- Structural prerequisites: Requires a fixed wildtype structure; not applicable to insertions/deletions.
- Epistasis modeling: Limited to additive pairwise kernel components; does not capture higher-order epistatic interactions. Extrapolation to highly mutated variants may be degraded.
- Computational scaling: Cubic in data size ($O(N^3)$); manageable for modest DMS datasets but potentially limiting for very large panels.
Recommendations
- Use Kermut as a baseline in supervised DMS prediction where both sequence and structure are available.
- Employ the GP posterior uncertainty in Bayesian experimental design and active learning to prioritize protein variant assays.
- Optimize hyperparameters via marginal likelihood; alternatives such as domain-specific priors (e.g., half-t on length-scales) may improve performance or interpretability in some cases.
- Extend or adapt kernel components as new pretrained architectures or physics-based features become available.
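
As one illustration of the active-learning recommendation above, the sketch below ranks candidate variants with an upper confidence bound (UCB) acquisition function. UCB is a generic choice that is not prescribed by the source, and the function name, the `beta` trade-off parameter and the toy numbers are hypothetical.

```python
import numpy as np

def ucb_select(mean, var, batch_size, beta=2.0):
    """Pick the candidates with the highest mean + beta * std score.

    mean, var: GP predictive mean and variance for each candidate variant.
    beta:      larger values favour exploration of uncertain candidates.
    """
    score = mean + beta * np.sqrt(var)
    return np.argsort(-score)[:batch_size]

# Toy predictions for three candidate variants.
mean = np.array([0.0, 1.0, 0.5])
var = np.array([4.0, 0.0, 0.01])

explore = ucb_select(mean, var, batch_size=2, beta=2.0)   # uncertainty matters
exploit = ucb_select(mean, var, batch_size=2, beta=0.0)   # mean only
print(explore, exploit)
```

With `beta=2.0` the highly uncertain first candidate is ranked first despite its low predicted mean, while `beta=0.0` reduces to greedy selection on the mean.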
7. Significance and Future Directions
Kermut demonstrates that a carefully constructed Gaussian process with a composite kernel can simultaneously deliver accurate predictions and robust, interpretable uncertainty estimates in protein variant effect prediction. The method's modular kernel design facilitates principled ablation, interpretation, and future extension. As advanced pretrained protein language and structure models continue to emerge, analogous composite kernels are expected to further enhance model expressivity and practical impact in protein engineering and computational biology (Groth et al., 2024).