Improving Protein Optimization with Smoothed Fitness Landscapes (2307.00494v3)

Published 2 Jul 2023 in q-bio.BM, cs.LG, q-bio.QM, and stat.ML

Abstract: The ability to engineer novel proteins with higher fitness for a desired property would be revolutionary for biotechnology and medicine. Modeling the combinatorially large space of sequences is infeasible; prior methods often constrain optimization to a small mutational radius, but this drastically limits the design space. Instead of heuristics, we propose smoothing the fitness landscape to facilitate protein optimization. First, we formulate protein fitness as a graph signal then use Tikunov regularization to smooth the fitness landscape. We find optimizing in this smoothed landscape leads to improved performance across multiple methods in the GFP and AAV benchmarks. Second, we achieve state-of-the-art results utilizing discrete energy-based models and MCMC in the smoothed landscape. Our method, called Gibbs sampling with Graph-based Smoothing (GGS), demonstrates a unique ability to achieve 2.5 fold fitness improvement (with in-silico evaluation) over its training set. GGS demonstrates potential to optimize proteins in the limited data regime. Code: https://github.com/kirjner/GGS

References (44)

Citations (10)

View on Semantic Scholar

Summary

The paper introduces Gibbs sampling with Graph-based Smoothing (GGS) that smooths protein fitness landscapes using Tikunov regularization for efficient optimization.
The paper demonstrates that GGS achieves up to five-fold higher fitness in GFP and competitive AAV capsid datasets compared to baseline methods.
The method leverages a graph-based model and gradient-guided sampling to propose improved mutations, promising significant advancements in protein design.

Improving Protein Optimization with Smoothed Fitness Landscapes

The paper "Improving protein optimization with smoothed fitness landscapes" presents a novel approach to protein engineering by addressing the challenges posed by the highly complex and noisy landscape of protein sequences. The authors propose a method termed Gibbs sampling with Graph-based Smoothing (GGS) that leverages Tikunov regularization to smooth the protein fitness landscape, facilitating more efficient and effective optimization of protein sequences for desirable functions.

Methodological Advances

Protein optimization involves the enhancement of proteins for specific traits, such as catalytic activity or fluorescence. Traditional approaches have been hindered by the vast combinatorial space of sequences and the rugged fitness landscapes that result from non-linear interactions among amino acids. The proposed method innovatively circumvents these limitations by conceptualizing protein fitness data as a graph signal and applying regularization techniques to produce a smoothed fitness landscape. Here are the major components of the approach:

Fitness Landscape Smoothing: The fitness landscape is expressed as a graph where nodes are sequences, and edges represent sequence similarity. By applying Tikunov regularization, the authors ensure that the landscape is smoothed, reducing the noise observed in raw experimental data.
Graph-based Model Training: The smoothed fitness values are used to train a predictive model, which is then utilized to infer the fitness of new sequences – an essential step for in silico evaluation.
Sampling with Gibbs and Gradients: The method employs a sampling strategy based on Gibbs With Gradients (GWG). This technique involves proposing new sequences by making mutations likely to increase fitness according to the smoothed model, using information derived from the gradients of the predictive model.

Results and Implications

The authors evaluate GGS on two protein datasets: Green Fluorescent Proteins (GFP) and Adeno-Associated Virus (AAV) capsids, demonstrating state-of-the-art results in increasing the fitness of these proteins compared to previous methodologies. GGS consistently outperformed other baselines, with significant performance improvements observed. For instance, the method achieved up to five-fold higher fitness in GFP optimization tasks than the next best method.

The practical implications of this research are considerable. By improving the ability to navigate protein fitness landscapes, GGS can significantly advance protein design in biotechnology and therapeutic industries, leading to more efficient discovery and engineering processes. The theoretical implications suggest that regularization techniques, typically used in regression contexts, can be effectively adapted to improve fitness landscape modeling in high-dimensional optimization problems.

Future Directions

The proposed approach demonstrates particular efficacy in sparse and noisy data regimes, which are typical in biological datasets. Future work may explore extending this framework with more sophisticated graph structures or incorporate it into a more extensive pipeline involving experimental validation to continually refine the model's predictions. Additionally, further research could investigate the applicability of this technique to other domains of discrete optimization beyond protein engineering.

GGS represents a promising evolution in computational protein design, offering an effective strategy to cope with the challenges inherent in biochemical complexity. As protein engineering becomes increasingly important in diverse domains of science, methodologies like GGS that enhance computational predictions with robust mathematical frameworks will play pivotal roles in translating computational potential into real-world applications.

PDF Markdown

Related Papers

GitHub

GitHub - kirjner/GGS: This repository implements Gibbs sampling with Graph-based Smoothing (38 stars)

Tweets

https://twitter.com/json_yim/status/1798080898824651058

https://twitter.com/AIHealthMIT/status/1750191540650246550

https://twitter.com/mcgovernmit/status/1775528808671842664

https://twitter.com/Pastel/status/1765061309723275271

https://twitter.com/Pastel/status/1754774172859850871

https://twitter.com/josecamoessilva/status/1775649775968457072