Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
194 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
46 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Improving Protein Optimization with Smoothed Fitness Landscapes (2307.00494v3)

Published 2 Jul 2023 in q-bio.BM, cs.LG, q-bio.QM, and stat.ML

Abstract: The ability to engineer novel proteins with higher fitness for a desired property would be revolutionary for biotechnology and medicine. Modeling the combinatorially large space of sequences is infeasible; prior methods often constrain optimization to a small mutational radius, but this drastically limits the design space. Instead of heuristics, we propose smoothing the fitness landscape to facilitate protein optimization. First, we formulate protein fitness as a graph signal then use Tikunov regularization to smooth the fitness landscape. We find optimizing in this smoothed landscape leads to improved performance across multiple methods in the GFP and AAV benchmarks. Second, we achieve state-of-the-art results utilizing discrete energy-based models and MCMC in the smoothed landscape. Our method, called Gibbs sampling with Graph-based Smoothing (GGS), demonstrates a unique ability to achieve 2.5 fold fitness improvement (with in-silico evaluation) over its training set. GGS demonstrates potential to optimize proteins in the limited data regime. Code: https://github.com/kirjner/GGS

Definition Search Book Streamline Icon: https://streamlinehq.com
References (44)
  1. The adaptive landscape of a metallo-enzyme is shaped by environment-dependent epistasis. Nature Communications, 12(1):3867, 2021.
  2. Model-based reinforcement learning for biological sequence design. 2020.
  3. Frances H Arnold. Design by directed evolution. Accounts of chemical research, 31(3):125–131, 1998.
  4. Frances H Arnold. Directed evolution: bringing new chemistry to life. Angewandte Chemie International Edition, 57(16):4143–4148, 2018.
  5. Low-n protein engineering with data-efficient deep learning. Nature methods, 18(4):389–396, 2021.
  6. Conditioning by adaptive sampling for robust design. In International conference on machine learning, pages 773–782. PMLR, 2019.
  7. On the sparsity of fitness functions and implications for learning. Proceedings of the National Academy of Sciences, 119(1):e2109649118, 2022.
  8. Deep diversification of an aav capsid protein by machine learning. Nature Biotechnology, 39(6):691–696, 2021.
  9. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022.
  10. Protein design via deep learning. Briefings in bioinformatics, 23(3):bbac102, 2022.
  11. Plug & play directed evolution of proteins with gradient-based discrete mcmc. Machine Learning: Science and Technology, 4(2):025014, 2023.
  12. The expected time to cross extended fitness plateaus. Theoretical Population Biology, 129:54–67, 2019. ISSN 0040-5809. doi: https://doi.org/10.1016/j.tpb.2019.03.008. URL https://www.sciencedirect.com/science/article/pii/S0040580918301011. Special issue in honor of Marcus Feldman’s 75th birthday.
  13. Oops i took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pages 3831–3841. PMLR, 2021.
  14. Illuminating protein space with a programmable generative model. bioRxiv, pages 2022–12, 2022.
  15. Biological sequence design with GFlowNets. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 9786–9801. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/jain22a.html.
  16. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  17. Protein sequence design in a latent space via model-based reinforcement learning.
  18. Learning from weak and noisy labels for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(3):486–500, 2016.
  19. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  20. Local latent space bayesian optimization over structured inputs. Advances in Neural Information Processing Systems, 35:34505–34518, 2022.
  21. Machine learning in enzyme engineering. ACS Catalysis, 10(2):1210–1223, 2019.
  22. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 34:29287–29303, 2021.
  23. Daniel Müllner. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378, 2011.
  24. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In International Conference on Machine Learning, pages 16990–17017. PMLR, 2022.
  25. Extrapolative controlled sequence generation via iterative refinement. arXiv preprint arXiv:2303.04562, 2023.
  26. Using alphafold to predict the impact of single mutations on protein stability and function. Plos one, 18(3):e0282689, 2023.
  27. Evaluating protein transfer learning with tape. In Advances in Neural Information Processing Systems, 2019.
  28. Msa transformer. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8844–8856. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/rao21a.html.
  29. S James Remington. Green fluorescent protein: a perspective. Protein Science, 20(9):1509–1519, 2011.
  30. Proximal exploration for model-guided protein sequence design. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 18520–18536. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/ren22a.html.
  31. Local fitness landscape of the green fluorescent protein. Nature, 533(7603):397–401, 2016.
  32. Adalead: A simple and robust adaptive greedy search algorithm for sequence design. arXiv preprint arXiv:2010.02141, 2020.
  33. Accelerating bayesian optimization for biological sequence design with denoising autoencoders. arXiv preprint arXiv:2203.12742, 2022.
  34. Path auxiliary proposal for mcmc in discrete space. In International Conference on Learning Representations, 2022.
  35. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  36. Conservative objective models for effective offline model-based optimization. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10358–10368. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/trabucco21a.html.
  37. Design-bench: Benchmarks for data-driven offline model-based optimization. CoRR, abs/2202.08450, 2022. URL https://arxiv.org/abs/2202.08450.
  38. Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem. International Conference on Learning Representations, 2023.
  39. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv, 2022.
  40. The reparameterization trick for acquisition functions. arXiv preprint arXiv:1712.00424, 2017.
  41. Se (3) diffusion model with application to protein backbone generation. arXiv preprint arXiv:2302.02277, 2023.
  42. Giacomo Zanella. Informed proposals for local mcmc in discrete spaces. Journal of the American Statistical Association, 115(530):852–865, 2020.
  43. A langevin-like sampler for discrete distributions. International Conference on Machine Learning, 2022.
  44. Learning with local and global consistency. Advances in neural information processing systems, 16, 2003.
Citations (10)

Summary

  • The paper introduces Gibbs sampling with Graph-based Smoothing (GGS) that smooths protein fitness landscapes using Tikunov regularization for efficient optimization.
  • The paper demonstrates that GGS achieves up to five-fold higher fitness in GFP and competitive AAV capsid datasets compared to baseline methods.
  • The method leverages a graph-based model and gradient-guided sampling to propose improved mutations, promising significant advancements in protein design.

Improving Protein Optimization with Smoothed Fitness Landscapes

The paper "Improving protein optimization with smoothed fitness landscapes" presents a novel approach to protein engineering by addressing the challenges posed by the highly complex and noisy landscape of protein sequences. The authors propose a method termed Gibbs sampling with Graph-based Smoothing (GGS) that leverages Tikunov regularization to smooth the protein fitness landscape, facilitating more efficient and effective optimization of protein sequences for desirable functions.

Methodological Advances

Protein optimization involves the enhancement of proteins for specific traits, such as catalytic activity or fluorescence. Traditional approaches have been hindered by the vast combinatorial space of sequences and the rugged fitness landscapes that result from non-linear interactions among amino acids. The proposed method innovatively circumvents these limitations by conceptualizing protein fitness data as a graph signal and applying regularization techniques to produce a smoothed fitness landscape. Here are the major components of the approach:

  1. Fitness Landscape Smoothing: The fitness landscape is expressed as a graph where nodes are sequences, and edges represent sequence similarity. By applying Tikunov regularization, the authors ensure that the landscape is smoothed, reducing the noise observed in raw experimental data.
  2. Graph-based Model Training: The smoothed fitness values are used to train a predictive model, which is then utilized to infer the fitness of new sequences – an essential step for in silico evaluation.
  3. Sampling with Gibbs and Gradients: The method employs a sampling strategy based on Gibbs With Gradients (GWG). This technique involves proposing new sequences by making mutations likely to increase fitness according to the smoothed model, using information derived from the gradients of the predictive model.

Results and Implications

The authors evaluate GGS on two protein datasets: Green Fluorescent Proteins (GFP) and Adeno-Associated Virus (AAV) capsids, demonstrating state-of-the-art results in increasing the fitness of these proteins compared to previous methodologies. GGS consistently outperformed other baselines, with significant performance improvements observed. For instance, the method achieved up to five-fold higher fitness in GFP optimization tasks than the next best method.

The practical implications of this research are considerable. By improving the ability to navigate protein fitness landscapes, GGS can significantly advance protein design in biotechnology and therapeutic industries, leading to more efficient discovery and engineering processes. The theoretical implications suggest that regularization techniques, typically used in regression contexts, can be effectively adapted to improve fitness landscape modeling in high-dimensional optimization problems.

Future Directions

The proposed approach demonstrates particular efficacy in sparse and noisy data regimes, which are typical in biological datasets. Future work may explore extending this framework with more sophisticated graph structures or incorporate it into a more extensive pipeline involving experimental validation to continually refine the model's predictions. Additionally, further research could investigate the applicability of this technique to other domains of discrete optimization beyond protein engineering.

GGS represents a promising evolution in computational protein design, offering an effective strategy to cope with the challenges inherent in biochemical complexity. As protein engineering becomes increasingly important in diverse domains of science, methodologies like GGS that enhance computational predictions with robust mathematical frameworks will play pivotal roles in translating computational potential into real-world applications.

Github Logo Streamline Icon: https://streamlinehq.com