Improving Protein Optimization with Smoothed Fitness Landscapes (2307.00494v3)
Abstract: The ability to engineer novel proteins with higher fitness for a desired property would be revolutionary for biotechnology and medicine. Modeling the combinatorially large space of sequences is infeasible; prior methods often constrain optimization to a small mutational radius, but this drastically limits the design space. Instead of heuristics, we propose smoothing the fitness landscape to facilitate protein optimization. First, we formulate protein fitness as a graph signal, then use Tikhonov regularization to smooth the fitness landscape. We find that optimizing in this smoothed landscape leads to improved performance across multiple methods in the GFP and AAV benchmarks. Second, we achieve state-of-the-art results utilizing discrete energy-based models and MCMC in the smoothed landscape. Our method, called Gibbs sampling with Graph-based Smoothing (GGS), demonstrates a unique ability to achieve a 2.5-fold fitness improvement (with in silico evaluation) over its training set. GGS demonstrates potential to optimize proteins in the limited data regime. Code: https://github.com/kirjner/GGS
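To make the smoothing step concrete, here is a minimal sketch of Tikhonov regularization of a graph signal in its standard form: given noisy fitness values y on the nodes of a graph with Laplacian L, the smoothed signal minimizes ||s - y||^2 + gamma * s^T L s, which has the closed form s = (I + gamma L)^{-1} y. The abstract does not specify how the graph is built or how gamma is chosen, so the Hamming-distance edge rule, the `max_dist` and `gamma` parameters, and the function name `tikhonov_smooth` are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import numpy as np


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def tikhonov_smooth(seqs, fitness, gamma=1.0, max_dist=1):
    """Smooth a fitness signal over a sequence graph (illustrative sketch).

    Assumption: two sequences are connected if their Hamming distance
    is at most `max_dist`. The paper may construct its graph differently.
    """
    n = len(seqs)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(seqs[i], seqs[j]) <= max_dist:
                adj[i, j] = adj[j, i] = 1.0

    # Combinatorial graph Laplacian L = D - A.
    lap = np.diag(adj.sum(axis=1)) - adj

    # Closed-form minimizer of ||s - y||^2 + gamma * s^T L s.
    y = np.asarray(fitness, dtype=float)
    return np.linalg.solve(np.eye(n) + gamma * lap, y)


# Toy example: four sequences forming a path graph, with an oscillating signal.
seqs = ["AAA", "AAC", "ACC", "CCC"]
fitness = [0.0, 1.0, 0.0, 1.0]
smoothed = tikhonov_smooth(seqs, fitness, gamma=1.0)
print(smoothed)
```

Larger `gamma` pulls neighboring sequences toward a common value, while `gamma = 0` returns the original signal unchanged. The dense solve is only practical for small graphs; a sparse solver would be the natural substitute at realistic dataset sizes.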