Robust Model-Based Optimization for Challenging Fitness Landscapes (2305.13650v3)
Abstract: Protein design, a grand challenge of the day, involves optimization on a fitness landscape. Leading methods adopt a model-based approach in which a model is trained on a training set (protein sequences and their fitness) and proposes candidates to explore next. These methods are challenged by the sparsity of high-fitness samples in the training set, a problem that has been discussed in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum lies in a region that is not only poorly represented in the training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools, and we propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets, as well as on solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://github.com/sabagh1994/PGVAE.
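The generic model-based loop described in the abstract (train a search model on scored samples, then propose the next candidates) can be illustrated with a minimal sketch. This is not the authors' VAE-based method: it assumes a one-dimensional continuous design space, a plain Gaussian as the search model, and a hypothetical oracle `toy_fitness`; all names and parameters below are illustrative only. The training set is deliberately concentrated in a low-fitness region far from the optimum, mimicking the "separation" setting.

```python
import random
import statistics


def toy_fitness(x: float) -> float:
    """Hypothetical oracle with a single peak at x = 3.0 (an assumption for illustration)."""
    return -(x - 3.0) ** 2


def model_based_optimization(train_x, oracle, rounds=10, batch=200, top_frac=0.2, seed=0):
    """Cross-entropy-style loop: refit a Gaussian search model to the best samples each round."""
    rng = random.Random(seed)
    pool = list(train_x)
    for _ in range(rounds):
        # Rank the current pool by oracle fitness and keep the top fraction.
        pool.sort(key=oracle, reverse=True)
        elite = pool[: max(2, int(top_frac * len(pool)))]
        # "Train" the search model: here, simply fit a Gaussian to the elite set.
        mu = statistics.mean(elite)
        sigma = max(statistics.stdev(elite), 1e-3)
        # Propose new candidates from the search model; carry the elite forward
        # so the best sample found so far is never lost.
        pool = elite + [rng.gauss(mu, sigma) for _ in range(batch)]
    return max(pool, key=oracle)


# Training set concentrated around x = -2.0, far from the optimum at x = 3.0.
data_rng = random.Random(1)
train_x = [data_rng.gauss(-2.0, 0.5) for _ in range(200)]
best = model_based_optimization(train_x, toy_fitness)
print(f"best x = {best:.3f}, fitness = {toy_fitness(best):.3f}")
```

Because the elite set is carried forward, the returned sample is never worse than the best training sample. How far such a naive search model moves toward a distant optimum depends on the data and hyperparameters, and that is the kind of behavior the paper's benchmark is designed to probe.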
- Saba Ghaffari
- Ehsan Saleh
- Alexander G. Schwing
- Yu-Xiong Wang
- Martin D. Burke
- Saurabh Sinha