
Robust Model-Based Optimization for Challenging Fitness Landscapes (2305.13650v3)

Published 23 May 2023 in cs.LG and cs.AI

Abstract: Protein design, a grand challenge of the day, involves optimization on a fitness landscape. Leading methods adopt a model-based approach, in which a model is trained on a training set (protein sequences and their fitness) and proposes candidates to explore next. These methods are challenged by the sparsity of high-fitness samples in the training set, a problem already recognized in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum lies in a region that is not only poorly represented in the training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools, and we propose a new approach that uses a novel VAE as its search model to overcome it. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets, as well as on solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://github.com/sabagh1994/PGVAE.
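
To make the two properties named in the abstract concrete, the following is a minimal illustrative sketch and not the paper's method or benchmark. It builds a hypothetical one-dimensional training set in which low-fitness samples dominate (imbalance) and the rare high-fitness samples sit far away in the design space (separation). The function names, the Gaussian clusters, and the 0/1 fitness values are all assumptions chosen for illustration.

```python
import random


def make_toy_training_set(n_low=990, n_high=10, separation=4.0, seed=0):
    """Hypothetical 1-D design space (an assumption, not the paper's data).

    Low-fitness samples cluster around x = 0; the rare high-fitness
    samples cluster around x = `separation`.
    Returns a list of (x, fitness) pairs.
    """
    rng = random.Random(seed)
    low = [(rng.gauss(0.0, 1.0), 0.0) for _ in range(n_low)]
    high = [(rng.gauss(separation, 1.0), 1.0) for _ in range(n_high)]
    return low + high


def imbalance_and_separation(samples, threshold=0.5):
    """Summarize a training set by two illustrative statistics.

    imbalance:  fraction of samples whose fitness is at or above `threshold`
    separation: distance between the mean positions of the two groups
    """
    low_x = [x for x, f in samples if f < threshold]
    high_x = [x for x, f in samples if f >= threshold]
    imbalance = len(high_x) / len(samples)
    mean_low = sum(low_x) / len(low_x)
    mean_high = sum(high_x) / len(high_x)
    return imbalance, abs(mean_high - mean_low)


data = make_toy_training_set()
imbalance, separation = imbalance_and_separation(data)
print(f"high-fitness fraction: {imbalance:.3f}, separation: {separation:.2f}")
```

In this toy setting, lowering `n_high` increases the imbalance and raising `separation` moves the high-fitness region further from the bulk of the data, which are the two axes along which the abstract says existing model-based optimizers struggle.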

Authors (6)
  1. Saba Ghaffari
  2. Ehsan Saleh
  3. Alexander G. Schwing
  4. Yu-Xiong Wang
  5. Martin D. Burke
  6. Saurabh Sinha
