Protein Design with Guided Discrete Diffusion (2305.20009v2)
Abstract: A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. The generative model samples plausible sequences while the discriminative model guides a search for sequences with high fitness. Given its broad success in conditional sampling, classifier-guided diffusion modeling is a promising foundation for protein design, leading many to develop guided diffusion models for structure with inverse folding to recover sequences. In this work, we propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models that follows gradients in the hidden states of the denoising network. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods, including scarce data and challenging inverse design. Moreover, we use NOS to generalize LaMBO, a Bayesian optimization procedure for sequence design that facilitates multiple objectives and edit-based constraints. The resulting method, LaMBO-2, enables discrete diffusions and stronger performance with limited edits through a novel application of saliency maps. We apply LaMBO-2 to a real-world protein design task, optimizing antibodies for higher expression yield and binding affinity to several therapeutic targets under locality and developability constraints, attaining a 99% expression rate and 40% binding rate in exploratory in vitro experiments.
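The abstract describes two mechanisms: gradient guidance applied to the hidden states of the denoising network (NOS), and saliency maps used to restrict edits to the most influential sequence positions (LaMBO-2). The sketch below illustrates both ideas on a toy PyTorch denoiser; `ToyDenoiser`, `value_head`, `guided_denoise`, `saliency_positions`, and all hyperparameters are assumptions made for illustration and do not reproduce the paper's architecture, objectives, or training setup.

```python
# Minimal sketch of hidden-state guidance for a discrete (masked) diffusion
# denoiser, in the spirit of NOS as summarized above. ToyDenoiser, value_head,
# guided_denoise, saliency_positions, and the hyperparameters are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, LENGTH = 21, 64, 32  # toy sizes: 20 amino acids + a mask/corruption token

class ToyDenoiser(nn.Module):
    """Tiny transformer-style denoiser: tokens -> hidden states -> token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))  # hidden states the guidance acts on
        return h, self.lm_head(h)

# Hypothetical value head predicting a scalar fitness from pooled hidden states.
value_head = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1))

def guided_denoise(model, tokens, guidance_steps=5, step_size=0.5, kl_weight=1.0):
    """One guided reverse step: nudge hidden states toward higher predicted
    fitness while a KL penalty keeps the decoded distribution near the denoiser's."""
    h0, logits0 = model(tokens)
    log_p0 = F.log_softmax(logits0, dim=-1).detach()
    h = h0.detach().clone().requires_grad_(True)
    for _ in range(guidance_steps):
        fitness = value_head(h.mean(dim=1)).sum()
        kl = F.kl_div(F.log_softmax(model.lm_head(h), dim=-1), log_p0,
                      reduction="batchmean", log_target=True)
        loss = -fitness + kl_weight * kl
        grad, = torch.autograd.grad(loss, h)
        h = (h - step_size * grad).detach().requires_grad_(True)
    return model.lm_head(h).argmax(dim=-1)  # decode the guided tokens

def saliency_positions(model, tokens, k=3):
    """Pick the k positions whose input embeddings most influence predicted
    fitness, a rough analogue of saliency-based edit selection."""
    emb = model.embed(tokens).detach().requires_grad_(True)
    value_head(model.encoder(emb).mean(dim=1)).sum().backward()
    return emb.grad.norm(dim=-1).topk(k, dim=-1).indices  # (batch, k) candidate edit sites

if __name__ == "__main__":
    model = ToyDenoiser()
    noisy = torch.randint(0, VOCAB, (1, LENGTH))  # a corrupted sequence
    print(guided_denoise(model, noisy).shape)     # torch.Size([1, 32])
    print(saliency_positions(model, noisy))       # indices of top-3 candidate edit positions
```

In this sketch the guidance objective trades predicted fitness against a KL term that keeps the guided logits close to the unguided denoiser's output, mirroring the usual balance between optimizing a discriminative objective and staying on the generative model's data manifold during guided sampling.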