Protein Discovery with Discrete Walk-Jump Sampling (2306.12360v2)
Abstract: We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model with the improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97–100% of generated samples are successfully expressed and purified, and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We also report the first demonstration of long-run fast-mixing MCMC chains in which diverse antibody protein classes are visited in a single MCMC chain.
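The walk-jump procedure the abstract describes can be sketched in a toy one-dimensional setting where the smoothed score is known in closed form. This is an illustrative sketch, not the paper's implementation: the function names, the Gaussian toy data, and all hyperparameters here are assumptions chosen so the smoothed score has an analytic form.

```python
import numpy as np

def walk_jump_sample(score_y, sigma, n_steps=50_000, step=0.2, seed=0):
    """Toy sketch of walk-jump sampling in one continuous dimension.

    Walk: Langevin MCMC on the sigma-smoothed density p(y), where y = x + sigma*eps.
    Jump: one-step denoising with the empirical-Bayes (Miyasawa) estimator
          x_hat = y + sigma^2 * grad log p(y).
    """
    rng = np.random.default_rng(seed)
    y = 0.0                                  # initialize the chain
    walked = np.empty(n_steps)
    for t in range(n_steps):
        # Euler-Maruyama Langevin step on the smoothed manifold
        y = y + 0.5 * step * score_y(y) + np.sqrt(step) * rng.standard_normal()
        walked[t] = y
    # Jump: project every walk sample back toward the clean data manifold
    jumped = walked + sigma**2 * score_y(walked)
    return walked, jumped

# Toy setup: clean data x ~ N(0, 1), so the smoothed variable y ~ N(0, 1 + sigma^2)
# and its score grad log p(y) = -y / (1 + sigma^2) is available in closed form.
sigma = 1.0
score_y = lambda y: -y / (1.0 + sigma**2)
walked, jumped = walk_jump_sample(score_y, sigma)
```

After burn-in, the walk samples concentrate on the smoothed density (variance near 1 + sigma^2), and the jump contracts each sample back toward the clean density in a single step. In the paper's actual setting the analytic score is replaced by a learned smoothed score over discrete protein sequences, but the two-step walk-then-jump structure is the same.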