Protein Discovery with Discrete Walk-Jump Sampling (2306.12360v2)

Published 8 Jun 2023 in q-bio.BM and cs.LG

Abstract: We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97-100% of generated samples are successfully expressed and purified and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We also report the first demonstration of long-run fast-mixing MCMC chains where diverse antibody protein classes are visited in a single MCMC chain.

Summary

  • The paper introduces Discrete Walk-Jump Sampling (dWJS), which decouples energy-based and score-based training for efficient protein sequence generation.
  • It pairs Langevin MCMC on a smoothed data distribution with one-step denoising to enhance sample fidelity and enable robust exploration of antibody sequence space.
  • In wet-lab experiments, 97-100% of generated samples were expressed and purified, and 70% of functional designs bound as well as or better than known antibodies, underscoring the method's promise for therapeutic design.

Insights into Protein Discovery with Discrete Walk-Jump Sampling

The paper "Protein Discovery with Discrete Walk-Jump Sampling" introduces an innovative method for generating discrete protein sequences, particularly antibodies, which are significant in therapeutic domains. The proposed method, Smoothed Discrete Sampling (SDS), aims to enhance the sampling efficiency and robustness of discrete generative models through a refined process of decoupling the learning of energy-based and score-based models. This method, termed Discrete Walk-Jump Sampling (dWJS), suggests significant improvements over existing techniques such as autoregressive and diffusion models.

Core Methodology

The central idea of this work is to combine contrastive-divergence training of energy-based models (EBMs) with the sample quality typically attributed to score-based models. A key simplification is the use of a single noise level, which keeps both training and sampling computationally efficient. Sampling proceeds in two phases: a "walk" phase that runs Langevin Markov chain Monte Carlo (MCMC) over the smoothed data distribution, and a "jump" phase that applies one-step denoising to project samples back onto the true data manifold, enhancing both the robustness and fidelity of protein generation.
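
The following sketch illustrates the walk-jump loop in PyTorch. It is a minimal simplification, not the authors' implementation: `denoiser` is assumed to be a trained network that maps a noisy one-hot sequence tensor to an estimate of the clean sequence (so the smoothed score follows from Miyasawa's empirical-Bayes estimator), the overdamped Langevin update stands in for whatever integrator the paper actually uses, and the hyperparameter values are placeholders.

```python
import torch

VOCAB = 20  # amino-acid alphabet; sequences are one-hot tensors of shape (L, 20)

@torch.no_grad()
def walk_jump_sample(denoiser, seq_len, sigma=0.5, n_steps=200, step=0.01):
    # Start the chain from the smoothed prior: isotropic Gaussian noise.
    y = sigma * torch.randn(1, seq_len, VOCAB)

    # "Walk": overdamped Langevin MCMC on the sigma-smoothed density.
    # With a denoiser d(y), Miyasawa's estimator gives the smoothed score
    # as (d(y) - y) / sigma**2.
    for _ in range(n_steps):
        score = (denoiser(y) - y) / sigma**2
        y = y + 0.5 * step * score + step**0.5 * torch.randn_like(y)

    # "Jump": one denoising step projects back to the clean manifold;
    # an argmax over the vocabulary axis recovers a discrete sequence.
    return denoiser(y).argmax(dim=-1)
```

Note that the jump comes essentially for free: the same denoiser that supplies the score during the walk also performs the final projection, which is why a single noise level suffices.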

Numerical Results and Key Observations

The authors report strong results in antibody generation. Notably, 97-100% of generated samples were expressed and purified successfully, while 70% of functional designs exhibited equal or superior binding affinity compared to known functional antibodies. Such results indicate that dWJS is adept at navigating the vast antibody sequence space of size 20^L, where L is the protein length.
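
To make that scale concrete, a back-of-the-envelope calculation (the length L = 120 is an illustrative value, roughly that of an antibody heavy-chain variable domain, and not a figure from the paper):

```python
import math

L = 120  # illustrative sequence length, not taken from the paper
print(f"20^{L} ≈ 10^{L * math.log10(20):.0f}")  # -> 20^120 ≈ 10^156
```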

The paper also reports the first demonstration of long-run, fast-mixing MCMC chains in this setting, with a diverse array of antibody classes visited within a single chain. This property is vital for exploring the sequence space efficiently, potentially reducing the computational overhead and time required for productive sample generation.
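
One simple way to probe mixing empirically (a hypothetical diagnostic, not the paper's evaluation protocol) is to take sequence snapshots at fixed intervals along a single chain and track their pairwise distances; high, stable diversity suggests the chain is not trapped in a single mode:

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Position-wise mismatches between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def chain_diversity(snapshots: list[str]) -> float:
    """Mean pairwise Hamming distance over snapshots from one chain."""
    pairs = list(combinations(snapshots, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)
```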

Implications and Speculations

The implications of this research are significant for therapeutic protein design. Practically, it enables more efficient generation of candidate antibodies, which can reduce the cost and time of drug discovery pipelines. Theoretically, it advances our understanding of how to integrate EBMs and score-based models to tackle discrete generation problems effectively.

Extending this approach to other classes of molecules, or to other discrete data modalities such as structured text, could reshape the generative modeling landscape. It also invites further investigation into reducing noise levels and adapting these models to structural generation tasks beyond linear sequences.

Conclusion

This paper presents a significant advance in discrete generative modeling for protein discovery, offering a sophisticated yet computationally efficient approach to generating high-quality, functional protein sequences. By integrating theoretical and empirical insights, the dWJS method stands as a valuable framework for future exploration and application in AI-driven protein engineering and design, with potential cross-disciplinary impact.
