Protein Discovery with Discrete Walk-Jump Sampling (2306.12360v2)

Published 8 Jun 2023 in q-bio.BM and cs.LG

Abstract: We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97-100% of generated samples are successfully expressed and purified and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We also report the first demonstration of long-run fast-mixing MCMC chains where diverse antibody protein classes are visited in a single MCMC chain.

Summary

  • The paper introduces Discrete Walk-Jump Sampling (dWJS), which decouples energy-based and score-based training for efficient protein sequence generation.
  • It pairs Langevin MCMC on a smoothed data distribution with one-step denoising to enhance sample fidelity and enable robust exploration of antibody sequence space.
  • In wet-lab experiments, 97-100% of generated samples were expressed and purified, and 70% of functional designs bound as well as or better than known antibodies, underscoring the method's promise for therapeutic design.

Insights into Protein Discovery with Discrete Walk-Jump Sampling

The paper "Protein Discovery with Discrete Walk-Jump Sampling" introduces an innovative method for generating discrete protein sequences, particularly antibodies, which are significant in therapeutic domains. The proposed method, Smoothed Discrete Sampling (SDS), aims to enhance the sampling efficiency and robustness of discrete generative models through a refined process of decoupling the learning of energy-based and score-based models. This method, termed Discrete Walk-Jump Sampling (dWJS), suggests significant improvements over existing techniques such as autoregressive and diffusion models.

Core Methodology

The central idea of this work is to combine contrastive-divergence training of energy-based models (EBMs) with the sample quality typically attributed to score-based models. A key simplification is the use of a single noise level, which keeps both training and sampling computationally efficient. Sampling proceeds in two phases: a "walk" phase that runs Langevin Markov chain Monte Carlo (MCMC) over the smoothed data distribution, and a "jump" phase that applies one-step denoising to project samples back onto the true data manifold, enhancing both the robustness and fidelity of protein generation.
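
The following sketch illustrates the walk-jump loop in PyTorch. It is a minimal simplification, not the authors' implementation: `denoiser` is assumed to be a trained network that maps a noisy one-hot sequence tensor to an estimate of the clean sequence (so the smoothed score follows from Miyasawa's empirical-Bayes estimator), the overdamped Langevin update stands in for whatever integrator the paper actually uses, and the hyperparameter values are placeholders.

```python
import torch

VOCAB = 20  # amino-acid alphabet; sequences are one-hot tensors of shape (L, 20)

@torch.no_grad()
def walk_jump_sample(denoiser, seq_len, sigma=0.5, n_steps=200, step=0.01):
    # Start the chain from the smoothed prior: isotropic Gaussian noise.
    y = sigma * torch.randn(1, seq_len, VOCAB)

    # "Walk": overdamped Langevin MCMC on the sigma-smoothed density.
    # With a denoiser d(y), Miyasawa's estimator gives the smoothed score
    # as (d(y) - y) / sigma**2.
    for _ in range(n_steps):
        score = (denoiser(y) - y) / sigma**2
        y = y + 0.5 * step * score + step**0.5 * torch.randn_like(y)

    # "Jump": one denoising step projects back to the clean manifold;
    # an argmax over the vocabulary axis recovers a discrete sequence.
    return denoiser(y).argmax(dim=-1)
```

Note that the jump comes essentially for free: the same denoiser that supplies the score during the walk also performs the final projection, which is why a single noise level suffices.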

Numerical Results and Key Observations

The authors report strong results in antibody generation. Notably, 97-100% of generated samples were expressed and purified successfully, while 70% of functional designs exhibited equal or superior binding affinity compared to known functional antibodies. Such results indicate that dWJS is adept at navigating the vast antibody sequence space of size 20^L, where L is the protein length.
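
To make that scale concrete, a back-of-the-envelope calculation (the length L = 120 is an illustrative value, roughly that of an antibody heavy-chain variable domain, and not a figure from the paper):

```python
import math

L = 120  # illustrative sequence length, not taken from the paper
print(f"20^{L} ≈ 10^{L * math.log10(20):.0f}")  # -> 20^120 ≈ 10^156
```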

The paper also reports the first demonstration of long-run, fast-mixing MCMC chains in this setting, with a diverse array of antibody classes visited within a single chain. This property is vital for exploring the sequence space efficiently, potentially reducing the computational overhead and time required for productive sample generation.
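
One simple way to probe mixing empirically (a hypothetical diagnostic, not the paper's evaluation protocol) is to take sequence snapshots at fixed intervals along a single chain and track their pairwise distances; high, stable diversity suggests the chain is not trapped in a single mode:

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Position-wise mismatches between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def chain_diversity(snapshots: list[str]) -> float:
    """Mean pairwise Hamming distance over snapshots from one chain."""
    pairs = list(combinations(snapshots, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)
```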

Implications and Speculations

The implications of this research are significant for therapeutic protein design. Practically, it enables more efficient generation of candidate antibodies, which can reduce the cost and time of drug discovery pipelines. Theoretically, it advances our understanding of how to integrate EBMs and score-based models to tackle discrete generation problems effectively.

Extending this approach to other classes of molecules, or to other discrete data modalities such as structured text, could reshape the generative modeling landscape. It also invites further investigation into reducing noise levels and adapting these models to structural generation tasks beyond linear sequences.

Conclusion

This paper presents a significant advance in discrete generative modeling for protein discovery, offering a sophisticated yet computationally efficient approach to generating high-quality, functional protein sequences. By integrating theoretical and empirical insights, the dWJS method stands as a valuable framework for future exploration and application in AI-driven protein engineering and design, with potential cross-disciplinary impact.
