Leveraging unpaired sequences in all-atom protein co-generation

Develop methods to leverage unpaired protein sequences (sequences without associated structures) within all-atom protein generative models such as Protpardelle and P(all-atom), so that these models can be trained and used for generation effectively when only sequence data is available.

Background

All-atom co-generation directly models all backbone and side-chain atoms. It thereby co-generates sequence and structure implicitly and offers the fine-grained control that is crucial for tasks like enzyme and antibody design. Recent methods include Protpardelle, which diffuses over a fixed atom73 representation and decodes the sequence partway through the generation process, and P(all-atom), which uses an atom14 representation and decodes the sequence after generation.
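To make the atom14 idea concrete, here is a minimal sketch in plain Python. It is not taken from either method's code: the function name `atom14_mask` and the list-of-lists layout are illustrative assumptions, and only the heavy-atom counts are standard residue chemistry. It shows what a sequence alone determines in such a representation, namely which atom slots are occupied, but not where those atoms sit in space.

```python
# Heavy (non-hydrogen) atoms per residue, counting backbone N, CA, C, O
# and excluding the terminal OXT. Tryptophan's 14 is the maximum,
# which is where the name "atom14" comes from.
HEAVY_ATOMS = {
    "G": 4, "A": 5, "S": 6, "C": 6, "V": 7, "T": 7, "P": 7,
    "I": 8, "L": 8, "D": 8, "N": 8, "M": 8, "E": 9, "Q": 9,
    "K": 9, "H": 10, "F": 11, "R": 11, "Y": 12, "W": 14,
}
MAX_SLOTS = 14


def atom14_mask(sequence):
    """Return one row of 14 booleans per residue; True marks an occupied slot.

    Hypothetical helper for illustration only.
    """
    return [
        [slot < HEAVY_ATOMS[aa] for slot in range(MAX_SLOTS)]
        for aa in sequence
    ]


mask = atom14_mask("GAW")
occupied = [sum(row) for row in mask]
print(occupied)  # [4, 5, 14]
```

A paired training example would additionally carry a 3D coordinate for every occupied slot. An unpaired sequence supplies only the mask, which is one way to see why it does not slot directly into a model defined over atom coordinates.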

Wang et al. highlight two obstacles: a computational challenge arising from the high dimensionality of all-atom representations, and a key data challenge. Current all-atom approaches offer no clear pathway for exploiting the vast corpus of sequences that lack paired structures, which limits training scale and applicability. How to incorporate such unpaired sequences is stated explicitly as an unresolved question.

References

In addition, it is unclear how to leverage sequences without structures in an all-atom model.

Towards deep learning sequence-structure co-generation for protein design (2410.01773 - Wang et al., 2 Oct 2024) in Section 3.4 All-atom co-generation models