Atom-by-atom protein generation and beyond with language models (2308.09482v1)
Abstract: Protein LLMs learn powerful representations directly from sequences of amino acids. However, they are constrained to generate proteins using only the set of amino acids represented in their vocabulary. In contrast, chemical LLMs learn atom-level representations of small molecules that include every atom, bond, and ring. In this work, we show that chemical LLMs can learn atom-level representations of proteins, enabling protein generation unconstrained by the standard genetic code and far beyond it. In doing so, we show that LLMs can generate entire proteins atom by atom -- effectively learning the multiple hierarchical layers of molecular information that define proteins, from their primary sequence to their secondary and tertiary structure. We demonstrate that LLMs are able to explore beyond protein space -- generating proteins with modified sidechains that form unnatural amino acids. Even further, we find that LLMs can explore chemical space and protein space simultaneously and generate novel examples of protein-drug conjugates. The results demonstrate the potential for biomolecular design at the atom level using LLMs.
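The contrast the abstract draws -- one token per amino acid versus one token per atom or bond -- can be made concrete with a small sketch. The snippet below builds an atom-level (SMILES-style) string for a short peptide from its residue sequence; this is the kind of input a chemical LLM would tokenize, as opposed to the residue-level alphabet of a protein LLM. The fragment table and function name are illustrative assumptions, not the paper's actual pipeline or vocabulary.

```python
# Hypothetical sketch: expanding an amino-acid sequence into an atom-level
# SMILES string. Each residue fragment is written so that plain concatenation
# forms the peptide (amide) bonds between consecutive residues.
RESIDUE_SMILES = {
    "G": "NCC(=O)",            # glycine
    "A": "N[C@@H](C)C(=O)",    # alanine
    "S": "N[C@@H](CO)C(=O)",   # serine
}

def peptide_smiles(sequence: str) -> str:
    """Concatenate residue fragments and cap the C-terminus with a hydroxyl."""
    return "".join(RESIDUE_SMILES[aa] for aa in sequence) + "O"

# Residue-level view: "AG"; atom-level view spells out every atom and bond.
print(peptide_smiles("AG"))  # N[C@@H](C)C(=O)NCC(=O)O
```

Because the atom-level string spells out sidechains explicitly, a generative model over such strings can, in principle, emit sidechains outside the 20 canonical residues -- the "unnatural amino acids" the abstract refers to.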
- Daniel Flam-Shepherd (9 papers)
- Kevin Zhu (48 papers)
- Alán Aspuru-Guzik (227 papers)