Atom-by-atom protein generation and beyond with language models (2308.09482v1)

Published 16 Aug 2023 in q-bio.BM and cs.LG

Abstract: Protein language models learn powerful representations directly from sequences of amino acids. However, they are constrained to generate proteins with only the set of amino acids represented in their vocabulary. In contrast, chemical language models learn atom-level representations of smaller molecules that include every atom, bond, and ring. In this work, we show that chemical language models can learn atom-level representations of proteins, enabling protein generation unconstrained by the standard genetic code and far beyond it. In doing so, we show that language models can generate entire proteins atom by atom -- effectively learning the multiple hierarchical layers of molecular information that define proteins, from their primary sequence to their secondary and tertiary structure. We demonstrate that language models are able to explore beyond protein space -- generating proteins with modified sidechains that form unnatural amino acids. Even further, we find that language models can explore chemical space and protein space simultaneously and generate novel examples of protein-drug conjugates. The results demonstrate the potential for biomolecular design at the atom level using language models.
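The core contrast in the abstract, residue-level versus atom-level vocabularies, can be sketched in a few lines. The snippet below is illustrative only and not the paper's method: it compares tokenizing a dipeptide (Gly-Gly) as two amino-acid tokens against tokenizing an assumed SMILES string for the same molecule into atom/bond/branch tokens with a simplified regex tokenizer.

```python
import re

# Residue-level vocabulary: the 20 standard amino acids (one-letter codes),
# the vocabulary a protein language model is constrained to.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def tokenize_residues(seq):
    """Tokenize a protein at the residue level (protein language model view)."""
    return [aa for aa in seq if aa in AMINO_ACIDS]

# Simplified SMILES tokenizer: two-letter elements and bracket atoms first,
# then every remaining single character (atoms, bonds, branches).
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|.")

def tokenize_atoms(smiles):
    """Tokenize a molecule at the atom level (chemical language model view)."""
    return SMILES_TOKEN.findall(smiles)

# Glycylglycine (Gly-Gly) as a residue sequence and as a SMILES string.
residue_tokens = tokenize_residues("GG")
atom_tokens = tokenize_atoms("NCC(=O)NCC(=O)O")

print(residue_tokens)  # 2 residue-level tokens
print(atom_tokens)     # 15 atom/bond/branch tokens
```

An atom-level model sees every atom and bond of the peptide backbone and sidechains, which is what lets it emit tokens outside the 20-letter residue vocabulary (e.g. modified sidechains or a conjugated drug fragment).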

Authors (3)
  1. Daniel Flam-Shepherd (9 papers)
  2. Kevin Zhu (48 papers)
  3. Alán Aspuru-Guzik (227 papers)
Citations (1)
