- The paper introduces ProGen, a 1.2-billion-parameter conditional language model trained on roughly 280 million protein sequences, to generate proteins with targeted attributes.
- The paper demonstrates that conditional language modeling enables fine-grained control over protein sequence attributes while preserving structural integrity.
- The paper shows that ProGen can select high-fitness protein variants zero-shot, highlighting its practical potential for accelerating synthetic biology applications.
ProGen: Language Modeling for Protein Generation
The paper introduces ProGen, a large language model designed for protein generation that applies techniques from NLP to challenges in protein engineering. By framing protein engineering as an unsupervised sequence generation problem, the authors leverage the vast and ever-growing datasets of raw protein sequences, which generally lack structural annotations, with the aim of generating sequences that fold into viable structures and perform intended functions.
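As a concrete illustration of this framing, the sketch below shows the next-token cross-entropy objective such a model optimizes over amino-acid tokens. The ToyProteinLM is a deliberately tiny stand-in for ProGen's 1.2-billion-parameter Transformer decoder (it conditions only on the current residue rather than the full prefix), and all names are illustrative, not the authors' code.

```python
# Minimal sketch of autoregressive language modeling over protein sequences.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues as the vocabulary
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq: str) -> torch.Tensor:
    """Map a residue string to integer token ids."""
    return torch.tensor([VOCAB[aa] for aa in seq])

class ToyProteinLM(nn.Module):
    """Tiny stand-in for a large Transformer decoder; predicts residue t+1 from residue t."""
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(tokens))   # (seq_len, vocab_size) logits

seq = encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")   # arbitrary example sequence
model = ToyProteinLM(len(VOCAB))
logits = model(seq[:-1])                             # predict each next residue
loss = nn.functional.cross_entropy(logits, seq[1:])
print(f"next-token cross-entropy: {loss.item():.3f}")
```

Training a real model amounts to minimizing this loss over hundreds of millions of sequences, with conditioning tags prepended as additional tokens.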
Key Contributions
- Model Design and Dataset: ProGen is a conditional language model with 1.2 billion parameters, trained on a dataset of approximately 280 million protein sequences. Each sequence is paired with conditioning tags such as molecular function and taxonomic annotations, enabling the model to harness broad evolutionary sequence diversity and generate proteins with specified attributes.
- Conditional Generation and Performance: Conditional language modeling gives ProGen fine-grained control over sequence attributes during generation (see the sampling sketch after this list). In evaluations based on primary sequence similarity, secondary structure accuracy, and conformational energy, generated sequences score well, indicating the model favors mutations that preserve protein structure.
- Experimentation and Results: Generated proteins exhibited conformational energies close to those of native sequences, suggesting functional plausibility. Notably, completion experiments on the VEGFR2 kinase domain show that ProGen maintains structural integrity across varied generation lengths.
- Zero-shot Applications: The model also supports zero-shot selection of high-fitness protein variants. In an experiment with the GB1 protein, ProGen identified high-binding-affinity sequences among numerous variants without any fine-tuning (see the likelihood-ranking sketch after this list).
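The generation interface can be pictured as prepending conditioning tags to the token stream and sampling residues autoregressively, which also covers completion tasks like the VEGFR2 experiment (a fixed prefix is extended). The sketch below reuses the ToyProteinLM, encode, and AMINO_ACIDS definitions from the earlier objective sketch; in the toy vocabulary, ordinary residue ids stand in for dedicated tag tokens, and the function signature is an assumption, not ProGen's API.

```python
import torch

def sample_conditional(model, tag_ids, prefix_ids, max_new=50, temperature=0.8):
    """Autoregressively extend [conditioning tags] + [prefix] one residue at a time."""
    tokens = torch.cat([tag_ids, prefix_ids])
    with torch.no_grad():
        for _ in range(max_new):
            logits = model(tokens)[-1] / temperature          # logits at the last position
            nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            tokens = torch.cat([tokens, nxt])
    return tokens

tags = encode("HK")        # stand-in control tokens; a real model has a separate tag vocabulary
prefix = encode("MKTAYI")  # residues to complete, as in a domain-completion task
full = sample_conditional(model, tags, prefix)
print("".join(AMINO_ACIDS[i] for i in full.tolist()))
```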
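One plausible way to realize the zero-shot selection is to score each candidate variant by its total log-likelihood under the model and keep the top-ranked sequences. The variants below are toy stand-ins, not actual GB1 mutants, and the scoring function is an assumption about the approach rather than the authors' code.

```python
import torch

def log_likelihood(model, ids):
    """Sum of log p(x_t | x_<t) over a sequence; higher suggests a more plausible variant."""
    with torch.no_grad():
        logp = torch.log_softmax(model(ids[:-1]), dim=-1)
    return logp[torch.arange(len(ids) - 1), ids[1:]].sum().item()

variants = ["MKTAYIAKQR", "MKTAYIGKQR", "MKTAYIAKQK"]   # toy stand-ins
ranked = sorted(variants, key=lambda s: log_likelihood(model, encode(s)), reverse=True)
print("predicted best variant:", ranked[0])
```

Because ranking needs only forward passes, no labeled fitness data or fine-tuning is required, which is what makes the selection zero-shot.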
Theoretical and Practical Implications
The research underscores the potential of large-scale generative models in bioinformatics. ProGen serves both as a foundational step toward integrating machine learning into protein engineering workflows and as a complement to established protein design methodologies such as directed evolution. Practically, the implications for synthetic biology are significant: such models can expedite the development of new enzymes, therapeutic proteins, and biosensors by proposing viable protein candidates and sequence variants without extensive laboratory trial and error.
Future Directions
ProGen's approach opens several avenues for further research, including scaling the model architecture and dataset to improve capacity and accuracy. Future work might integrate structural data more explicitly and refine the conditioning techniques to better tailor protein characteristics. Exploring ProGen's capabilities in rational protein design across a broader range of organisms and protein families also merits attention.
In conclusion, ProGen represents a significant step in applying language-model architectures to protein generation, setting a framework for future innovations at the intersection of NLP, machine learning, and synthetic biology.