RITA: a Study on Scaling Up Generative Protein Sequence Models (2205.05789v2)

Published 11 May 2022 in q-bio.QM and cs.LG

Abstract: In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.

Summary

The paper introduces the RITA suite, scaling autoregressive models up to 1.2B parameters to enhance protein sequence modeling.
The paper systematically evaluates scaling effects, reporting an exponent of 0.74 in the relationship between model size and loss, and improved performance on enzyme function and zero-shot fitness prediction as models grow.
The paper provides open-access models that promote reproducibility and advance research in protein engineering and computational biology.

An Analytical Overview of RITA: Scaling Generative Protein Sequence Models

The paper, "RITA: a Study on Scaling Up Generative Protein Sequence Models," presents the RITA suite, a collection of autoregressive generative models for protein sequences. The models scale up to 1.2 billion parameters and are trained on over 280 million protein sequences from the UniRef-100 database. The primary contribution is a study of how model capabilities change with scale in the protein domain. The models are evaluated on next amino acid prediction, zero-shot fitness prediction, and enzyme function prediction, which gives a broader picture of how model size influences performance in protein design applications.
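
For readers less familiar with the terminology, an autoregressive model factorizes the probability of a sequence from left to right, and next amino acid prediction is usually scored by perplexity. The block below is the standard textbook formulation, not notation taken from the paper; here $x_i$ is assumed to denote the $i$-th residue of a sequence of length $n$, and $\theta$ the model parameters.

```latex
p_\theta(x_1, \dots, x_n) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \dots, x_{i-1}),
\qquad
\mathrm{PPL} = \exp\!\left( -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i \mid x_1, \dots, x_{i-1}) \right)
```

Under this definition, a model that spreads probability uniformly over the 20 standard amino acids has a perplexity of 20, and lower values indicate better prediction.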

Key Contributions

The paper makes three main contributions:

Introduction of RITA Models: The research introduces RITA models, with sizes scaling up to 1.2 billion parameters, demonstrating improvement in tasks relevant to protein engineering.
Systematic Evaluation of Scaling: The paper presents the first systematic study of scaling in autoregressive transformers within the protein domain, establishing preliminary scaling laws for protein modeling.
Open Access Models: RITA models are published openly, supporting reproducibility and facilitating further exploration by the research community.

Evaluation and Findings

The evaluation spanned multiple datasets to test the models across various contexts. Key results include:

Perplexity Evaluation: RITA models displayed a consistent reduction in perplexity with increasing model size across diverse datasets, demonstrating superior performance compared to existing models like ProtGPT2.
Scaling Behavior: The relationship between model size and protein modeling loss showed an exponent of 0.74, indicating that loss falls more steeply with scale than is typical in natural language processing.
Downstream Task Performance: On zero-shot fitness prediction and enzyme function prediction, the larger RITA models outperformed existing baselines, highlighting their potential for predicting protein function.
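
The three quantities above can be made concrete with a short Python sketch. It is illustrative only and is not the authors' evaluation code. The function names are hypothetical, the power law is assumed to take the common form loss = a * size^(-b), the log-likelihood ratio is one common way to score mutants with autoregressive models (the exact RITA scoring protocol is not given here), and the numbers in the demo are synthetic rather than results from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative log-probability per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by least squares in log-log space; returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(loss) = log(a) - b * log(size), so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


def log_likelihood_ratio(mutant_log_probs, wild_type_log_probs):
    """Zero-shot fitness proxy (assumed): log p(mutant) - log p(wild type)."""
    return sum(mutant_log_probs) - sum(wild_type_log_probs)


# Synthetic demo: a uniform model over 20 amino acids, scored on 5 residues.
uniform = [math.log(1 / 20)] * 5
print("perplexity of a uniform model:", perplexity(uniform))

# Synthetic losses generated from a made-up power law (a=3.0, b=0.5).
sizes = [1e8, 3e8, 1e9]
losses = [3.0 * s ** -0.5 for s in sizes]
a, b = fit_power_law(sizes, losses)
print("fitted a, b:", a, b)

# A mutant whose residues are slightly less likely than the wild type's.
wild_type = [-1.0, -2.0, -1.5]
mutant = [-1.0, -2.5, -1.5]
print("mutant score:", log_likelihood_ratio(mutant, wild_type))
```

A positive log-likelihood ratio means the model finds the mutant more plausible than the wild type, and a negative one means less plausible. Fitting in log-log space turns the power law into a straight line, which is why an ordinary least-squares slope recovers the exponent.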

Discussions and Implications

This research explores the practical implications of deploying large-scale generative models in protein engineering. Faster protein design could benefit applications in synthetic biology and pharmaceuticals, particularly the development of novel therapeutics and enzymes. By showing how model capabilities depend on scale, the paper provides a basis for future work on scaling and applying generative models in biochemistry.

Future Perspectives

The paper suggests several avenues for further research:

In-vitro Validation: Experimentation with RITA-generated proteins in laboratory settings could provide insights into the real-world applicability and efficacy of these models in a biological context.
Integration of Structural Information: Enhancing RITA models with structural protein data could refine fixed-backbone protein design, potentially improving model accuracy in predicting complex protein structures.
Further Scaling: Extending the scaling studies beyond the current parameter count could further elucidate the limits and possibilities inherent in generative protein models.

In conclusion, the RITA suite represents a significant step towards understanding and leveraging the potential of large AI models in computational biology. This work aligns theoretical advancements with practical applications, fostering innovation in protein design and molecular biology through accessible AI technologies.

GitHub

GitHub - lightonai/RITA: RITA is a family of autoregressive protein models, developed by LightOn in collaboration with the OATML group at Oxford and the Debora Marks Lab at Harvard. (97 stars)