ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing (2007.06225v3)

Published 13 Jul 2020 in cs.LG, cs.CL, cs.DC, and stat.ML

Abstract: Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble (2-state accuracy Q2=91%). For the per-residue predictions, the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans.

Authors (12)
  1. Ahmed Elnaggar (8 papers)
  2. Michael Heinzinger (1 paper)
  3. Christian Dallago (5 papers)
  4. Ghalia Rihawi (1 paper)
  5. Yu Wang (940 papers)
  6. Llion Jones (16 papers)
  7. Tom Gibbs (13 papers)
  8. Tamas Feher (3 papers)
  9. Christoph Angerer (3 papers)
  10. Martin Steinegger (2 papers)
  11. Debsindhu Bhowmik (8 papers)
  12. Burkhard Rost (5 papers)
Citations (836)

Summary

  • The paper demonstrates that Transformer models applied to protein sequences achieve state-of-the-art secondary structure predictions with Q3 accuracies up to 87%.
  • The methodology leverages both auto-regressive and auto-encoder models trained on massive protein databases using high-performance computing, reducing reliance on evolutionary information.
  • The study highlights the cross-domain potential of NLP techniques in biology, enabling scalable protein analysis and opening avenues for drug discovery and protein design.

Overview and Implications of "ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing"

The paper, authored by Elnaggar et al., explores the application of Transformer-based models, traditionally used in NLP, to protein sequences, substantially advancing computational biology and bioinformatics. It addresses the scalability and efficacy of these models in extracting meaningful biophysical and biochemical features from raw protein sequences.

The authors trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on the UniRef and BFD databases, which together contain up to 393 billion amino acids. The models were trained on the Summit supercomputer using 5616 GPUs and on a TPU Pod with up to 1024 cores, demonstrating the need for high-performance computing (HPC) to handle such vast datasets.
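As a concrete illustration of how the released models can be used, the sketch below loads a ProtT5 encoder through the HuggingFace transformers library and extracts per-residue and per-protein embeddings. The checkpoint name and preprocessing follow the conventions documented in the ProtTrans repository; treat the exact identifiers and shapes as assumptions rather than a definitive recipe.

    # Minimal embedding-extraction sketch (requires torch, transformers, sentencepiece).
    # Checkpoint name and preprocessing follow the ProtTrans README; adjust if the
    # released names differ in your environment.
    import re
    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    model_name = "Rostlab/prot_t5_xl_uniref50"          # released ProtT5-XL-UniRef50 checkpoint
    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name).eval()

    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"       # toy protein sequence
    # ProtT5 expects space-separated residues; rare amino acids (U, Z, O, B) map to X.
    prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

    batch = tokenizer(prepared, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # shape: (1, length + 1, 1024)

    per_residue = hidden[0, : len(sequence)]             # drop the trailing special token
    per_protein = per_residue.mean(dim=0)                # fixed-length 1024-d protein vector

The per_residue tensor feeds per-residue tasks (e.g., secondary structure), while the mean-pooled per_protein vector feeds per-protein tasks (e.g., localization), matching the two families of downstream experiments described below.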

Numerical Results and Claims

The paper validates the embeddings produced by these models through downstream tasks, achieving considerable success. Key numerical results include:

  • Protein Secondary Structure Prediction: The models reached 3-state secondary structure prediction accuracies (Q3) between 81% and 87%, surpassing several state-of-the-art (SOTA) methods without relying on evolutionary information.
  • Protein Localization and Membrane Classification: The models achieved a 10-state sub-cellular localization accuracy (Q10) of 81% and a 2-state membrane vs. water-soluble classification accuracy (Q2) of 91%.

The paper highlights ProtT5's ability to outperform SOTA methods in several scenarios, achieving the highest Q3 values of 81.4% on CASP12 and 84.8% on NEW364, without using multiple sequence alignments (MSAs) or other evolutionary information, which are traditionally considered crucial for such predictions.
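To make the per-residue setup concrete, the sketch below places a small convolutional classifier on top of frozen per-residue embeddings (such as the per_residue tensor from the extraction sketch above). The lightweight two-convolution design reflects the paper's emphasis on simple supervised heads, but the specific layer sizes, kernel widths, and dropout rate here are illustrative assumptions, not the authors' exact configuration.

    import torch
    import torch.nn as nn

    class SecondaryStructureCNN(nn.Module):
        """Light convolutional head over frozen per-residue LM embeddings.

        Layer sizes, kernel width, and dropout are illustrative assumptions,
        not the authors' exact hyper-parameters.
        """
        def __init__(self, embed_dim: int = 1024, hidden: int = 32, n_states: int = 3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(embed_dim, hidden, kernel_size=7, padding=3),
                nn.ReLU(),
                nn.Dropout(0.25),
                nn.Conv1d(hidden, n_states, kernel_size=7, padding=3),
            )

        def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
            # embeddings: (batch, length, embed_dim) from the frozen protein LM
            x = embeddings.transpose(1, 2)        # Conv1d expects (batch, channels, length)
            return self.net(x).transpose(1, 2)    # (batch, length, n_states) logits

    # Usage: logits = SecondaryStructureCNN()(per_residue.unsqueeze(0))
    # Train with nn.CrossEntropyLoss against per-residue 3-state labels (helix/strand/coil).

Because the language model stays frozen, only this small head is trained, which is what keeps inference cheap once the embeddings are computed.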

Implications for Protein Sequence Analysis

The implications of these findings are far-reaching:

  1. Efficiency and Scalability: The elimination of the need for evolutionary information (obtained through MSAs) significantly reduces computational costs and time. The models can predict structural features from single protein sequences, enabling large-scale analyses across entire proteomes (see the per-protein pooling sketch after this list).
  2. Generalization of Transformer Models: The authors demonstrated that models and methodologies from NLP could be effectively applied to biological sequences, suggesting a broader applicability of transfer learning across domains.
  3. Depth of Learned Representations: The analysis of embeddings and attention mechanisms shows that these models capture nuanced biophysical properties and structural motifs, which can be insightful for understanding protein function and interactions.
  4. Potential for New Applications: The embeddings produced by ProtTrans models can be utilized in various bioinformatics applications, including drug discovery, protein design, and understanding disease-related mutations.
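For per-protein tasks such as sub-cellular localization, a fixed-length representation can be obtained by mean-pooling the per-residue embeddings and feeding the result to a small feed-forward classifier. The sketch below illustrates this pattern; the hidden size and dropout rate are assumptions chosen for illustration, while the ten output classes correspond to the localization states reported in the results.

    import torch
    import torch.nn as nn

    class LocalizationHead(nn.Module):
        """Feed-forward classifier over a mean-pooled per-protein embedding.

        Hidden size and dropout are illustrative assumptions; n_classes=10
        matches the ten sub-cellular localization states reported above.
        """
        def __init__(self, embed_dim: int = 1024, hidden: int = 256, n_classes: int = 10):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(embed_dim, hidden),
                nn.ReLU(),
                nn.Dropout(0.25),
                nn.Linear(hidden, n_classes),
            )

        def forward(self, pooled: torch.Tensor) -> torch.Tensor:
            # pooled: (batch, embed_dim), e.g. the mean over residue embeddings
            return self.net(pooled)

    # Usage with the per_protein vector from the extraction sketch:
    # logits = LocalizationHead()(per_protein.unsqueeze(0))

Because each protein collapses to a single 1024-dimensional vector, embeddings for an entire proteome can be precomputed once and reused across many such per-protein classifiers.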

Future Directions

The paper opens several avenues for future research:

  1. Hybrid Approaches: Combining embeddings from ProtTrans models with evolutionary information might further enhance prediction accuracy while maintaining computational efficiency (a simple feature-combination sketch follows this list).
  2. Optimization and Customization: Customizing supervised learning pipelines and exploring new architectures like sparse transformers or transformers with improved locality-sensitive hashing could yield better optimized protein models.
  3. Explainability and Interpretation: Enhancing methods for visualizing and interpreting learned embeddings and attention mechanisms could provide deeper insights into the 'language of life' and the functional roles of proteins.
  4. Dataset Refinement and Expansion: Continued improvement and expansion of datasets, coupled with rigorous redundancy reduction protocols, would ensure that models remain up-to-date and are evaluated on the most relevant and challenging datasets.
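One straightforward way to realize such a hybrid approach, shown purely as an illustration and not something evaluated in the paper, is to concatenate per-residue LM embeddings with per-residue evolutionary features (e.g., a PSSM derived from an MSA) before any downstream prediction head.

    import torch

    def hybrid_residue_features(lm_embeddings: torch.Tensor,
                                evo_profile: torch.Tensor) -> torch.Tensor:
        """Concatenate per-residue LM embeddings with evolutionary features.

        lm_embeddings: (length, 1024) tensor from a frozen protein LM.
        evo_profile:   (length, 20) tensor, e.g. a PSSM derived from an MSA.
        Returns a (length, 1044) feature matrix usable by any downstream head.
        Illustrative sketch only; not an approach evaluated in the paper.
        """
        assert lm_embeddings.shape[0] == evo_profile.shape[0], "sequence length mismatch"
        return torch.cat([lm_embeddings, evo_profile], dim=-1)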

Overall, this work represents a substantial advancement in computational biology, demonstrating the utility of applying modern NLP techniques to protein sequences and paving the way for more efficient and powerful protein analysis tools.