
Learning protein sequence embeddings using information from structure (1902.08661v2)

Published 22 Feb 2019 in cs.LG, q-bio.BM, and stat.ML

Abstract: Inferring the structural properties of a protein from its amino acid sequence is a challenging yet important problem in biology. Structures are not known for the vast majority of protein sequences, but structure is critical for understanding function. Existing approaches for detecting structural similarity between proteins from sequence are unable to recognize and exploit structural patterns when sequences have diverged too far, limiting our ability to transfer knowledge between structurally related proteins. We newly approach this problem through the lens of representation learning. We introduce a framework that maps any protein sequence to a sequence of vector embeddings --- one per amino acid position --- that encode structural information. We train bidirectional long short-term memory (LSTM) models on protein sequences with a two-part feedback mechanism that incorporates information from (i) global structural similarity between proteins and (ii) pairwise residue contact maps for individual proteins. To enable learning from structural similarity information, we define a novel similarity measure between arbitrary-length sequences of vector embeddings based on a soft symmetric alignment (SSA) between them. Our method is able to learn useful position-specific embeddings despite lacking direct observations of position-level correspondence between sequences. We show empirically that our multi-task framework outperforms other sequence-based methods and even a top-performing structure-based alignment method when predicting structural similarity, our goal. Finally, we demonstrate that our learned embeddings can be transferred to other protein sequence problems, improving the state-of-the-art in transmembrane domain prediction.

Authors (2)
  1. Tristan Bepler (10 papers)
  2. Bonnie Berger (29 papers)
Citations (267)

Summary

Learning Protein Sequence Embeddings via Structural Information

The paper introduces a framework for embedding protein sequences using structural information, a critical but often missing ingredient in understanding protein function. Addressing the challenge of inferring structural properties from sequence alone, the authors propose a bidirectional LSTM (biLSTM) model that produces one embedding per amino acid position, each encoding structural information. These embeddings are valuable for predicting structural similarity between divergent protein sequences and generalize to other bioinformatics tasks such as transmembrane domain prediction.

Core Contributions

The authors concentrate on leveraging structural information to inform sequence embeddings, an area with significant potential given how poorly sequence similarity tracks structural similarity. Although structure is pivotal for understanding function, experimentally determining it for every protein is impractical, and existing computational approaches falter once sequences diverge too far, hindering the identification of structurally similar proteins. The proposed framework closes this gap by mapping sequences into an embedding space in which residue-level structural information is encoded.
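As a rough sketch of the idea of position-wise embeddings from a bidirectional recurrent encoder, the snippet below uses a plain tanh biRNN standing in for the paper's stacked biLSTM; the weight matrices (`Wf`, `Uf`, `Wb`, `Ub`) are hypothetical names, not the paper's parameters.

```python
import numpy as np

def bi_rnn_embed(x, Wf, Uf, Wb, Ub):
    """Minimal bidirectional RNN (tanh cells, hypothetical weights) producing
    one embedding per sequence position, as a stand-in for a biLSTM.

    x: (n, d_in) input features (e.g. one-hot amino acids).
    Returns: (n, 2*d_h) per-position embeddings (forward || backward states).
    """
    n, _ = x.shape
    d_h = Wf.shape[1]
    hf = np.zeros((n, d_h))
    hb = np.zeros((n, d_h))
    h = np.zeros(d_h)
    for t in range(n):                       # left-to-right pass
        h = np.tanh(x[t] @ Wf + h @ Uf)
        hf[t] = h
    h = np.zeros(d_h)
    for t in reversed(range(n)):             # right-to-left pass
        h = np.tanh(x[t] @ Wb + h @ Ub)
        hb[t] = h
    return np.concatenate([hf, hb], axis=1)  # (n, 2*d_h)
```

Because the two passes are concatenated, each position's vector carries context from both its left and right neighbors, which is what lets residue-level structural signals be encoded.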

The approach trains a biLSTM model with a two-part feedback mechanism that combines global structural similarity between proteins with pairwise residue contact maps for individual proteins. A novel soft symmetric alignment (SSA) method enables comparison of embedding sequences of arbitrary lengths, enhancing the utility of these embeddings for transferring structural knowledge across proteins.
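The SSA similarity described above can be sketched directly: compute pairwise L1 distances between the two embedding sequences, take softmax attention in each direction, combine the two attention maps symmetrically, and return the negative weighted average distance. This follows the SSA definition in the paper, though the unstabilized softmax here is a simplification.

```python
import numpy as np

def soft_symmetric_alignment(z1, z2):
    """Soft symmetric alignment similarity between two embedding sequences.

    z1: (n, d) and z2: (m, d) arrays of per-residue embeddings.
    Returns a scalar similarity (higher = more similar, at most 0).
    """
    # pairwise L1 distances d[i, j] = ||z1_i - z2_j||_1, shape (n, m)
    d = np.abs(z1[:, None, :] - z2[None, :, :]).sum(axis=-1)
    e = np.exp(-d)
    alpha = e / e.sum(axis=1, keepdims=True)  # softmax over positions of z2
    beta = e / e.sum(axis=0, keepdims=True)   # softmax over positions of z1
    w = alpha + beta - alpha * beta           # symmetric alignment weights
    return -(w * d).sum() / w.sum()           # negative expected distance
```

Identical sequences score near zero, and the score decreases as the embeddings drift apart; crucially, no position-level correspondence between the sequences is required.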

Numerical Results and Methodological Insights

Empirically, the framework outperforms other sequence-based methods and even surpasses TM-align, a top-performing structure-based alignment method, at predicting SCOP hierarchical similarity. Notably, on structural similarity classification within the SCOPe ASTRAL dataset, the model significantly boosts prediction accuracy and correlation with structural similarity levels.

Additionally, ablation studies revealed that the contact prediction and language model components are pivotal for embedding quality: including them strengthened the framework's capacity to predict both global structural similarity and local secondary structure.
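The contact prediction component scores every residue pair from the two positions' embeddings. The sketch below is a hypothetical pairwise head (feature choice and parameter names are assumptions, not the paper's exact architecture): it builds symmetric pair features from the absolute difference and elementwise product of embeddings, then maps them to a contact logit.

```python
import numpy as np

def contact_logits(h, W, b):
    """Hypothetical pairwise contact head.

    h: (n, d) per-residue embeddings; W: (2*d,) weights; b: scalar bias.
    Returns an (n, n) matrix of contact logits, symmetric by construction.
    """
    diff = np.abs(h[:, None, :] - h[None, :, :])   # (n, n, d), symmetric in i, j
    prod = h[:, None, :] * h[None, :, :]           # (n, n, d), symmetric in i, j
    feats = np.concatenate([diff, prod], axis=-1)  # (n, n, 2*d)
    return feats @ W + b                           # (n, n) logits

def sigmoid(x):
    """Map logits to contact probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))
```

Training this head with a binary cross-entropy loss against observed contact maps, jointly with the similarity objective, is what supplies the residue-level supervision signal.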

The framework's embeddings show utility beyond similarity prediction; by enhancing a transmembrane prediction model, the embeddings improve the state-of-the-art results, validating their broader applicability.

Implications and Future Directions

This work demonstrates a significant leap in understanding protein structures from sequences via learned embeddings. By integrating SSA and multitask learning, there is an opportunity to bridge the existing chasm between sequence alignment and structural homology, making these embeddings particularly valuable for tasks demanding sequence-to-structure mappings.

Future directions could explore optimizing these embeddings for multi-domain proteins or expanding use cases to other domains like active site prediction or protein-protein interaction analyses. The model's flexibility also suggests potential applications in non-biological sequence modeling tasks, broadening its impact in the field of representation learning.

In summary, this framework provides a robust means of predicting structural similarity directly from sequence, aiding functional inference and enabling downstream predictive tasks with higher accuracy. It is a notable step in applying representation learning to real-world biological challenges, exploiting the synergy of structural and sequence information to enhance understanding and prediction in proteomics.
