Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model (2110.15527v1)

Published 29 Oct 2021 in cs.CL and cs.AI

Abstract: Understanding protein sequences is vital and urgent for biology, healthcare, and medicine. Labeling approaches are expensive yet time-consuming, while the amount of unlabeled data is increasing quite faster than that of the labeled data due to low-cost, high-throughput sequencing methods. In order to extract knowledge from these unlabeled data, representation learning is of significant value for protein-related tasks and has great potential for helping us learn more about protein functions and structures. The key problem in the protein sequence representation learning is to capture the co-evolutionary information reflected by the inter-residue co-variation in the sequences. Instead of leveraging multiple sequence alignment as is usually done, we propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., Pairwise Masked Language Model (PMLM). In a conventional masked language model, the masked tokens are modeled by conditioning on the unmasked tokens only, but processed independently to each other. However, our proposed PMLM takes the dependency among masked tokens into consideration, i.e., the probability of a token pair is not equal to the product of the probability of the two tokens. By applying this model, the pre-trained encoder is able to generate a better representation for protein sequences. Our result shows that the proposed method can effectively capture the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the MLM baseline under the same setting. The proposed model also significantly outperforms the MSA baseline by more than 7% on the TAPE contact prediction benchmark when pre-trained on a subset of the sequence database which the MSA is generated from, revealing the potential of the sequence pre-training method to surpass MSA based methods in general.

Authors (13)

Liang He (202 papers)
Shizhuo Zhang (23 papers)
Lijun Wu (113 papers)
Huanhuan Xia (1 paper)
Fusong Ju (7 papers)
He Zhang (236 papers)
Siyuan Liu (68 papers)
Yingce Xia (53 papers)
Jianwei Zhu (11 papers)
Pan Deng (11 papers)
Bin Shao (61 papers)
Tao Qin (201 papers)
Tie-Yan Liu (242 papers)

Citations (28)


Summary

Assessing the Role of Pairwise Masked Language Models in Protein Representation Learning

The paper "Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model" introduces an approach for learning representations of protein sequences from unlabeled data. Its focus is the co-evolutionary information embedded in protein sequences, reflected in inter-residue co-variation, which is central to understanding protein structure and function.

Core Methodology

Unlike traditional methods that depend on multiple sequence alignment (MSA) for protein representation learning, the authors propose a Pairwise Masked Language Model (PMLM) designed to capture inter-residue co-evolutionary relationships directly from protein sequences. It extends the conventional masked language model (MLM), in which each masked residue is predicted from the unmasked tokens only and independently of the other masked residues. PMLM instead models the dependency among masked tokens: the probability of a pair of amino acids is not assumed to equal the product of the two individual probabilities. Modeling the pair jointly lets the encoder represent co-evolutionary signals that an independent per-token objective cannot express, which the authors report leads to better representations of protein sequences.
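The gap between a joint pair probability and a product of per-token probabilities can be seen on a toy example. The sketch below is only an illustration with made-up data (the four residue pairs are hypothetical and not taken from the paper): two positions co-vary perfectly, so the empirical joint distribution differs sharply from what an independence assumption would give.

```python
from collections import Counter

# Hypothetical toy data: residues observed at two positions (i, j) across
# related sequences. The positions co-vary: A pairs with L, D pairs with K.
pairs = [("A", "L"), ("A", "L"), ("D", "K"), ("D", "K")]

n = len(pairs)
# Empirical joint distribution over residue pairs.
joint = {p: c / n for p, c in Counter(pairs).items()}
# Empirical marginal distributions at each position.
p_i = {a: c / n for a, c in Counter(a for a, _ in pairs).items()}
p_j = {b: c / n for b, c in Counter(b for _, b in pairs).items()}


def independent(a, b):
    """Probability of the pair (a, b) under the independence assumption."""
    return p_i.get(a, 0.0) * p_j.get(b, 0.0)


for a in sorted(p_i):
    for b in sorted(p_j):
        print(a, b, "joint:", joint.get((a, b), 0.0), "product:", independent(a, b))
```

Here the pair (A, L) has joint probability 0.5 but a product of marginals of 0.25, and the pair (A, K) never occurs yet still receives 0.25 under independence. A pairwise objective is meant to let the model learn the first kind of distribution rather than the second.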

For training, PMLM uses a Transformer-based sequence encoder, an architecture well suited to large-scale sequence data. The model has two prediction heads, one for individual tokens and one for token pairs, so that the encoder learns the joint distribution of masked residue pairs. The resulting representations are used for downstream tasks such as residue-residue contact prediction, which benefits from the co-evolutionary information PMLM captures.
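A rough sketch of the two-head layout, in plain Python with random weights, may help fix the shapes involved. Everything here is an assumption made for illustration: the hidden size `D`, the use of simple concatenation to form the pair representation, the 20-letter vocabulary without special tokens, and the helper names are not taken from the paper, and the random vectors only stand in for encoder outputs.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, no special tokens
V = len(AMINO_ACIDS)
D = 8  # hypothetical hidden size, kept tiny for illustration

rng = random.Random(0)


def linear(n_in, n_out):
    """Random weight matrix standing in for a trained linear layer."""
    return [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


W_token = linear(D, V)         # token head: one hidden vector -> V classes
W_pair = linear(2 * D, V * V)  # pair head: two hidden vectors -> V*V classes


def token_probs(h_i):
    """Distribution over single residues at one masked position."""
    return softmax(matvec(W_token, h_i))


def pair_probs(h_i, h_j):
    """Joint distribution over residue pairs at two masked positions.

    Assumption: the pair is represented by concatenating the two vectors.
    """
    return softmax(matvec(W_pair, h_i + h_j))


# Stand-ins for encoder outputs at two masked positions.
h_i = [rng.gauss(0, 1) for _ in range(D)]
h_j = [rng.gauss(0, 1) for _ in range(D)]

p_tok = token_probs(h_i)
p_pair = pair_probs(h_i, h_j)
print(len(p_tok), len(p_pair))
```

The point of the sketch is the output sizes: the token head produces a distribution over 20 residues, while the pair head produces a single distribution over 400 residue pairs instead of two independent 20-way distributions.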

Performance Evaluation

Experiments show improvements over both the MLM baseline and an MSA baseline. The authors report that PMLM improves contact prediction by up to 9% over a standard MLM under the same setting. On the TAPE contact prediction benchmark, PMLM outperforms the MSA baseline by more than 7% when pre-trained on a subset of the sequence database from which the MSA is generated. These results indicate that PMLM captures inter-residue correlations within protein sequences.
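The summary does not say how contact prediction is scored. A common choice for this task is the precision of the top L/k highest-scoring residue pairs, where L is the sequence length; the sketch below assumes that metric. The function name, the `min_sep` default, and the example scores are hypothetical and are not results from the paper.

```python
def precision_at_l_over_k(scores, contacts, seq_len, k=5, min_sep=6):
    """Precision of the top seq_len // k predicted contacts.

    scores:   dict mapping (i, j) with i < j to a predicted contact score
    contacts: set of (i, j) pairs that are true contacts
    min_sep:  minimum sequence separation j - i for a pair to be considered
              (hypothetical default, chosen only for this example)
    """
    candidates = [
        (score, pair)
        for pair, score in scores.items()
        if pair[1] - pair[0] >= min_sep
    ]
    if not candidates:
        return 0.0
    candidates.sort(reverse=True)  # highest score first
    top = candidates[: max(1, seq_len // k)]
    hits = sum(1 for _, pair in top if pair in contacts)
    return hits / len(top)


# Made-up example for a sequence of length 10, so the top 10 // 5 = 2 pairs count.
example_scores = {(0, 8): 0.9, (1, 9): 0.8, (0, 7): 0.3, (2, 3): 0.99}
example_contacts = {(0, 8), (0, 7)}
print(precision_at_l_over_k(example_scores, example_contacts, seq_len=10))
```

In this example the pair (2, 3) is dropped for being too close in sequence, and of the two remaining top-scoring pairs only (0, 8) is a true contact, giving a precision of 0.5.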

Further comparisons across pre-training datasets, including Pfam and UR50, indicate that the information extracted by PMLM is robust across diverse sequences. Results for the larger models, PMLM-large and PMLM-xl, suggest that increasing model and dataset size improves the learned representations.

Implications and Speculative Future Directions

The paper highlights several practical implications of its findings, particularly in areas such as protein structure prediction and bioinformatic applications where sequence representation is crucial. The PMLM framework could enable more efficient exploration and exploitation of the vast and growing repositories of unannotated protein sequences, advancing the fields of drug design, molecular biology, and personalized medicine.

On the theoretical side, this work suggests exploring model architectures that capture interactions among more than two residues. Triple or higher-order masked language models could be a next step, although they would require careful handling of computational cost, because both the number of residue tuples and the size of the joint label space grow combinatorially with the order of the interaction.
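The growth in cost can be made concrete with a little arithmetic. The sketch below assumes a 20-residue vocabulary (special tokens ignored) and a hypothetical 30 masked positions per sequence; neither number is taken from the paper, and `head_cost` is an illustrative helper.

```python
from math import comb

V = 20  # standard amino acids, special tokens ignored (assumption)


def head_cost(order, n_masked):
    """Return (joint classes per tuple, number of tuples among masked positions)."""
    return V ** order, comb(n_masked, order)


# Hypothetical setting: 30 masked positions in one sequence.
for order in (1, 2, 3):
    n_classes, n_tuples = head_cost(order, 30)
    print(f"order {order}: {n_classes} classes, {n_tuples} tuples")
```

Under these assumptions, moving from pairs to triples raises the label space from 400 to 8000 classes and the number of masked tuples from 435 to 4060, which is why a higher-order head would likely need sampling or factorization to stay tractable.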

Conclusion

In summary, the paper introduces a methodological shift in protein sequence representation learning through the Pairwise Masked Language Model. By capturing co-evolutionary information through pre-training on raw sequences alone, without MSA, the authors offer a path toward better understanding and prediction of protein functions and interactions. Open questions remain about how well such models scale and how they apply to more complex biological systems.
