Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks (2012.03084v1)
Abstract: Less than 1% of protein sequences are structurally and functionally annotated. The NLP community has recently embraced self-supervised learning as a powerful approach for learning representations from unlabeled text, in large part due to attention-based, context-aware Transformer models. In this work we present a modification to the RoBERTa model: during pre-training, the input is a mixture of binding and non-binding protein sequence pairs (from the STRING database). Crucially, the sequence pairs carry no label indicating their binding status; the model relies solely on the Masked Language Modeling (MLM) objective during pre-training. After fine-tuning, this approach surpasses models trained on single protein sequences in protein-protein binding prediction, TCR-epitope binding prediction, cellular localization, and remote homology classification. We suggest that the Transformer's attention mechanism contributes to protein binding site discovery. Furthermore, we compress protein sequences by 64% with a Byte Pair Encoding (BPE) vocabulary of 10K subwords, each around 3-4 amino acids long. Finally, to expand the model input space to even larger proteins and multi-protein assemblies, we pre-train Longformer models that support inputs of up to 2,048 tokens. Further work on token-level classification for secondary structure prediction is needed. Code available at: https://github.com/PaccMann/paccmann_proteomics
- Modestas Filipavicius (4 papers)
- Matteo Manica (28 papers)
- Joris Cadow (5 papers)
- Maria Rodriguez Martinez (24 papers)
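The abstract describes three concrete mechanisms: label-agnostic paired-sequence MLM pre-training, BPE subword compression, and 2,048-token Longformer inputs. The sketches below illustrate each one with Hugging Face tooling; the tokenizer path, corpus file name, masking rate, and attention-window size are illustrative assumptions, not settings confirmed by the paper.

A minimal sketch of the paired-sequence MLM setup: two proteins that interact in STRING are packed into one example as a RoBERTa sequence pair, but no binding label is attached, so learning comes only from reconstructing masked tokens.

```python
# Minimal sketch: label-agnostic paired-sequence input for RoBERTa MLM pre-training.
from transformers import RobertaTokenizerFast, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizerFast.from_pretrained("path/to/protein-bpe-tokenizer")  # hypothetical path

encoded = tokenizer(
    "MKTAYIAKQRQISFVKSHFSRQ",  # toy protein A
    "MSDNGPQNQRNAPRITFGGP",    # toy protein B; encoded as a RoBERTa sequence pair
    truncation=True,
    max_length=512,
)

# Standard MLM collator: a fraction of tokens is masked and becomes the target;
# binding status is never supplied as a label.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([encoded])
print(batch["input_ids"].shape, batch["labels"].shape)
```

A 10K-subword BPE vocabulary with roughly 3-4 amino acids per subword yields the reported ~64% sequence compression. A sketch of training such a tokenizer, assuming a corpus file with one sequence per line (byte-level BPE is one plausible choice, not necessarily the paper's exact variant):

```python
# Minimal sketch: training a 10K-subword BPE vocabulary on raw protein sequences.
from tokenizers import ByteLevelBPETokenizer

bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["proteins.txt"],  # assumed: one amino-acid sequence per line
    vocab_size=10_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# A 33-residue sequence should encode to roughly a third as many subword tokens.
out = bpe.encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(out.tokens), "subword tokens for a 33-residue sequence")
```

For larger proteins and multi-protein assemblies, a Longformer with 2,048-position embeddings replaces full self-attention with a sliding window, keeping memory roughly linear in sequence length. The window size below is an assumed default, not a value from the paper.

```python
# Minimal sketch: a randomly initialized Longformer MLM supporting 2,048-token inputs.
from transformers import LongformerConfig, LongformerForMaskedLM

config = LongformerConfig(
    vocab_size=10_000,                 # matches the 10K BPE vocabulary
    max_position_embeddings=2048 + 2,  # +2 for the RoBERTa-style position-id offset
    attention_window=512,              # assumed sliding-window size
)
model = LongformerForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```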